qemu-devel.nongnu.org archive mirror
* [RFC PATCH 00/27] vDPA software assisted live migration
@ 2020-11-20 18:50 Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 01/27] vhost: Add vhost_dev_can_log Eugenio Pérez
                   ` (31 more replies)
  0 siblings, 32 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

This series enables vDPA software assisted live migration for vhost-net
devices. This is a new method of migrating vhost devices: instead of
relying on the vDPA device's dirty logging capability, SW assisted LM
intercepts the dataplane, forwarding the descriptors between VM and device.

In this migration mode, qemu offers a new vring to the device to
read from and write into, and disables vhost notifiers, processing guest
and vhost notifications in qemu. On used buffer relay, qemu marks the
dirty memory as it does with plain virtio-net devices. This way, devices
do not need to have dirty page logging capability.
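
As a rough illustration of the relay described above, here is a toy
model in C. It is only a sketch under simplifying assumptions: the
Toy* types and toy_* functions are hypothetical names that do not
appear in the series, the rings are plain FIFOs rather than real split
vrings, and the "guest memory" is just an array of per-page dirty flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE   8u   /* toy ring size; must divide 65536 */
#define TOY_PAGE_SHIFT  12u  /* 4 KiB pages */
#define TOY_GUEST_PAGES 16u  /* size of the toy guest address space */

/* One buffer: guest address, length, and whether the device writes to it. */
typedef struct {
    uint64_t addr;
    uint32_t len;
    bool device_writes;
} ToyDesc;

/* Minimal FIFO standing in for a vring; real split vrings are more involved. */
typedef struct {
    ToyDesc slots[TOY_RING_SIZE];
    uint16_t head; /* next slot to fill */
    uint16_t tail; /* next slot to consume */
} ToyRing;

typedef struct {
    ToyRing guest_avail;            /* buffers the guest made available */
    ToyRing shadow_avail;           /* what qemu exposes to the device */
    ToyRing shadow_used;            /* buffers the device has finished with */
    ToyRing guest_used;             /* what the guest sees as used */
    uint8_t dirty[TOY_GUEST_PAGES]; /* one dirty flag per guest page */
} ToyShadowVq;

bool toy_ring_full(const ToyRing *r)
{
    return (uint16_t)(r->head - r->tail) == TOY_RING_SIZE;
}

bool toy_ring_push(ToyRing *r, ToyDesc d)
{
    if (toy_ring_full(r)) {
        return false;
    }
    r->slots[r->head % TOY_RING_SIZE] = d;
    r->head++;
    return true;
}

bool toy_ring_pop(ToyRing *r, ToyDesc *out)
{
    if (r->head == r->tail) {
        return false; /* ring is empty */
    }
    *out = r->slots[r->tail % TOY_RING_SIZE];
    r->tail++;
    return true;
}

/* Guest -> device: forward available buffers into the shadow ring. */
unsigned toy_relay_avail(ToyShadowVq *svq)
{
    unsigned n = 0;
    ToyDesc d;

    while (!toy_ring_full(&svq->shadow_avail) &&
           toy_ring_pop(&svq->guest_avail, &d)) {
        toy_ring_push(&svq->shadow_avail, d);
        n++;
    }
    return n;
}

/* Device -> guest: mark device-written pages dirty, then return the buffer. */
unsigned toy_relay_used(ToyShadowVq *svq)
{
    unsigned n = 0;
    ToyDesc d;

    while (!toy_ring_full(&svq->guest_used) &&
           toy_ring_pop(&svq->shadow_used, &d)) {
        if (d.device_writes && d.len > 0) {
            uint64_t first = d.addr >> TOY_PAGE_SHIFT;
            uint64_t last = (d.addr + d.len - 1) >> TOY_PAGE_SHIFT;

            for (uint64_t p = first; p <= last && p < TOY_GUEST_PAGES; p++) {
                svq->dirty[p] = 1;
            }
        }
        toy_ring_push(&svq->guest_used, d);
        n++;
    }
    return n;
}
```

The point of the sketch is that the dirty marking happens in the used
path, in software, so the device itself never has to log anything.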

This series is a POC doing SW LM for vhost-net devices, which already
have dirty page logging capabilities. None of the changes have any actual
effect with current devices until the last two patches (26 and 27) are
applied, but they can be rebased on top of any other. These check that
the device meets all requirements, and disable vhost-net device logging
so migration goes through SW LM. The last patch is not meant to be
applied in the final revision; it is in the series just for testing
purposes.

To use SW assisted LM, these vhost-net devices need to be instantiated:
* With IOMMU (iommu_platform=on,ats=on)
* Without event_idx (event_idx=off)
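
For illustration only, the two requirements above can be pictured as a
check on the negotiated feature bits. This is a sketch, not the check
the series implements: toy_features_allow_sw_lm is a hypothetical name,
and it assumes the requirements map to VIRTIO_F_IOMMU_PLATFORM (bit 33)
and VIRTIO_RING_F_EVENT_IDX (bit 29). Note that ats=on is a PCI device
property and is not visible as a virtio feature bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as defined by the virtio specification. */
#define TOY_VIRTIO_RING_F_EVENT_IDX 29
#define TOY_VIRTIO_F_IOMMU_PLATFORM 33

/* Hypothetical helper: true if the feature set matches the list above. */
bool toy_features_allow_sw_lm(uint64_t features)
{
    bool has_iommu = features & (1ULL << TOY_VIRTIO_F_IOMMU_PLATFORM);
    bool has_event_idx = features & (1ULL << TOY_VIRTIO_RING_F_EVENT_IDX);

    return has_iommu && !has_event_idx;
}
```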

Just the notification forwarding (with no descriptor relay) can be
achieved by applying patches 7 and 9 and starting migration. Partially
applying patches 13 to 24 will not work while migrating on the source,
and patch 25 is needed for the destination to resume network activity.

It is based on the ideas of DPDK SW assisted LM, from the DPDK series
at https://patchwork.dpdk.org/cover/48370/ .

Comments are welcome.

Thanks!

Eugenio Pérez (27):
  vhost: Add vhost_dev_can_log
  vhost: Add device callback in vhost_migration_log
  vhost: Move log resize/put to vhost_dev_set_log
  vhost: add vhost_kernel_set_vring_enable
  vhost: Add hdev->dev.sw_lm_vq_handler
  virtio: Add virtio_queue_get_used_notify_split
  vhost: Route guest->host notification through qemu
  vhost: Add a flag for software assisted Live Migration
  vhost: Route host->guest notification through qemu
  vhost: Allocate shadow vring
  virtio: const-ify all virtio_tswap* functions
  virtio: Add virtio_queue_full
  vhost: Send buffers to device
  virtio: Remove virtio_queue_get_used_notify_split
  vhost: Do not invalidate signalled used
  virtio: Expose virtqueue_alloc_element
  vhost: add vhost_vring_set_notification_rcu
  vhost: add vhost_vring_poll_rcu
  vhost: add vhost_vring_get_buf_rcu
  vhost: Return used buffers
  vhost: Add vhost_virtqueue_memory_unmap
  vhost: Add vhost_virtqueue_memory_map
  vhost: unmap qemu's shadow virtqueues on sw live migration
  vhost: iommu changes
  vhost: Do not commit vhost used idx on vhost_virtqueue_stop
  vhost: Add vhost_hdev_can_sw_lm
  vhost: forbid vhost devices logging

 hw/virtio/vhost-sw-lm-ring.h      |  39 +++
 include/hw/virtio/vhost.h         |   5 +
 include/hw/virtio/virtio-access.h |   8 +-
 include/hw/virtio/virtio.h        |   4 +
 hw/net/virtio-net.c               |  39 ++-
 hw/virtio/vhost-backend.c         |  29 ++
 hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
 hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
 hw/virtio/virtio.c                |  18 +-
 hw/virtio/meson.build             |   2 +-
 10 files changed, 758 insertions(+), 85 deletions(-)
 create mode 100644 hw/virtio/vhost-sw-lm-ring.h
 create mode 100644 hw/virtio/vhost-sw-lm-ring.c

-- 
2.18.4




* [RFC PATCH 01/27] vhost: Add vhost_dev_can_log
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log Eugenio Pérez
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

Just syntactic sugar.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 614ccc2bcb..2bd8cdf893 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -61,6 +61,11 @@ bool vhost_has_free_slot(void)
     return slots_limit > used_memslots;
 }
 
+static bool vhost_dev_can_log(const struct vhost_dev *hdev)
+{
+    return hdev->features & (0x1ULL << VHOST_F_LOG_ALL);
+}
+
 static void vhost_dev_sync_region(struct vhost_dev *dev,
                                   MemoryRegionSection *section,
                                   uint64_t mfirst, uint64_t mlast,
@@ -1347,7 +1352,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     };
 
     if (hdev->migration_blocker == NULL) {
-        if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
+        if (!vhost_dev_can_log(hdev)) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.");
         } else if (vhost_dev_log_is_shared(hdev) && !qemu_memfd_alloc_check()) {
-- 
2.18.4




* [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 01/27] vhost: Add vhost_dev_can_log Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-07 16:19   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 03/27] vhost: Move log resize/put to vhost_dev_set_log Eugenio Pérez
                   ` (29 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

This allows callers to reuse the logic that avoids re-enabling or
re-disabling migration mechanisms. The code works the same way as before.
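
A minimal sketch of the callback pattern, using hypothetical toy types
rather than the real struct vhost_dev and MemoryListener. It is
simplified: here the state flag is only updated when the callback
succeeds, which is an assumption of the sketch and not necessarily what
vhost_migration_log does on error.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct vhost_dev. */
typedef struct ToyDev {
    bool log_enabled;    /* current migration-tracking state */
    int dirty_log_calls; /* times the dirty-log callback ran */
    int sw_lm_calls;     /* times the SW LM callback ran */
} ToyDev;

typedef int (*ToyMigrationCb)(ToyDev *dev, bool enable);

/* Stand-in for the existing dirty-log mechanism. */
int toy_set_log(ToyDev *dev, bool enable)
{
    (void)enable;
    dev->dirty_log_calls++;
    return 0;
}

/* Stand-in for an alternative mechanism sharing the same guard logic. */
int toy_set_sw_lm(ToyDev *dev, bool enable)
{
    (void)enable;
    dev->sw_lm_calls++;
    return 0;
}

/* Shared guard: do nothing if the requested state is already active. */
int toy_migration_log(ToyDev *dev, bool enable, ToyMigrationCb device_cb)
{
    int r;

    if (enable == dev->log_enabled) {
        return 0;
    }

    r = device_cb(dev, enable);
    if (r < 0) {
        return r;
    }

    dev->log_enabled = enable;
    return 0;
}
```

Both mechanisms go through the same guard, so neither needs its own
"already enabled" bookkeeping.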

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2bd8cdf893..2adb2718c1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -862,7 +862,9 @@ err_features:
     return r;
 }
 
-static int vhost_migration_log(MemoryListener *listener, bool enable)
+static int vhost_migration_log(MemoryListener *listener,
+                               bool enable,
+                               int (*device_cb)(struct vhost_dev *, bool))
 {
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
                                          memory_listener);
@@ -877,14 +879,14 @@ static int vhost_migration_log(MemoryListener *listener, bool enable)
 
     r = 0;
     if (!enable) {
-        r = vhost_dev_set_log(dev, false);
+        r = device_cb(dev, false);
         if (r < 0) {
             goto check_dev_state;
         }
         vhost_log_put(dev, false);
     } else {
         vhost_dev_log_resize(dev, vhost_get_log_size(dev));
-        r = vhost_dev_set_log(dev, true);
+        r = device_cb(dev, true);
         if (r < 0) {
             goto check_dev_state;
         }
@@ -916,7 +918,7 @@ static void vhost_log_global_start(MemoryListener *listener)
 {
     int r;
 
-    r = vhost_migration_log(listener, true);
+    r = vhost_migration_log(listener, true, vhost_dev_set_log);
     if (r < 0) {
         abort();
     }
@@ -926,7 +928,7 @@ static void vhost_log_global_stop(MemoryListener *listener)
 {
     int r;
 
-    r = vhost_migration_log(listener, false);
+    r = vhost_migration_log(listener, false, vhost_dev_set_log);
     if (r < 0) {
         abort();
     }
-- 
2.18.4




* [RFC PATCH 03/27] vhost: Move log resize/put to vhost_dev_set_log
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 01/27] vhost: Add vhost_dev_can_log Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

Software assisted live migration does not allocate a vhost log.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2adb2718c1..9cbd52a7f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -828,6 +828,10 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
     int r, i, idx;
     hwaddr addr;
 
+    if (enable_log) {
+        vhost_dev_log_resize(dev, vhost_get_log_size(dev));
+    }
+
     r = vhost_dev_set_features(dev, enable_log);
     if (r < 0) {
         goto err_features;
@@ -850,6 +854,10 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
             goto err_vq;
         }
     }
+
+    if (!enable_log) {
+        vhost_log_put(dev, false);
+    }
     return 0;
 err_vq:
     for (; i >= 0; --i) {
@@ -877,22 +885,8 @@ static int vhost_migration_log(MemoryListener *listener,
         return 0;
     }
 
-    r = 0;
-    if (!enable) {
-        r = device_cb(dev, false);
-        if (r < 0) {
-            goto check_dev_state;
-        }
-        vhost_log_put(dev, false);
-    } else {
-        vhost_dev_log_resize(dev, vhost_get_log_size(dev));
-        r = device_cb(dev, true);
-        if (r < 0) {
-            goto check_dev_state;
-        }
-    }
+    r = device_cb(dev, enable);
 
-check_dev_state:
     dev->log_enabled = enable;
     /*
      * vhost-user-* devices could change their state during log
-- 
2.18.4




* [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (2 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 03/27] vhost: Move log resize/put to vhost_dev_set_log Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-07 16:43   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler Eugenio Pérez
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 222bbcc62d..317f1f96fa 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
     return idx - dev->vq_index;
 }
 
+static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
+                                      bool enable)
+{
+    struct vhost_vring_file file = {
+        .index = idx,
+    };
+
+    if (!enable) {
+        file.fd = -1; /* Pass -1 to unbind from file. */
+    } else {
+        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
+        file.fd = vn_dev->backend;
+    }
+
+    return vhost_kernel_net_set_backend(dev, &file);
+}
+
+static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+    int i;
+
+    for (i = 0; i < dev->nvqs; ++i) {
+        vhost_kernel_set_vq_enable(dev, i, enable);
+    }
+
+    return 0;
+}
+
 #ifdef CONFIG_VHOST_VSOCK
 static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
                                             uint64_t guest_cid)
@@ -317,6 +345,7 @@ static const VhostOps kernel_ops = {
         .vhost_set_owner = vhost_kernel_set_owner,
         .vhost_reset_device = vhost_kernel_reset_device,
         .vhost_get_vq_index = vhost_kernel_get_vq_index,
+        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
 #ifdef CONFIG_VHOST_VSOCK
         .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
         .vhost_vsock_set_running = vhost_kernel_vsock_set_running,
-- 
2.18.4




* [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (3 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-07 16:52   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split Eugenio Pérez
                   ` (26 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

Only virtio-net honors it.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h |  1 +
 hw/net/virtio-net.c       | 39 ++++++++++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 4a8bc75415..b5b7496537 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -83,6 +83,7 @@ struct vhost_dev {
     bool started;
     bool log_enabled;
     uint64_t log_size;
+    VirtIOHandleOutput sw_lm_vq_handler;
     Error *migration_blocker;
     const VhostOps *vhost_ops;
     void *opaque;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9179013ac4..9a69ae3598 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2628,24 +2628,32 @@ static void virtio_net_tx_bh(void *opaque)
     }
 }
 
-static void virtio_net_add_queue(VirtIONet *n, int index)
+static void virtio_net_add_queue(VirtIONet *n, int index,
+                                 VirtIOHandleOutput custom_handler)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
+    VirtIOHandleOutput rx_vq_handler = virtio_net_handle_rx;
+    VirtIOHandleOutput tx_vq_handler;
+    bool tx_timer = n->net_conf.tx && !strcmp(n->net_conf.tx, "timer");
+
+    if (custom_handler) {
+        rx_vq_handler = tx_vq_handler = custom_handler;
+    } else if (tx_timer) {
+        tx_vq_handler = virtio_net_handle_tx_timer;
+    } else {
+        tx_vq_handler = virtio_net_handle_tx_bh;
+    }
 
     n->vqs[index].rx_vq = virtio_add_queue(vdev, n->net_conf.rx_queue_size,
-                                           virtio_net_handle_rx);
+                                           rx_vq_handler);
+    n->vqs[index].tx_vq = virtio_add_queue(vdev, n->net_conf.tx_queue_size,
+                                           tx_vq_handler);
 
-    if (n->net_conf.tx && !strcmp(n->net_conf.tx, "timer")) {
-        n->vqs[index].tx_vq =
-            virtio_add_queue(vdev, n->net_conf.tx_queue_size,
-                             virtio_net_handle_tx_timer);
+    if (tx_timer) {
         n->vqs[index].tx_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                               virtio_net_tx_timer,
                                               &n->vqs[index]);
     } else {
-        n->vqs[index].tx_vq =
-            virtio_add_queue(vdev, n->net_conf.tx_queue_size,
-                             virtio_net_handle_tx_bh);
         n->vqs[index].tx_bh = qemu_bh_new(virtio_net_tx_bh, &n->vqs[index]);
     }
 
@@ -2677,6 +2685,10 @@ static void virtio_net_del_queue(VirtIONet *n, int index)
 static void virtio_net_change_num_queues(VirtIONet *n, int new_max_queues)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
+    NetClientState *nc = n->nic->conf->peers.ncs[0];
+    struct vhost_net *hdev = get_vhost_net(nc);
+    VirtIOHandleOutput custom_handler = hdev ? hdev->dev.sw_lm_vq_handler
+                                             : NULL;
     int old_num_queues = virtio_get_num_queues(vdev);
     int new_num_queues = new_max_queues * 2 + 1;
     int i;
@@ -2702,7 +2714,7 @@ static void virtio_net_change_num_queues(VirtIONet *n, int new_max_queues)
 
     for (i = old_num_queues - 1; i < new_num_queues - 1; i += 2) {
         /* new_num_queues > old_num_queues */
-        virtio_net_add_queue(n, i / 2);
+        virtio_net_add_queue(n, i / 2, custom_handler);
     }
 
     /* add ctrl_vq last */
@@ -3256,6 +3268,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIONet *n = VIRTIO_NET(dev);
     NetClientState *nc;
+    struct vhost_net *hdev;
+    VirtIOHandleOutput custom_handler;
     int i;
 
     if (n->net_conf.mtu) {
@@ -3347,8 +3361,11 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     n->net_conf.tx_queue_size = MIN(virtio_net_max_tx_queue_size(n),
                                     n->net_conf.tx_queue_size);
 
+    nc = n->nic_conf.peers.ncs[0];
+    hdev = get_vhost_net(nc);
+    custom_handler = hdev ? hdev->dev.sw_lm_vq_handler : NULL;
     for (i = 0; i < n->max_queues; i++) {
-        virtio_net_add_queue(n, i);
+        virtio_net_add_queue(n, i, custom_handler);
     }
 
     n->ctrl_vq = virtio_add_queue(vdev, 64, virtio_net_handle_ctrl);
-- 
2.18.4




* [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (4 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-07 16:58   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 07/27] vhost: Route guest->host notification through qemu Eugenio Pérez
                   ` (25 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

This function is only used for a few commits, so that SW LM can be
developed incrementally; it is deleted once it is no longer useful.

For a few commits, only the events (irqfd, eventfd) are forwarded.
This series adds descriptor forwarding on top of that.
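
The idea behind the check can be sketched in isolation as follows. The
toy_* names are hypothetical, and the sketch ignores the endianness
conversion and cached guest-memory access that the real helper needs.
The flag value matches VRING_USED_F_NO_NOTIFY from the virtio
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as VRING_USED_F_NO_NOTIFY in the virtio specification. */
#define TOY_VRING_USED_F_NO_NOTIFY 1u

/* Leading fields of a split-ring used area, in host byte order. */
typedef struct {
    uint16_t flags;
    uint16_t idx;
} ToyUsedHeader;

/* True unless the device has asked not to be notified. */
bool toy_device_wants_kick(const ToyUsedHeader *used)
{
    return !(used->flags & TOY_VRING_USED_F_NO_NOTIFY);
}

/* Forward a guest kick only when the device wants it; count those sent. */
bool toy_forward_kick(const ToyUsedHeader *used, unsigned *kicks_sent)
{
    if (!toy_device_wants_kick(used)) {
        return false; /* notification suppressed, nothing forwarded */
    }
    (*kicks_sent)++;
    return true;
}
```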

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h |  1 +
 hw/virtio/virtio.c         | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b7ece7a6a8..b9b8497ea0 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -225,6 +225,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id);
 
 void virtio_notify_config(VirtIODevice *vdev);
 
+bool virtio_queue_get_used_notify_split(VirtQueue *vq);
 bool virtio_queue_get_notification(VirtQueue *vq);
 void virtio_queue_set_notification(VirtQueue *vq, int enable);
 
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ceb58fda6c..3469946538 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -377,6 +377,20 @@ static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
     vq->used_idx = val;
 }
 
+bool virtio_queue_get_used_notify_split(VirtQueue *vq)
+{
+    VRingMemoryRegionCaches *caches;
+    hwaddr pa = offsetof(VRingUsed, flags);
+    uint16_t flags;
+
+    RCU_READ_LOCK_GUARD();
+
+    caches = vring_get_region_caches(vq);
+    assert(caches);
+    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
+    return !(VRING_USED_F_NO_NOTIFY & flags);
+}
+
 /* Called within rcu_read_lock().  */
 static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask)
 {
-- 
2.18.4




* [RFC PATCH 07/27] vhost: Route guest->host notification through qemu
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (5 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-07 17:42   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration Eugenio Pérez
                   ` (24 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |  26 +++++++++
 include/hw/virtio/vhost.h    |   3 ++
 hw/virtio/vhost-sw-lm-ring.c |  60 +++++++++++++++++++++
 hw/virtio/vhost.c            | 100 +++++++++++++++++++++++++++++++++--
 hw/virtio/meson.build        |   2 +-
 5 files changed, 187 insertions(+), 4 deletions(-)
 create mode 100644 hw/virtio/vhost-sw-lm-ring.h
 create mode 100644 hw/virtio/vhost-sw-lm-ring.c

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
new file mode 100644
index 0000000000..86dc081b93
--- /dev/null
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -0,0 +1,26 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2020
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef VHOST_SW_LM_RING_H
+#define VHOST_SW_LM_RING_H
+
+#include "qemu/osdep.h"
+
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/vhost.h"
+
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
+bool vhost_vring_kick(VhostShadowVirtqueue *vq);
+
+VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx);
+
+void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq);
+
+#endif
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index b5b7496537..93cc3f1ae3 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -54,6 +54,8 @@ struct vhost_iommu {
     QLIST_ENTRY(vhost_iommu) iommu_next;
 };
 
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
 typedef struct VhostDevConfigOps {
     /* Vhost device config space changed callback
      */
@@ -83,6 +85,7 @@ struct vhost_dev {
     bool started;
     bool log_enabled;
     uint64_t log_size;
+    VhostShadowVirtqueue *sw_lm_shadow_vq[2];
     VirtIOHandleOutput sw_lm_vq_handler;
     Error *migration_blocker;
     const VhostOps *vhost_ops;
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
new file mode 100644
index 0000000000..0192e77831
--- /dev/null
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -0,0 +1,60 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2020
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "hw/virtio/vhost-sw-lm-ring.h"
+#include "hw/virtio/vhost.h"
+
+#include "standard-headers/linux/vhost_types.h"
+#include "standard-headers/linux/virtio_ring.h"
+
+#include "qemu/event_notifier.h"
+
+typedef struct VhostShadowVirtqueue {
+    EventNotifier hdev_notifier;
+    VirtQueue *vq;
+} VhostShadowVirtqueue;
+
+static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
+{
+    return virtio_queue_get_used_notify_split(vq->vq);
+}
+
+bool vhost_vring_kick(VhostShadowVirtqueue *vq)
+{
+    return vhost_vring_should_kick(vq) ? event_notifier_set(&vq->hdev_notifier)
+                                       : true;
+}
+
+VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
+{
+    struct vhost_vring_file file = {
+        .index = idx
+    };
+    VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
+    VhostShadowVirtqueue *svq;
+    int r;
+
+    svq = g_new0(VhostShadowVirtqueue, 1);
+    svq->vq = vq;
+
+    r = event_notifier_init(&svq->hdev_notifier, 0);
+    assert(r == 0);
+
+    file.fd = event_notifier_get_fd(&svq->hdev_notifier);
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    assert(r == 0);
+
+    return svq;
+}
+
+void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq)
+{
+    event_notifier_cleanup(&vq->hdev_notifier);
+    g_free(vq);
+}
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 9cbd52a7f1..a55b684b5f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -13,6 +13,8 @@
  * GNU GPL, version 2 or (at your option) any later version.
  */
 
+#include "hw/virtio/vhost-sw-lm-ring.h"
+
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "hw/virtio/vhost.h"
@@ -61,6 +63,20 @@ bool vhost_has_free_slot(void)
     return slots_limit > used_memslots;
 }
 
+static struct vhost_dev *vhost_dev_from_virtio(const VirtIODevice *vdev)
+{
+    struct vhost_dev *hdev;
+
+    QLIST_FOREACH(hdev, &vhost_devices, entry) {
+        if (hdev->vdev == vdev) {
+            return hdev;
+        }
+    }
+
+    assert(hdev);
+    return NULL;
+}
+
 static bool vhost_dev_can_log(const struct vhost_dev *hdev)
 {
     return hdev->features & (0x1ULL << VHOST_F_LOG_ALL);
@@ -148,6 +164,12 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
     return 0;
 }
 
+static void vhost_log_sync_nop(MemoryListener *listener,
+                               MemoryRegionSection *section)
+{
+    return;
+}
+
 static void vhost_log_sync(MemoryListener *listener,
                           MemoryRegionSection *section)
 {
@@ -928,6 +950,71 @@ static void vhost_log_global_stop(MemoryListener *listener)
     }
 }
 
+static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
+{
+    struct vhost_dev *hdev = vhost_dev_from_virtio(vdev);
+    uint16_t idx = virtio_get_queue_index(vq);
+
+    VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
+
+    vhost_vring_kick(svq);
+}
+
+static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
+{
+    int idx;
+
+    vhost_dev_enable_notifiers(dev, dev->vdev);
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
+    }
+
+    return 0;
+}
+
+static int vhost_sw_live_migration_start(struct vhost_dev *dev)
+{
+    int idx;
+
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
+    }
+
+    vhost_dev_disable_notifiers(dev, dev->vdev);
+
+    return 0;
+}
+
+static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
+                                          bool enable_lm)
+{
+    if (enable_lm) {
+        return vhost_sw_live_migration_start(dev);
+    } else {
+        return vhost_sw_live_migration_stop(dev);
+    }
+}
+
+static void vhost_sw_lm_global_start(MemoryListener *listener)
+{
+    int r;
+
+    r = vhost_migration_log(listener, true, vhost_sw_live_migration_enable);
+    if (r < 0) {
+        abort();
+    }
+}
+
+static void vhost_sw_lm_global_stop(MemoryListener *listener)
+{
+    int r;
+
+    r = vhost_migration_log(listener, false, vhost_sw_live_migration_enable);
+    if (r < 0) {
+        abort();
+    }
+}
+
 static void vhost_log_start(MemoryListener *listener,
                             MemoryRegionSection *section,
                             int old, int new)
@@ -1334,9 +1421,14 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         .region_nop = vhost_region_addnop,
         .log_start = vhost_log_start,
         .log_stop = vhost_log_stop,
-        .log_sync = vhost_log_sync,
-        .log_global_start = vhost_log_global_start,
-        .log_global_stop = vhost_log_global_stop,
+        .log_sync = !vhost_dev_can_log(hdev) ?
+                    vhost_log_sync_nop :
+                    vhost_log_sync,
+        .log_global_start = !vhost_dev_can_log(hdev) ?
+                            vhost_sw_lm_global_start :
+                            vhost_log_global_start,
+        .log_global_stop = !vhost_dev_can_log(hdev) ? vhost_sw_lm_global_stop :
+                                                      vhost_log_global_stop,
         .eventfd_add = vhost_eventfd_add,
         .eventfd_del = vhost_eventfd_del,
         .priority = 10
@@ -1364,6 +1456,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
             error_free(hdev->migration_blocker);
             goto fail_busyloop;
         }
+    } else {
+        hdev->sw_lm_vq_handler = handle_sw_lm_vq;
     }
 
     hdev->mem = g_malloc0(offsetof(struct vhost_memory, regions));
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..17419cb13e 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-sw-lm-ring.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.18.4




* [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (6 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 07/27] vhost: Route guest->host notification through qemu Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  7:20   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 09/27] vhost: Route host->guest notification through qemu Eugenio Pérez
                   ` (23 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h |  1 +
 hw/virtio/vhost.c         | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 93cc3f1ae3..ef920a8076 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -84,6 +84,7 @@ struct vhost_dev {
     uint64_t backend_cap;
     bool started;
     bool log_enabled;
+    bool sw_lm_enabled;
     uint64_t log_size;
     VhostShadowVirtqueue *sw_lm_shadow_vq[2];
     VirtIOHandleOutput sw_lm_vq_handler;
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index a55b684b5f..1d55e26d45 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -988,11 +988,16 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
                                           bool enable_lm)
 {
+    int r;
+
     if (enable_lm) {
-        return vhost_sw_live_migration_start(dev);
+        r = vhost_sw_live_migration_start(dev);
     } else {
-        return vhost_sw_live_migration_stop(dev);
+        r = vhost_sw_live_migration_stop(dev);
     }
+
+    dev->sw_lm_enabled = enable_lm;
+    return r;
 }
 
 static void vhost_sw_lm_global_start(MemoryListener *listener)
@@ -1466,6 +1471,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->log = NULL;
     hdev->log_size = 0;
     hdev->log_enabled = false;
+    hdev->sw_lm_enabled = false;
     hdev->started = false;
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
@@ -1571,6 +1577,13 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     int i, r;
 
+    if (hdev->sw_lm_enabled) {
+        /* We've been called after migration is completed, so no need to
+           disable it again
+        */
+        return;
+    }
+
     for (i = 0; i < hdev->nvqs; ++i) {
         r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
                                          false);
-- 
2.18.4




* [RFC PATCH 09/27] vhost: Route host->guest notification through qemu
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (7 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  7:34   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 10/27] vhost: Allocate shadow vring Eugenio Pérez
                   ` (22 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.c |  3 +++
 hw/virtio/vhost.c            | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index 0192e77831..cbf53965cd 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -50,6 +50,9 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
     r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
     assert(r == 0);
 
+    vhost_virtqueue_mask(dev, dev->vdev, idx, true);
+    vhost_virtqueue_pending(dev, idx);
+
     return svq;
 }
 
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 1d55e26d45..9352c56bfa 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -960,12 +960,29 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
     vhost_vring_kick(svq);
 }
 
+static void vhost_handle_call(EventNotifier *n)
+{
+    struct vhost_virtqueue *hvq = container_of(n,
+                                              struct vhost_virtqueue,
+                                              masked_notifier);
+    struct vhost_dev *vdev = hvq->dev;
+    int idx = vdev->vq_index + (hvq == &vdev->vqs[0] ? 0 : 1);
+    VirtQueue *vq = virtio_get_queue(vdev->vdev, idx);
+
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_invalidate_signalled_used(vdev->vdev, idx);
+        virtio_notify_irqfd(vdev->vdev, vq);
+    }
+}
+
 static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 {
     int idx;
 
     vhost_dev_enable_notifiers(dev, dev->vdev);
     for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_virtqueue_mask(dev, dev->vdev, idx, false);
+        vhost_virtqueue_pending(dev, idx);
         vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
     }
 
@@ -977,7 +994,10 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
     int idx;
 
     for (idx = 0; idx < dev->nvqs; ++idx) {
+        struct vhost_virtqueue *vq = &dev->vqs[idx];
+
         dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
+        event_notifier_set_handler(&vq->masked_notifier, vhost_handle_call);
     }
 
     vhost_dev_disable_notifiers(dev, dev->vdev);
-- 
2.18.4
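
A note on the call handler above: vhost_handle_call() recovers the
vhost_virtqueue from the address of its embedded masked_notifier with
container_of(), then picks the queue index by comparing against vqs[0].
Below is a minimal standalone sketch of that pattern, not QEMU code. All
demo_* names are hypothetical stand-ins, and the index is derived by
pointer subtraction, which is an assumption about how the lookup could be
generalized beyond two queues.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as QEMU's container_of(): step back from a member to its parent. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for EventNotifier. */
struct demo_notifier {
    int fd;
};

/* Hypothetical stand-in for struct vhost_virtqueue. */
struct demo_virtqueue {
    int cookie;
    struct demo_notifier masked_notifier;
};

/* Recover the enclosing virtqueue from a pointer to its notifier. */
static struct demo_virtqueue *demo_vq_from_notifier(struct demo_notifier *n)
{
    return demo_container_of(n, struct demo_virtqueue, masked_notifier);
}

/* Index of a virtqueue inside the array that starts at base. */
static int demo_vq_index(const struct demo_virtqueue *base,
                         const struct demo_virtqueue *vq)
{
    return (int)(vq - base);
}
```

With an array of queues, demo_vq_index(vqs, demo_vq_from_notifier(n))
yields the array position for any number of queues, whereas the ternary
in the patch only distinguishes queue 0 from queue 1.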




* [RFC PATCH 10/27] vhost: Allocate shadow vring
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (8 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 09/27] vhost: Route host->guest notification through qemu Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  7:49   ` Stefan Hajnoczi
  2020-12-08  8:17   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 11/27] virtio: const-ify all virtio_tswap* functions Eugenio Pérez
                   ` (21 subsequent siblings)
  31 siblings, 2 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index cbf53965cd..cd7b5ba772 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -16,8 +16,11 @@
 #include "qemu/event_notifier.h"
 
 typedef struct VhostShadowVirtqueue {
+    struct vring vring;
     EventNotifier hdev_notifier;
     VirtQueue *vq;
+
+    vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
 static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
@@ -37,10 +40,12 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
         .index = idx
     };
     VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
+    unsigned num = virtio_queue_get_num(dev->vdev, idx);
+    size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
     VhostShadowVirtqueue *svq;
     int r;
 
-    svq = g_new0(VhostShadowVirtqueue, 1);
+    svq = g_malloc0(sizeof(*svq) + ring_size);
     svq->vq = vq;
 
     r = event_notifier_init(&svq->hdev_notifier, 0);
-- 
2.18.4
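
For readers unfamiliar with the allocation pattern in this patch: the
shadow virtqueue ends in a flexible array member (descs[]), so the header
and the ring storage come from a single zeroed allocation of
sizeof(*svq) + ring_size. The sketch below shows the same idiom in plain
C. It is illustrative only: the demo_* types are hypothetical, calloc()
stands in for g_malloc0(), and it sizes the descriptor table alone,
whereas the patch uses vring_size() to cover the whole vring.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for vring_desc_t (split-ring descriptor layout). */
struct demo_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical stand-in for VhostShadowVirtqueue. */
struct demo_shadow_vq {
    unsigned num;
    struct demo_desc descs[];   /* flexible array member, must come last */
};

/* Allocate the header and num descriptors as one zero-filled block. */
static struct demo_shadow_vq *demo_shadow_vq_new(unsigned num)
{
    struct demo_shadow_vq *svq;

    svq = calloc(1, sizeof(*svq) + num * sizeof(svq->descs[0]));
    if (svq) {
        svq->num = num;
    }
    return svq;
}
```

A single free() releases both the header and the ring, which matches the
single g_free(vq) in vhost_sw_lm_shadow_vq_free().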




* [RFC PATCH 11/27] virtio: const-ify all virtio_tswap* functions
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 10/27] vhost: Allocate shadow vring Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 12/27] virtio: Add virtio_queue_full Eugenio Pérez
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

They do not modify vdev, so they should take a const pointer, as the
QEMU coding style guideline recommends.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio-access.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 6818a23a2d..7474f89b5f 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -24,7 +24,7 @@
 #define LEGACY_VIRTIO_IS_BIENDIAN 1
 #endif
 
-static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
+static inline bool virtio_access_is_big_endian(const VirtIODevice *vdev)
 {
 #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
     return virtio_is_big_endian(vdev);
@@ -147,7 +147,7 @@ static inline uint64_t virtio_ldq_p(VirtIODevice *vdev, const void *ptr)
     }
 }
 
-static inline uint16_t virtio_tswap16(VirtIODevice *vdev, uint16_t s)
+static inline uint16_t virtio_tswap16(const VirtIODevice *vdev, uint16_t s)
 {
 #ifdef HOST_WORDS_BIGENDIAN
     return virtio_access_is_big_endian(vdev) ? s : bswap16(s);
@@ -213,7 +213,7 @@ static inline void virtio_tswap16s(VirtIODevice *vdev, uint16_t *s)
     *s = virtio_tswap16(vdev, *s);
 }
 
-static inline uint32_t virtio_tswap32(VirtIODevice *vdev, uint32_t s)
+static inline uint32_t virtio_tswap32(const VirtIODevice *vdev, uint32_t s)
 {
 #ifdef HOST_WORDS_BIGENDIAN
     return virtio_access_is_big_endian(vdev) ? s : bswap32(s);
@@ -227,7 +227,7 @@ static inline void virtio_tswap32s(VirtIODevice *vdev, uint32_t *s)
     *s = virtio_tswap32(vdev, *s);
 }
 
-static inline uint64_t virtio_tswap64(VirtIODevice *vdev, uint64_t s)
+static inline uint64_t virtio_tswap64(const VirtIODevice *vdev, uint64_t s)
 {
 #ifdef HOST_WORDS_BIGENDIAN
     return virtio_access_is_big_endian(vdev) ? s : bswap64(s);
-- 
2.18.4
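
To illustrate why const is safe here: a tswap helper only reads the
device's endianness and returns a converted copy of the value. The
following standalone sketch mirrors that shape. It is not the QEMU
implementation; struct demo_dev and the demo_* functions are hypothetical,
and host endianness is probed at run time instead of using
HOST_WORDS_BIGENDIAN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for VirtIODevice: only the endianness matters. */
struct demo_dev {
    bool big_endian;
};

/* Run-time probe of host byte order (assumption: QEMU decides at build time). */
static bool demo_host_is_big_endian(void)
{
    const uint16_t probe = 1;

    return *(const uint8_t *)&probe == 0;
}

static uint16_t demo_bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* The device is only read, so a pointer to const is sufficient. */
static uint16_t demo_tswap16(const struct demo_dev *dev, uint16_t v)
{
    return dev->big_endian == demo_host_is_big_endian() ? v : demo_bswap16(v);
}
```

Because the parameter is const-qualified, callers holding only a
const pointer to the device, such as vhost_vring_write_descs() later in
this series, can use the helper without a cast.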




* [RFC PATCH 12/27] virtio: Add virtio_queue_full
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 11/27] virtio: const-ify all virtio_tswap* functions Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 13/27] vhost: Send buffers to device Eugenio Pérez
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h |  2 ++
 hw/virtio/virtio.c         | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b9b8497ea0..0a7f5cc63e 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -233,6 +233,8 @@ int virtio_queue_ready(VirtQueue *vq);
 
 int virtio_queue_empty(VirtQueue *vq);
 
+bool virtio_queue_full(const VirtQueue *vq);
+
 /* Host binding interface.  */
 
 uint32_t virtio_config_readb(VirtIODevice *vdev, uint32_t addr);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 3469946538..77ca5f6b6f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -684,6 +684,17 @@ int virtio_queue_empty(VirtQueue *vq)
     }
 }
 
+static bool virtio_queue_full_rcu(const VirtQueue *vq)
+{
+    return vq->inuse >= vq->vring.num;
+}
+
+bool virtio_queue_full(const VirtQueue *vq)
+{
+    RCU_READ_LOCK_GUARD();
+    return virtio_queue_full_rcu(vq);
+}
+
 static void virtqueue_unmap_sg(VirtQueue *vq, const VirtQueueElement *elem,
                                unsigned int len)
 {
@@ -1453,7 +1464,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
 
     max = vq->vring.num;
 
-    if (vq->inuse >= vq->vring.num) {
+    if (unlikely(virtio_queue_full_rcu(vq))) {
         virtio_error(vdev, "Virtqueue size exceeded");
         goto done;
     }
@@ -1588,7 +1599,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
 
     max = vq->vring.num;
 
-    if (vq->inuse >= vq->vring.num) {
+    if (unlikely(virtio_queue_full_rcu(vq))) {
         virtio_error(vdev, "Virtqueue size exceeded");
         goto done;
     }
-- 
2.18.4




* [RFC PATCH 13/27] vhost: Send buffers to device
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 12/27] virtio: Add virtio_queue_full Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  8:16   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 14/27] virtio: Remove virtio_queue_get_used_notify_split Eugenio Pérez
                   ` (18 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |   3 +
 hw/virtio/vhost-sw-lm-ring.c | 134 +++++++++++++++++++++++++++++++++--
 hw/virtio/vhost.c            |  59 ++++++++++++++-
 3 files changed, 189 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
index 86dc081b93..29d21feaf4 100644
--- a/hw/virtio/vhost-sw-lm-ring.h
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -18,6 +18,9 @@
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 bool vhost_vring_kick(VhostShadowVirtqueue *vq);
+int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem);
+void vhost_vring_write_addr(const VhostShadowVirtqueue *vq,
+                            struct vhost_vring_addr *addr);
 
 VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx);
 
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index cd7b5ba772..aed005c2d9 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -9,6 +9,7 @@
 
 #include "hw/virtio/vhost-sw-lm-ring.h"
 #include "hw/virtio/vhost.h"
+#include "hw/virtio/virtio-access.h"
 
 #include "standard-headers/linux/vhost_types.h"
 #include "standard-headers/linux/virtio_ring.h"
@@ -19,21 +20,140 @@ typedef struct VhostShadowVirtqueue {
     struct vring vring;
     EventNotifier hdev_notifier;
     VirtQueue *vq;
+    VirtIODevice *vdev;
+
+    /* Map for returning guest's descriptors */
+    VirtQueueElement **ring_id_maps;
+
+    /* Next head to expose to device */
+    uint16_t avail_idx_shadow;
+
+    /* Number of descriptors added since last notification */
+    uint16_t num_added;
+
+    /* Next free descriptor */
+    uint16_t free_head;
 
     vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
-static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
+static bool vhost_vring_should_kick_rcu(VhostShadowVirtqueue *vq)
 {
-    return virtio_queue_get_used_notify_split(vq->vq);
+    VirtIODevice *vdev = vq->vdev;
+    vq->num_added = 0;
+
+    smp_rmb();
+    return !(vq->vring.used->flags
+             & virtio_tswap16(vdev, VRING_USED_F_NO_NOTIFY));
 }
 
+static bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
+{
+    RCU_READ_LOCK_GUARD();
+    return vhost_vring_should_kick_rcu(vq);
+}
+
+
 bool vhost_vring_kick(VhostShadowVirtqueue *vq)
 {
     return vhost_vring_should_kick(vq) ? event_notifier_set(&vq->hdev_notifier)
                                        : true;
 }
 
+static void vhost_vring_write_descs(VhostShadowVirtqueue *vq,
+                                    const struct iovec *iovec,
+                                    size_t num, bool more_descs, bool write)
+{
+    uint16_t i = vq->free_head, last = vq->free_head;
+    unsigned n;
+    const VirtIODevice *vdev = vq->vdev;
+    uint16_t flags = write ? virtio_tswap16(vdev, VRING_DESC_F_WRITE) : 0;
+    vring_desc_t *descs = vq->vring.desc;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (n = 0; n < num; n++) {
+        if (more_descs || (n + 1 < num)) {
+            descs[i].flags = flags | virtio_tswap16(vdev, VRING_DESC_F_NEXT);
+        } else {
+            descs[i].flags = flags;
+        }
+        descs[i].addr = virtio_tswap64(vdev, (hwaddr)iovec[n].iov_base);
+        descs[i].len = virtio_tswap32(vdev, iovec[n].iov_len);
+
+        last = i;
+        i = virtio_tswap16(vdev, descs[i].next);
+    }
+
+    vq->free_head = virtio_tswap16(vdev, descs[last].next);
+}
+
+/* vhost_vring_add_split:
+ * @vq: The #VhostShadowVirtqueue
+ * @elem: The #VirtQueueElement
+ *
+ * Add an avail element to a virtqueue.
+ */
+static int vhost_vring_add_split(VhostShadowVirtqueue *vq,
+                                 const VirtQueueElement *elem)
+{
+    int head;
+    unsigned avail_idx;
+    const VirtIODevice *vdev;
+    vring_avail_t *avail;
+
+    RCU_READ_LOCK_GUARD();
+    vdev = vq->vdev;
+    avail = vq->vring.avail;
+
+    head = vq->free_head;
+
+    /* We need some descriptors here */
+    assert(elem->out_num || elem->in_num);
+
+    vhost_vring_write_descs(vq, elem->out_sg, elem->out_num,
+                   elem->in_num > 0, false);
+    vhost_vring_write_descs(vq, elem->in_sg, elem->in_num, false, true);
+
+    /* Put entry in available array (but don't update avail->idx until they
+     * do sync). */
+    avail_idx = vq->avail_idx_shadow & (vq->vring.num - 1);
+    avail->ring[avail_idx] = virtio_tswap16(vdev, head);
+    vq->avail_idx_shadow++;
+
+    /* Expose descriptors to device */
+    smp_wmb();
+    avail->idx = virtio_tswap16(vdev, vq->avail_idx_shadow);
+
+    /* Theoretically possible. Kick just in case. */
+    if (unlikely(vq->num_added++ == (uint16_t)-1)) {
+        vhost_vring_kick(vq);
+    }
+
+    return head;
+}
+
+int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem)
+{
+    int host_head = vhost_vring_add_split(vq, elem);
+    if (vq->ring_id_maps[host_head]) {
+        g_free(vq->ring_id_maps[host_head]);
+    }
+
+    vq->ring_id_maps[host_head] = elem;
+    return 0;
+}
+
+void vhost_vring_write_addr(const VhostShadowVirtqueue *vq,
+                            struct vhost_vring_addr *addr)
+{
+    addr->desc_user_addr = (uint64_t)vq->vring.desc;
+    addr->avail_user_addr = (uint64_t)vq->vring.avail;
+    addr->used_user_addr = (uint64_t)vq->vring.used;
+}
+
 VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
 {
     struct vhost_vring_file file = {
@@ -43,9 +163,11 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
     unsigned num = virtio_queue_get_num(dev->vdev, idx);
     size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
     VhostShadowVirtqueue *svq;
-    int r;
+    int r, i;
 
     svq = g_malloc0(sizeof(*svq) + ring_size);
+    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
+    svq->vdev = dev->vdev;
     svq->vq = vq;
 
     r = event_notifier_init(&svq->hdev_notifier, 0);
@@ -55,8 +177,9 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
     r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
     assert(r == 0);
 
-    vhost_virtqueue_mask(dev, dev->vdev, idx, true);
-    vhost_virtqueue_pending(dev, idx);
+    vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
+    for (i = 0; i < num - 1; i++)
+        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
 
     return svq;
 }
@@ -64,5 +187,6 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
 void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq)
 {
     event_notifier_cleanup(&vq->hdev_notifier);
+    g_free(vq->ring_id_maps);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 9352c56bfa..304e0baa61 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -956,8 +956,34 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
     uint16_t idx = virtio_get_queue_index(vq);
 
     VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
+    VirtQueueElement *elem;
 
-    vhost_vring_kick(svq);
+    /*
+     * Make as many buffers available as possible.
+     */
+    do {
+        if (virtio_queue_get_notification(vq)) {
+            virtio_queue_set_notification(vq, false);
+        }
+
+        while (true) {
+            int r;
+            if (virtio_queue_full(vq)) {
+                break;
+            }
+
+            elem = virtqueue_pop(vq, sizeof(*elem));
+            if (!elem) {
+                break;
+            }
+
+            r = vhost_vring_add(svq, elem);
+            assert(r >= 0);
+            vhost_vring_kick(svq);
+        }
+
+        virtio_queue_set_notification(vq, true);
+    } while(!virtio_queue_empty(vq));
 }
 
 static void vhost_handle_call(EventNotifier *n)
@@ -975,6 +1001,11 @@ static void vhost_handle_call(EventNotifier *n)
     }
 }
 
+static void vhost_virtqueue_stop(struct vhost_dev *dev,
+                                 struct VirtIODevice *vdev,
+                                 struct vhost_virtqueue *vq,
+                                 unsigned idx);
+
 static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 {
     int idx;
@@ -991,17 +1022,41 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 
 static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 {
-    int idx;
+    int idx, r;
+
+    assert(dev->vhost_ops->vhost_set_vring_enable);
+    dev->vhost_ops->vhost_set_vring_enable(dev, false);
 
     for (idx = 0; idx < dev->nvqs; ++idx) {
         struct vhost_virtqueue *vq = &dev->vqs[idx];
+        struct vhost_vring_addr addr = {
+            .index = idx,
+        };
+        struct vhost_vring_state s = {
+            .index = idx,
+        };
+
+        vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], idx);
 
         dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
         event_notifier_set_handler(&vq->masked_notifier, vhost_handle_call);
+
+        vhost_vring_write_addr(dev->sw_lm_shadow_vq[idx], &addr);
+        r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
+        assert(r == 0);
+
+        r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
+        assert(r == 0);
     }
 
+    dev->vhost_ops->vhost_set_vring_enable(dev, true);
     vhost_dev_disable_notifiers(dev, dev->vdev);
 
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_virtqueue_mask(dev, dev->vdev, idx, true);
+        vhost_virtqueue_pending(dev, idx);
+    }
+
     return 0;
 }
 
-- 
2.18.4
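
The bookkeeping in vhost_vring_add_split() relies on two things: free
descriptors are chained through their next fields, and avail_idx_shadow
is a free-running 16-bit counter that is masked with (num - 1) to select
the avail ring slot. The sketch below models only that bookkeeping. It is
a simplified illustration under stated assumptions: the demo_* names are
hypothetical, each element uses exactly one descriptor, the ring size is
a fixed power of two, and memory barriers, endianness conversion and
used-ring reclaim are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM 4u   /* ring size; must be a power of two for the mask */

struct demo_ring {
    uint16_t next[DEMO_NUM];    /* free-list links, like descs[i].next */
    uint16_t avail[DEMO_NUM];   /* descriptor heads exposed to the device */
    uint16_t avail_idx;         /* free-running counter, wraps at 65536 */
    uint16_t free_head;         /* first free descriptor */
};

/* Chain every descriptor to its successor; the last one links back to 0. */
static void demo_ring_init(struct demo_ring *r)
{
    for (uint16_t i = 0; i < DEMO_NUM; i++) {
        r->next[i] = (uint16_t)((i + 1) % DEMO_NUM);
        r->avail[i] = 0;
    }
    r->avail_idx = 0;
    r->free_head = 0;
}

/* Take one descriptor off the free list and publish it; returns its head. */
static uint16_t demo_ring_add(struct demo_ring *r)
{
    uint16_t head = r->free_head;

    r->free_head = r->next[head];
    r->avail[r->avail_idx & (DEMO_NUM - 1)] = head;
    r->avail_idx++;
    return head;
}
```

The mask keeps the slot index in range even when the counter wraps from
65535 to 0, which is why the ring size has to be a power of two. Nothing
in this sketch stops the producer from overrunning a full ring; in the
patch that role is played by the virtio_queue_full() check in
handle_sw_lm_vq().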




* [RFC PATCH 14/27] virtio: Remove virtio_queue_get_used_notify_split
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 13/27] vhost: Send buffers to device Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 15/27] vhost: Do not invalidate signalled used Eugenio Pérez
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

This function has no more users beyond this point.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h |  1 -
 hw/virtio/virtio.c         | 14 --------------
 2 files changed, 15 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 0a7f5cc63e..79212141a6 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -225,7 +225,6 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id);
 
 void virtio_notify_config(VirtIODevice *vdev);
 
-bool virtio_queue_get_used_notify_split(VirtQueue *vq);
 bool virtio_queue_get_notification(VirtQueue *vq);
 void virtio_queue_set_notification(VirtQueue *vq, int enable);
 
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 77ca5f6b6f..ad9dc5dfa7 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -377,20 +377,6 @@ static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
     vq->used_idx = val;
 }
 
-bool virtio_queue_get_used_notify_split(VirtQueue *vq)
-{
-    VRingMemoryRegionCaches *caches;
-    hwaddr pa = offsetof(VRingUsed, flags);
-    uint16_t flags;
-
-    RCU_READ_LOCK_GUARD();
-
-    caches = vring_get_region_caches(vq);
-    assert(caches);
-    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
-    return !(VRING_USED_F_NO_NOTIFY & flags);
-}
-
 /* Called within rcu_read_lock().  */
 static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask)
 {
-- 
2.18.4




* [RFC PATCH 15/27] vhost: Do not invalidate signalled used
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (13 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 14/27] virtio: Remove virtio_queue_get_used_notify_split Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element Eugenio Pérez
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Since we are in control of the guest's VQ again, we can trust it.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 304e0baa61..ac2bc14190 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -996,7 +996,6 @@ static void vhost_handle_call(EventNotifier *n)
     VirtQueue *vq = virtio_get_queue(vdev->vdev, idx);
 
     if (event_notifier_test_and_clear(n)) {
-        virtio_queue_invalidate_signalled_used(vdev->vdev, idx);
         virtio_notify_irqfd(vdev->vdev, vq);
     }
 }
-- 
2.18.4




* [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (14 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 15/27] vhost: Do not invalidate signalled used Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  8:25   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 17/27] vhost: add vhost_vring_set_notification_rcu Eugenio Pérez
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Specifying VirtQueueElement * as the return type does no harm at this moment.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h | 2 ++
 hw/virtio/virtio.c         | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 79212141a6..ee8fe96f32 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -196,6 +196,8 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx);
 
 void virtqueue_map(VirtIODevice *vdev, VirtQueueElement *elem);
+VirtQueueElement *virtqueue_alloc_element(size_t sz, unsigned out_num,
+                                          unsigned in_num);
 void *virtqueue_pop(VirtQueue *vq, size_t sz);
 unsigned int virtqueue_drop_all(VirtQueue *vq);
 void *qemu_get_virtqueue_element(VirtIODevice *vdev, QEMUFile *f, size_t sz);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ad9dc5dfa7..a89525f067 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1400,7 +1400,8 @@ void virtqueue_map(VirtIODevice *vdev, VirtQueueElement *elem)
                                                                         false);
 }
 
-static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
+VirtQueueElement *virtqueue_alloc_element(size_t sz, unsigned out_num,
+                                          unsigned in_num)
 {
     VirtQueueElement *elem;
     size_t in_addr_ofs = QEMU_ALIGN_UP(sz, __alignof__(elem->in_addr[0]));
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 17/27] vhost: add vhost_vring_set_notification_rcu
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (15 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu Eugenio Pérez
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |  4 ++++
 hw/virtio/vhost-sw-lm-ring.c | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
index 29d21feaf4..c537500d9e 100644
--- a/hw/virtio/vhost-sw-lm-ring.h
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -19,6 +19,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 bool vhost_vring_kick(VhostShadowVirtqueue *vq);
 int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem);
+
+/* Called within rcu_read_lock().  */
+void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable);
+
 void vhost_vring_write_addr(const VhostShadowVirtqueue *vq,
 	                    struct vhost_vring_addr *addr);
 
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index aed005c2d9..c3244c550e 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -34,6 +34,9 @@ typedef struct VhostShadowVirtqueue {
     /* Next free descriptor */
     uint16_t free_head;
 
+    /* Cache for exposed notification flag */
+    bool notification;
+
     vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
@@ -60,6 +63,26 @@ bool vhost_vring_kick(VhostShadowVirtqueue *vq)
                                        : true;
 }
 
+void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable)
+{
+    uint16_t notification_flag;
+
+    if (vq->notification == enable) {
+        return;
+    }
+
+    notification_flag = virtio_tswap16(vq->vdev, VRING_AVAIL_F_NO_INTERRUPT);
+
+    vq->notification = enable;
+    if (enable) {
+        vq->vring.avail->flags &= ~notification_flag;
+    } else {
+        vq->vring.avail->flags |= notification_flag;
+    }
+
+    smp_mb();
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *vq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
-- 
2.18.4
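
For readers following the series, the flag handling in
vhost_vring_set_notification_rcu can be sketched in isolation. This is
only a minimal sketch under stated assumptions, not the code from the
patch: sketch_avail, sketch_svq and sketch_set_notification are
hypothetical names, the virtio_tswap16() conversion is left out by
assuming a ring in native byte order, and a C11 fence stands in for
smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value the virtio spec assigns to VRING_AVAIL_F_NO_INTERRUPT. */
#define SKETCH_AVAIL_F_NO_INTERRUPT 1

/* Hypothetical, trimmed-down avail ring header (native endian assumed). */
struct sketch_avail {
    uint16_t flags;
    uint16_t idx;
};

/* Hypothetical shadow virtqueue: only the fields this sketch needs. */
struct sketch_svq {
    struct sketch_avail *avail;
    bool notification; /* cached view of the exposed flag */
};

static void sketch_set_notification(struct sketch_svq *vq, bool enable)
{
    if (vq->notification == enable) {
        return; /* cache already matches, skip the write to shared memory */
    }

    vq->notification = enable;
    if (enable) {
        vq->avail->flags &= (uint16_t)~SKETCH_AVAIL_F_NO_INTERRUPT;
    } else {
        vq->avail->flags |= SKETCH_AVAIL_F_NO_INTERRUPT;
    }

    /* Stand-in for smp_mb(): order the flag write before later ring reads. */
    atomic_thread_fence(memory_order_seq_cst);
}
```

One thing the sketch makes visible: the cache only helps if it starts
in sync with the ring. With a zero-initialised struct (notification ==
false) and flags == 0, a first call with enable == false returns early
and the NO_INTERRUPT bit is never set, so the initial value of the
cached field probably deserves a look.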



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (16 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 17/27] vhost: add vhost_vring_set_notification_rcu Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  8:41   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 19/27] vhost: add vhost_vring_get_buf_rcu Eugenio Pérez
                   ` (13 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |  2 ++
 hw/virtio/vhost-sw-lm-ring.c | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
index c537500d9e..03257d60c1 100644
--- a/hw/virtio/vhost-sw-lm-ring.h
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -22,6 +22,8 @@ int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem);
 
 /* Called within rcu_read_lock().  */
 void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable);
+/* Called within rcu_read_lock().  */
+bool vhost_vring_poll_rcu(VhostShadowVirtqueue *vq);
 
 void vhost_vring_write_addr(const VhostShadowVirtqueue *vq,
 	                    struct vhost_vring_addr *addr);
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index c3244c550e..3652889d8e 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -34,6 +34,12 @@ typedef struct VhostShadowVirtqueue {
     /* Next free descriptor */
     uint16_t free_head;
 
+    /* Last seen used idx */
+    uint16_t shadow_used_idx;
+
+    /* Next head to consume from device */
+    uint16_t used_idx;
+
     /* Cache for exposed notification flag */
     bool notification;
 
@@ -83,6 +89,18 @@ void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable)
     smp_mb();
 }
 
+bool vhost_vring_poll_rcu(VhostShadowVirtqueue *vq)
+{
+    if (vq->used_idx != vq->shadow_used_idx) {
+        return true;
+    }
+
+    smp_rmb();
+    vq->shadow_used_idx = virtio_tswap16(vq->vdev, vq->vring.used->idx);
+
+    return vq->used_idx != vq->shadow_used_idx;
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *vq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
-- 
2.18.4
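
The two indexes added here can be illustrated with a small standalone
sketch. It assumes a native-endian used ring and uses hypothetical
names (sketch_used_hdr, sketch_svq, sketch_poll); a C11 acquire fence
stands in for smp_rmb(). The point it tries to show is that the shared
used->idx is only re-read when the cached copy says there is nothing
left, and that comparing with != keeps working when the 16-bit index
wraps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical used ring header; the device is the only writer of idx. */
struct sketch_used_hdr {
    uint16_t flags;
    uint16_t idx;
};

struct sketch_svq {
    const struct sketch_used_hdr *used;
    uint16_t shadow_used_idx; /* last value read from used->idx */
    uint16_t used_idx;        /* next entry we will consume */
};

static bool sketch_poll(struct sketch_svq *vq)
{
    /* Entries already known to be pending: no need to touch shared memory. */
    if (vq->used_idx != vq->shadow_used_idx) {
        return true;
    }

    /* Stand-in for smp_rmb() before refreshing the cached index. */
    atomic_thread_fence(memory_order_acquire);
    vq->shadow_used_idx = vq->used->idx;

    return vq->used_idx != vq->shadow_used_idx;
}
```

Both counters are free-running uint16_t values, so the difference
shadow_used_idx - used_idx (mod 65536) is the number of pending
entries as long as the ring size does not exceed 32768, which the
split ring layout already guarantees.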



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 19/27] vhost: add vhost_vring_get_buf_rcu
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (17 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:50 ` [RFC PATCH 20/27] vhost: Return used buffers Eugenio Pérez
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |  1 +
 hw/virtio/vhost-sw-lm-ring.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
index 03257d60c1..429a125558 100644
--- a/hw/virtio/vhost-sw-lm-ring.h
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -19,6 +19,7 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 bool vhost_vring_kick(VhostShadowVirtqueue *vq);
 int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem);
+VirtQueueElement *vhost_vring_get_buf_rcu(VhostShadowVirtqueue *vq, size_t sz);
 
 /* Called within rcu_read_lock().  */
 void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable);
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index 3652889d8e..4fafd1b278 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -187,6 +187,35 @@ int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem)
     return 0;
 }
 
+VirtQueueElement *vhost_vring_get_buf_rcu(VhostShadowVirtqueue *vq, size_t sz)
+{
+    const vring_used_t *used = vq->vring.used;
+    VirtQueueElement *ret;
+    vring_used_elem_t used_elem;
+    uint16_t last_used;
+
+    if (!vhost_vring_poll_rcu(vq)) {
+        return NULL;
+    }
+
+    last_used = vq->used_idx & (vq->vring.num - 1);
+    used_elem.id = virtio_tswap32(vq->vdev, used->ring[last_used].id);
+    used_elem.len = virtio_tswap32(vq->vdev, used->ring[last_used].len);
+
+    if (unlikely(used_elem.id >= vq->vring.num)) {
+        virtio_error(vq->vdev, "Device says index %u is available",
+                     used_elem.id);
+        return NULL;
+    }
+
+    ret = vq->ring_id_maps[used_elem.id];
+    ret->len = used_elem.len;
+
+    vq->used_idx++;
+
+    return ret;
+}
+
 void vhost_vring_write_addr(const VhostShadowVirtqueue *vq,
                             struct vhost_vring_addr *addr)
 {
-- 
2.18.4
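
As an illustration of the lookup done in vhost_vring_get_buf_rcu, here
is a minimal sketch with hypothetical names and a fixed ring size. It
assumes a native-endian ring, replaces virtio_error() with a plain
NULL return, and folds the poll step into a direct index comparison,
so it should be read as an approximation of the flow rather than the
patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM 4 /* ring size; must be a power of two */

struct sketch_used_elem {
    uint32_t id;  /* head descriptor id written by the device */
    uint32_t len; /* bytes written by the device */
};

struct sketch_svq {
    struct sketch_used_elem ring[SKETCH_NUM];
    uint16_t device_idx;      /* stands in for used->idx */
    uint16_t used_idx;        /* next entry to consume */
    void *id_map[SKETCH_NUM]; /* head id -> element saved at add time */
};

/* Returns the element for the next used entry, or NULL if none is valid. */
static void *sketch_get_buf(struct sketch_svq *vq, uint32_t *len)
{
    const struct sketch_used_elem *e;

    if (vq->used_idx == vq->device_idx) {
        return NULL; /* nothing pending */
    }

    /* Free-running index masked down to a slot in the ring. */
    e = &vq->ring[vq->used_idx & (SKETCH_NUM - 1)];

    if (e->id >= SKETCH_NUM) {
        return NULL; /* device reported an id outside the ring */
    }

    *len = e->len;
    vq->used_idx++;
    return vq->id_map[e->id];
}
```

Like the patch, the sketch returns before advancing used_idx when the
id is out of range, so a bad entry is never consumed and every later
call keeps failing on it. The masking with (num - 1) relies on the
ring size being a power of two, which holds for split virtqueues.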



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 20/27] vhost: Return used buffers
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (18 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 19/27] vhost: add vhost_vring_get_buf_rcu Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-12-08  8:50   ` Stefan Hajnoczi
  2020-11-20 18:50 ` [RFC PATCH 21/27] vhost: Add vhost_virtqueue_memory_unmap Eugenio Pérez
                   ` (11 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-sw-lm-ring.h |  3 +++
 hw/virtio/vhost-sw-lm-ring.c | 14 +++++++----
 hw/virtio/vhost.c            | 46 +++++++++++++++++++++++++++++++++---
 3 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
index 429a125558..0c4fa772c7 100644
--- a/hw/virtio/vhost-sw-lm-ring.h
+++ b/hw/virtio/vhost-sw-lm-ring.h
@@ -17,6 +17,9 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+VirtIODevice *vhost_vring_vdev(VhostShadowVirtqueue *svq);
+VirtQueue *vhost_vring_vdev_vq(VhostShadowVirtqueue *svq);
+
 bool vhost_vring_kick(VhostShadowVirtqueue *vq);
 int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem);
 VirtQueueElement *vhost_vring_get_buf_rcu(VhostShadowVirtqueue *vq, size_t sz);
diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
index 4fafd1b278..244c722910 100644
--- a/hw/virtio/vhost-sw-lm-ring.c
+++ b/hw/virtio/vhost-sw-lm-ring.c
@@ -46,6 +46,16 @@ typedef struct VhostShadowVirtqueue {
     vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
+VirtIODevice *vhost_vring_vdev(VhostShadowVirtqueue *svq)
+{
+    return svq->vdev;
+}
+
+VirtQueue *vhost_vring_vdev_vq(VhostShadowVirtqueue *svq)
+{
+    return svq->vq;
+}
+
 static bool vhost_vring_should_kick_rcu(VhostShadowVirtqueue *vq)
 {
     VirtIODevice *vdev = vq->vdev;
@@ -179,10 +189,6 @@ static int vhost_vring_add_split(VhostShadowVirtqueue *vq,
 int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem)
 {
     int host_head = vhost_vring_add_split(vq, elem);
-    if (vq->ring_id_maps[host_head]) {
-        g_free(vq->ring_id_maps[host_head]);
-    }
-
     vq->ring_id_maps[host_head] = elem;
     return 0;
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ac2bc14190..9a3c580dcf 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -986,17 +986,50 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
     } while(!virtio_queue_empty(vq));
 }
 
+static void handle_sw_lm_vq_call(struct vhost_dev *hdev,
+                                 VhostShadowVirtqueue *svq)
+{
+    VirtQueueElement *elem;
+    VirtIODevice *vdev = vhost_vring_vdev(svq);
+    VirtQueue *vq = vhost_vring_vdev_vq(svq);
+    uint16_t idx = virtio_get_queue_index(vq);
+
+    RCU_READ_LOCK_GUARD();
+    /*
+     * Mark as many buffers as possible as used.
+     */
+    do {
+        unsigned i = 0;
+
+        vhost_vring_set_notification_rcu(svq, false);
+        while (true) {
+            elem = vhost_vring_get_buf_rcu(svq, sizeof(*elem));
+            if (!elem) {
+                break;
+            }
+
+            assert(i < virtio_queue_get_num(vdev, idx));
+            virtqueue_fill(vq, elem, elem->len, i++);
+        }
+
+        virtqueue_flush(vq, i);
+        virtio_notify_irqfd(vdev, vq);
+
+        vhost_vring_set_notification_rcu(svq, true);
+    } while (vhost_vring_poll_rcu(svq));
+}
+
 static void vhost_handle_call(EventNotifier *n)
 {
     struct vhost_virtqueue *hvq = container_of(n,
                                               struct vhost_virtqueue,
                                               masked_notifier);
     struct vhost_dev *vdev = hvq->dev;
-    int idx = vdev->vq_index + (hvq == &vdev->vqs[0] ? 0 : 1);
-    VirtQueue *vq = virtio_get_queue(vdev->vdev, idx);
+    int idx = hvq == &vdev->vqs[0] ? 0 : 1;
+    VhostShadowVirtqueue *vq = vdev->sw_lm_shadow_vq[idx];
 
     if (event_notifier_test_and_clear(n)) {
-        virtio_notify_irqfd(vdev->vdev, vq);
+        handle_sw_lm_vq_call(vdev, vq);
     }
 }
 
@@ -1028,6 +1061,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 
     for (idx = 0; idx < dev->nvqs; ++idx) {
         struct vhost_virtqueue *vq = &dev->vqs[idx];
+        unsigned num = virtio_queue_get_num(dev->vdev, idx);
         struct vhost_vring_addr addr = {
             .index = idx,
         };
@@ -1044,6 +1078,12 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
         r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
         assert(r == 0);
 
+        r = vhost_backend_update_device_iotlb(dev, addr.used_user_addr,
+                                              addr.used_user_addr,
+                                              sizeof(vring_used_elem_t) * num,
+                                              IOMMU_RW);
+        assert(r == 0);
+
         r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
         assert(r == 0);
     }
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 21/27] vhost: Add vhost_virtqueue_memory_unmap
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (19 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 20/27] vhost: Return used buffers Eugenio Pérez
@ 2020-11-20 18:50 ` Eugenio Pérez
  2020-11-20 18:51 ` [RFC PATCH 22/27] vhost: Add vhost_virtqueue_memory_map Eugenio Pérez
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

This is not a huge gain but helps in later changes.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 9a3c580dcf..eafbbaa751 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -812,6 +812,21 @@ static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
     return 0;
 }
 
+static void vhost_virtqueue_memory_unmap(struct vhost_dev *dev,
+                                        struct vhost_virtqueue *vq,
+                                        bool used_is_dirty)
+{
+    if (vq->used) {
+        vhost_memory_unmap(dev, vq->used, vq->used_size, used_is_dirty, 0);
+    }
+    if (vq->avail) {
+        vhost_memory_unmap(dev, vq->avail, vq->avail_size, 0, 0);
+    }
+    if (vq->desc) {
+        vhost_memory_unmap(dev, vq->desc, vq->desc_size, 0, 0);
+    }
+}
+
 static int vhost_dev_set_features(struct vhost_dev *dev,
                                   bool enable_log)
 {
@@ -1301,21 +1316,21 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
     vq->desc = vhost_memory_map(dev, a, &l, false);
     if (!vq->desc || l != s) {
         r = -ENOMEM;
-        goto fail_alloc_desc;
+        goto fail_alloc;
     }
     vq->avail_size = s = l = virtio_queue_get_avail_size(vdev, idx);
     vq->avail_phys = a = virtio_queue_get_avail_addr(vdev, idx);
     vq->avail = vhost_memory_map(dev, a, &l, false);
     if (!vq->avail || l != s) {
         r = -ENOMEM;
-        goto fail_alloc_avail;
+        goto fail_alloc;
     }
     vq->used_size = s = l = virtio_queue_get_used_size(vdev, idx);
     vq->used_phys = a = virtio_queue_get_used_addr(vdev, idx);
     vq->used = vhost_memory_map(dev, a, &l, true);
     if (!vq->used || l != s) {
         r = -ENOMEM;
-        goto fail_alloc_used;
+        goto fail_alloc;
     }
 
     r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
@@ -1358,15 +1373,7 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
 fail_vector:
 fail_kick:
 fail_alloc:
-    vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
-                       0, 0);
-fail_alloc_used:
-    vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, idx),
-                       0, 0);
-fail_alloc_avail:
-    vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, idx),
-                       0, 0);
-fail_alloc_desc:
+    vhost_virtqueue_memory_unmap(dev, vq, false);
     return r;
 }
 
@@ -1408,12 +1415,7 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
                                                 vhost_vq_index);
     }
 
-    vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
-                       1, virtio_queue_get_used_size(vdev, idx));
-    vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, idx),
-                       0, virtio_queue_get_avail_size(vdev, idx));
-    vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, idx),
-                       0, virtio_queue_get_desc_size(vdev, idx));
+    vhost_virtqueue_memory_unmap(dev, vq, true);
 }
 
 static void vhost_eventfd_add(MemoryListener *listener,
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 22/27] vhost: Add vhost_virtqueue_memory_map
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (20 preceding siblings ...)
  2020-11-20 18:50 ` [RFC PATCH 21/27] vhost: Add vhost_virtqueue_memory_unmap Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-11-20 18:51 ` [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration Eugenio Pérez
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

To make it symmetric with _unmap.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 68 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 44 insertions(+), 24 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eafbbaa751..f640d4edf0 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -827,6 +827,42 @@ static void vhost_virtqueue_memory_unmap(struct vhost_dev *dev,
     }
 }
 
+static int vhost_virtqueue_memory_map(struct vhost_dev *dev,
+                                      struct vhost_virtqueue *vq,
+                                      unsigned int vhost_vq_index)
+{
+    int r;
+    hwaddr a, l, s;
+
+    a = vq->desc_phys;
+    s = l = vq->desc_size;
+    vq->desc = vhost_memory_map(dev, a, &l, false);
+    if (!vq->desc || l != s) {
+        return -ENOMEM;
+    }
+
+    a = vq->avail_phys;
+    s = l = vq->avail_size;
+    vq->avail = vhost_memory_map(dev, a, &l, false);
+    if (!vq->avail || l != s) {
+        return -ENOMEM;
+    }
+
+    a = vq->used_phys;
+    s = l = vq->used_size;
+    vq->used = vhost_memory_map(dev, a, &l, true);
+    if (!vq->used || l != s) {
+        return -ENOMEM;
+    }
+
+    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
+    if (r < 0) {
+        return -errno;
+    }
+
+    return 0;
+}
+
 static int vhost_dev_set_features(struct vhost_dev *dev,
                                   bool enable_log)
 {
@@ -1271,7 +1307,7 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     VirtioBusState *vbus = VIRTIO_BUS(qbus);
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
-    hwaddr s, l, a;
+    hwaddr a;
     int r;
     int vhost_vq_index = dev->vhost_ops->vhost_get_vq_index(dev, idx);
     struct vhost_vring_file file = {
@@ -1311,31 +1347,15 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
         }
     }
 
-    vq->desc_size = s = l = virtio_queue_get_desc_size(vdev, idx);
+    vq->desc_size = virtio_queue_get_desc_size(vdev, idx);
     vq->desc_phys = a;
-    vq->desc = vhost_memory_map(dev, a, &l, false);
-    if (!vq->desc || l != s) {
-        r = -ENOMEM;
-        goto fail_alloc;
-    }
-    vq->avail_size = s = l = virtio_queue_get_avail_size(vdev, idx);
-    vq->avail_phys = a = virtio_queue_get_avail_addr(vdev, idx);
-    vq->avail = vhost_memory_map(dev, a, &l, false);
-    if (!vq->avail || l != s) {
-        r = -ENOMEM;
-        goto fail_alloc;
-    }
-    vq->used_size = s = l = virtio_queue_get_used_size(vdev, idx);
-    vq->used_phys = a = virtio_queue_get_used_addr(vdev, idx);
-    vq->used = vhost_memory_map(dev, a, &l, true);
-    if (!vq->used || l != s) {
-        r = -ENOMEM;
-        goto fail_alloc;
-    }
+    vq->avail_size = virtio_queue_get_avail_size(vdev, idx);
+    vq->avail_phys = virtio_queue_get_avail_addr(vdev, idx);
+    vq->used_size = virtio_queue_get_used_size(vdev, idx);
+    vq->used_phys = virtio_queue_get_used_addr(vdev, idx);
 
-    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
-    if (r < 0) {
-        r = -errno;
+    r = vhost_virtqueue_memory_map(dev, vq, vhost_vq_index);
+    if (r) {
         goto fail_alloc;
     }
 
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (21 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 22/27] vhost: Add vhost_virtqueue_memory_map Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-11-27 15:29   ` Stefano Garzarella
  2020-11-20 18:51 ` [RFC PATCH 24/27] vhost: iommu changes Eugenio Pérez
                   ` (8 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Since vhost does not need to access it, it makes no sense to keep it
mapped.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index f640d4edf0..eebfac4455 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1124,6 +1124,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 
         dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
         event_notifier_set_handler(&vq->masked_notifier, vhost_handle_call);
+        vhost_virtqueue_memory_unmap(dev, &dev->vqs[idx], true);
 
         vhost_vring_write_addr(dev->sw_lm_shadow_vq[idx], &addr);
         r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 24/27] vhost: iommu changes
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (22 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-12-08  9:02   ` Stefan Hajnoczi
  2020-11-20 18:51 ` [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop Eugenio Pérez
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

Since vhost is now asking for qemu's VA, iommu needs to be bypassed.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eebfac4455..cb44b9997f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1109,6 +1109,10 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 
     assert(dev->vhost_ops->vhost_set_vring_enable);
     dev->vhost_ops->vhost_set_vring_enable(dev, false);
+    if (vhost_dev_has_iommu(dev)) {
+        r = vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL);
+        assert(r == 0);
+    }
 
     for (idx = 0; idx < dev->nvqs; ++idx) {
         struct vhost_virtqueue *vq = &dev->vqs[idx];
@@ -1269,6 +1273,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 
     trace_vhost_iotlb_miss(dev, 1);
 
+    if (dev->sw_lm_enabled) {
+        uaddr = iova;
+        len = 4096;
+        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
+                                                IOMMU_RW);
+        if (ret) {
+            trace_vhost_iotlb_miss(dev, 2);
+            error_report("Fail to update device iotlb");
+        }
+
+        return ret;
+    }
+
     iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
                                           iova, write,
                                           MEMTXATTRS_UNSPECIFIED);
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (23 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 24/27] vhost: iommu changes Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-11-20 19:35   ` Eugenio Perez Martin
  2020-11-20 18:51 ` [RFC PATCH 26/27] vhost: Add vhost_hdev_can_sw_lm Eugenio Pérez
                   ` (6 subsequent siblings)
  31 siblings, 1 reply; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

... if sw lm is enabled

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index cb44b9997f..cf000b979f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1424,17 +1424,22 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
     struct vhost_vring_state state = {
         .index = vhost_vq_index,
     };
-    int r;
+    int r = -1;
 
     if (virtio_queue_get_desc_addr(vdev, idx) == 0) {
         /* Don't stop the virtqueue which might have not been started */
         return;
     }
 
-    r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
-    if (r < 0) {
-        VHOST_OPS_DEBUG("vhost VQ %u ring restore failed: %d", idx, r);
-        /* Connection to the backend is broken, so let's sync internal
+    if (!dev->sw_lm_enabled) {
+        r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
+        if (r < 0) {
+            VHOST_OPS_DEBUG("vhost VQ %u ring restore failed: %d", idx, r);
+        }
+    }
+
+    if (!dev->sw_lm_enabled || r < 0) {
+        /* Connection to the backend is unusable, so let's sync internal
          * last avail idx to the device used idx.
          */
         virtio_queue_restore_last_avail_idx(vdev, idx);
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 26/27] vhost: Add vhost_hdev_can_sw_lm
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (24 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-11-20 18:51 ` [RFC PATCH 27/27] vhost: forbid vhost devices logging Eugenio Pérez
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

This allows a device to migrate if it meets a few requirements.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index cf000b979f..44a51ccf5e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1529,6 +1529,37 @@ static void vhost_virtqueue_cleanup(struct vhost_virtqueue *vq)
     event_notifier_cleanup(&vq->masked_notifier);
 }
 
+static bool vhost_hdev_can_sw_lm(struct vhost_dev *hdev)
+{
+    const char *cause = NULL;
+
+    if (hdev->features & (VIRTIO_F_IOMMU_PLATFORM)) {
+        cause = "have iommu";
+    } else if (hdev->features & VIRTIO_F_RING_PACKED) {
+        cause = "is packed";
+    } else if (hdev->features & VIRTIO_RING_F_EVENT_IDX) {
+        cause = "Have event idx";
+    } else if (hdev->features & VIRTIO_RING_F_INDIRECT_DESC) {
+        cause = "Supports indirect descriptors";
+    } else if (hdev->nvqs != 2) {
+        cause = "!= 2 #vq supported";
+    } else if (!hdev->vhost_ops->vhost_net_set_backend) {
+        cause = "cannot pause device";
+    }
+
+    if (cause) {
+        if (!hdev->migration_blocker) {
+            error_setg(&hdev->migration_blocker,
+                "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature and %s.",
+                cause);
+        }
+
+        return false;
+    }
+
+    return true;
+}
+
 int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
                    VhostBackendType backend_type, uint32_t busyloop_timeout)
 {
@@ -1604,7 +1635,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     };
 
     if (hdev->migration_blocker == NULL) {
-        if (!vhost_dev_can_log(hdev)) {
+        if (!vhost_dev_can_log(hdev) && !vhost_hdev_can_sw_lm(hdev)
+            && hdev->migration_blocker == NULL) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.");
         } else if (vhost_dev_log_is_shared(hdev) && !qemu_memfd_alloc_check()) {
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC PATCH 27/27] vhost: forbid vhost devices logging
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (25 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 26/27] vhost: Add vhost_hdev_can_sw_lm Eugenio Pérez
@ 2020-11-20 18:51 ` Eugenio Pérez
  2020-11-20 19:03 ` [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Perez Martin
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Pérez @ 2020-11-20 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Michael S. Tsirkin, Jason Wang, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard, Lars Ganrot,
	Rob Miller, Stefano Garzarella, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

This is NOT a commit intended for merge, but for testing the patchset.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 44a51ccf5e..069e5c915d 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -79,7 +79,7 @@ static struct vhost_dev *vhost_dev_from_virtio(const VirtIODevice *vdev)
 
 static bool vhost_dev_can_log(const struct vhost_dev *hdev)
 {
-    return hdev->features & (0x1ULL << VHOST_F_LOG_ALL);
+    return false;
 }
 
 static void vhost_dev_sync_region(struct vhost_dev *dev,
-- 
2.18.4



^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (26 preceding siblings ...)
  2020-11-20 18:51 ` [RFC PATCH 27/27] vhost: forbid vhost devices logging Eugenio Pérez
@ 2020-11-20 19:03 ` Eugenio Perez Martin
  2020-11-20 19:30 ` no-reply
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-11-20 19:03 UTC (permalink / raw)
  To: qemu-level
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Rob Miller, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Alex Barba, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Jim Harford, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Stephen Finucane, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

The main intention of this POC/RFC is to serve as a base for
implementing SW LM in vdpa devices.

To implement it in vhost-vdpa devices, the top priority is an
interface for vdpa to stop the device without losing its state, i.e.,
the avail_idx the destination device should start fetching descriptors
from. Following this POC, an implementation on vdpa_sim will be
done.

Apart from that, there is a TODO list for this series. The items will be
addressed once the code is considered valid. They don't affect the device,
only qemu's internal code, and in case of a change of direction they are
easy to modify or delete. Comments about these are welcome.

- Currently, it hijacks the log mechanism to know when migration is
starting/done. Maybe it would be cleaner to forward migrate status
from virtio_vmstate_change, since there is no need for the memory
listener. However, this could make "memory backend" abstraction (also
TODO) more complicated. This would drop patches 2 and 3 entirely.
- There is a reverse search in a list on "vhost_dev_from_virtio" for
each notification. Not really efficient, and it leads to a race
condition at device destruction.
- Implement new capabilities (no iommu, packed vq, event_idx, ...)
- A lot of assertions need to be converted to proper error handling.

Thanks!

On Fri, Nov 20, 2020 at 8:02 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net devices, which already
> have dirty page logging capabilities. None of the changes have actual
> effect with current devices until last two patches (26 and 27) are
> applied, but they can be rebased on top of any other. These checks the
> device to meet all requirements, and disable vhost-net devices logging
> so migration goes through SW LM. This last patch is not meant to be
> applied in the final revision, it is in the series just for testing
> purposes.
>
> For use SW assisted LM these vhost-net devices need to be instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 9, and starting migration. Partial applies
> between 13 and 24 will not work while migrating on source, and patch
> 25 is needed for the destination to resume network activity.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
> Comments are welcome.
>
> Thanks!
>
> Eugenio Pérez (27):
>   vhost: Add vhost_dev_can_log
>   vhost: Add device callback in vhost_migration_log
>   vhost: Move log resize/put to vhost_dev_set_log
>   vhost: add vhost_kernel_set_vring_enable
>   vhost: Add hdev->dev.sw_lm_vq_handler
>   virtio: Add virtio_queue_get_used_notify_split
>   vhost: Route guest->host notification through qemu
>   vhost: Add a flag for software assisted Live Migration
>   vhost: Route host->guest notification through qemu
>   vhost: Allocate shadow vring
>   virtio: const-ify all virtio_tswap* functions
>   virtio: Add virtio_queue_full
>   vhost: Send buffers to device
>   virtio: Remove virtio_queue_get_used_notify_split
>   vhost: Do not invalidate signalled used
>   virtio: Expose virtqueue_alloc_element
>   vhost: add vhost_vring_set_notification_rcu
>   vhost: add vhost_vring_poll_rcu
>   vhost: add vhost_vring_get_buf_rcu
>   vhost: Return used buffers
>   vhost: Add vhost_virtqueue_memory_unmap
>   vhost: Add vhost_virtqueue_memory_map
>   vhost: unmap qemu's shadow virtqueues on sw live migration
>   vhost: iommu changes
>   vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>   vhost: Add vhost_hdev_can_sw_lm
>   vhost: forbid vhost devices logging
>
>  hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>  include/hw/virtio/vhost.h         |   5 +
>  include/hw/virtio/virtio-access.h |   8 +-
>  include/hw/virtio/virtio.h        |   4 +
>  hw/net/virtio-net.c               |  39 ++-
>  hw/virtio/vhost-backend.c         |  29 ++
>  hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>  hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>  hw/virtio/virtio.c                |  18 +-
>  hw/virtio/meson.build             |   2 +-
>  10 files changed, 758 insertions(+), 85 deletions(-)
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>
> --
> 2.18.4
>
>



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (27 preceding siblings ...)
  2020-11-20 19:03 ` [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Perez Martin
@ 2020-11-20 19:30 ` no-reply
  2020-11-25  7:08 ` Jason Wang
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 81+ messages in thread
From: no-reply @ 2020-11-20 19:30 UTC (permalink / raw)
  To: eperezma
  Cc: kvm, mst, jasowang, qemu-devel, dandaly0, virtualization,
	liralon, eli, nitin.shrivastav, rob.miller, cfontain, quintela,
	ballle98, lars.ganrot, alex.barba, sgarzare, howard.cai, parav,
	vmireyno, mehta.salil.lnk, jim.harford, xiao.w.wang, smooney,
	stefanha, stephenfin, dmytro.kazantsev, loseweigh, hanand, ml,
	maxgu14

Patchew URL: https://patchew.org/QEMU/20201120185105.279030-1-eperezma@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201120185105.279030-1-eperezma@redhat.com
Subject: [RFC PATCH 00/27] vDPA software assisted live migration

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20201117173635.29101-1-alex.bennee@linaro.org -> patchew/20201117173635.29101-1-alex.bennee@linaro.org
 * [new tag]         patchew/20201120185105.279030-1-eperezma@redhat.com -> patchew/20201120185105.279030-1-eperezma@redhat.com
Switched to a new branch 'test'
af2fe22 vhost: forbid vhost devices logging
405925c vhost: Add vhost_hdev_can_sw_lm
7f2955b vhost: Do not commit vhost used idx on vhost_virtqueue_stop
b68d3d5 vhost: iommu changes
74f282a vhost: unmap qemu's shadow virtqueues on sw live migration
c999e86 vhost: Add vhost_virtqueue_memory_map
6e5f219 vhost: Add vhost_virtqueue_memory_unmap
d5054c8 vhost: Return used buffers
806db46 vhost: add vhost_vring_get_buf_rcu
b6b8168 vhost: add vhost_vring_poll_rcu
ca44882 vhost: add vhost_vring_set_notification_rcu
3183f62 virtio: Expose virtqueue_alloc_element
4ead0ac vhost: Do not invalidate signalled used
ceb76a4 virtio: Remove virtio_queue_get_used_notify_split
6aacdfe vhost: Send buffers to device
9fcc98d virtio: Add virtio_queue_full
41da0f8 virtio: const-ify all virtio_tswap* functions
a3c92f1 vhost: Allocate shadow vring
b40b3f7 vhost: Route host->guest notification through qemu
92ea117 vhost: Add a flag for software assisted Live Migration
ace8c10 vhost: Route guest->host notification through qemu
6be11a6 virtio: Add virtio_queue_get_used_notify_split
47d4485 vhost: Add hdev->dev.sw_lm_vq_handler
7dbc8e5 vhost: add vhost_kernel_set_vring_enable
4e51c5c vhost: Move log resize/put to vhost_dev_set_log
b964abc vhost: Add device callback in vhost_migration_log
264504e vhost: Add vhost_dev_can_log

=== OUTPUT BEGIN ===
1/27 Checking commit 264504ee0018 (vhost: Add vhost_dev_can_log)
2/27 Checking commit b964abc315bd (vhost: Add device callback in vhost_migration_log)
3/27 Checking commit 4e51c5cec56d (vhost: Move log resize/put to vhost_dev_set_log)
4/27 Checking commit 7dbc8e5c64c3 (vhost: add vhost_kernel_set_vring_enable)
5/27 Checking commit 47d4485458d1 (vhost: Add hdev->dev.sw_lm_vq_handler)
6/27 Checking commit 6be11a63e3b4 (virtio: Add virtio_queue_get_used_notify_split)
7/27 Checking commit ace8c1034f83 (vhost: Route guest->host notification through qemu)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#24: 
new file mode 100644

total: 0 errors, 1 warnings, 245 lines checked

Patch 7/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/27 Checking commit 92ea11700e32 (vhost: Add a flag for software assisted Live Migration)
WARNING: Block comments use a leading /* on a separate line
#46: FILE: hw/virtio/vhost.c:1581:
+        /* We've been called after migration is completed, so no need to

WARNING: Block comments use * on subsequent lines
#47: FILE: hw/virtio/vhost.c:1582:
+        /* We've been called after migration is completed, so no need to
+           disable it again

total: 0 errors, 2 warnings, 45 lines checked

Patch 8/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
9/27 Checking commit b40b3f79355c (vhost: Route host->guest notification through qemu)
10/27 Checking commit a3c92f15b554 (vhost: Allocate shadow vring)
11/27 Checking commit 41da0f8e02b2 (virtio: const-ify all virtio_tswap* functions)
12/27 Checking commit 9fcc98da9cd9 (virtio: Add virtio_queue_full)
13/27 Checking commit 6aacdfe0e0ba (vhost: Send buffers to device)
ERROR: memory barrier without comment
#50: FILE: hw/virtio/vhost-sw-lm-ring.c:45:
+    smp_rmb();

WARNING: Block comments use a leading /* on a separate line
#98: FILE: hw/virtio/vhost-sw-lm-ring.c:93:
+/* virtqueue_add:

WARNING: Block comments use a leading /* on a separate line
#125: FILE: hw/virtio/vhost-sw-lm-ring.c:120:
+    /* Put entry in available array (but don't update avail->idx until they

WARNING: Block comments use a trailing */ on a separate line
#126: FILE: hw/virtio/vhost-sw-lm-ring.c:121:
+     * do sync). */

ERROR: g_free(NULL) is safe this check is probably not required
#147: FILE: hw/virtio/vhost-sw-lm-ring.c:142:
+    if (vq->ring_id_maps[host_head]) {
+        g_free(vq->ring_id_maps[host_head]);

ERROR: braces {} are necessary for all arms of this statement
#185: FILE: hw/virtio/vhost-sw-lm-ring.c:181:
+    for (i = 0; i < num - 1; i++)
[...]

ERROR: code indent should never use tabs
#207: FILE: hw/virtio/vhost-sw-lm-ring.h:23:
+^I                    struct vhost_vring_addr *addr);$

ERROR: space required before the open parenthesis '('
#247: FILE: hw/virtio/vhost.c:986:
+    } while(!virtio_queue_empty(vq));

total: 5 errors, 3 warnings, 275 lines checked

Patch 13/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

14/27 Checking commit ceb76a4401b8 (virtio: Remove virtio_queue_get_used_notify_split)
15/27 Checking commit 4ead0ac8457f (vhost: Do not invalidate signalled used)
16/27 Checking commit 3183f62db3dc (virtio: Expose virtqueue_alloc_element)
17/27 Checking commit ca44882af152 (vhost: add vhost_vring_set_notification_rcu)
ERROR: memory barrier without comment
#45: FILE: hw/virtio/vhost-sw-lm-ring.c:83:
+    smp_mb();

total: 1 errors, 0 warnings, 45 lines checked

Patch 17/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

18/27 Checking commit b6b8168b9fe7 (vhost: add vhost_vring_poll_rcu)
ERROR: memory barrier without comment
#37: FILE: hw/virtio/vhost-sw-lm-ring.c:98:
+    smp_rmb();

total: 1 errors, 0 warnings, 38 lines checked

Patch 18/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

19/27 Checking commit 806db46e9194 (vhost: add vhost_vring_get_buf_rcu)
20/27 Checking commit d5054c8556bf (vhost: Return used buffers)
21/27 Checking commit 6e5f2192254a (vhost: Add vhost_virtqueue_memory_unmap)
22/27 Checking commit c999e86cf7f0 (vhost: Add vhost_virtqueue_memory_map)
23/27 Checking commit 74f282a80019 (vhost: unmap qemu's shadow virtqueues on sw live migration)
24/27 Checking commit b68d3d5fb839 (vhost: iommu changes)
25/27 Checking commit 7f2955b8e788 (vhost: Do not commit vhost used idx on vhost_virtqueue_stop)
WARNING: Block comments use a leading /* on a separate line
#40: FILE: hw/virtio/vhost.c:1442:
+        /* Connection to the backend is unusable, so let's sync internal

total: 0 errors, 1 warnings, 27 lines checked

Patch 25/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
26/27 Checking commit 405925c6b54a (vhost: Add vhost_hdev_can_sw_lm)
27/27 Checking commit af2fe2219d39 (vhost: forbid vhost devices logging)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201120185105.279030-1-eperezma@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop
  2020-11-20 18:51 ` [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop Eugenio Pérez
@ 2020-11-20 19:35   ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-11-20 19:35 UTC (permalink / raw)
  To: qemu-level
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 20, 2020 at 8:12 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> ... if sw lm is enabled
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index cb44b9997f..cf000b979f 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1424,17 +1424,22 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
>      struct vhost_vring_state state = {
>          .index = vhost_vq_index,
>      };
> -    int r;
> +    int r = -1;
>
>      if (virtio_queue_get_desc_addr(vdev, idx) == 0) {
>          /* Don't stop the virtqueue which might have not been started */
>          return;
>      }
>
> -    r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
> -    if (r < 0) {
> -        VHOST_OPS_DEBUG("vhost VQ %u ring restore failed: %d", idx, r);
> -        /* Connection to the backend is broken, so let's sync internal
> +    if (!dev->sw_lm_enabled) {
> +        r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
> +        if (r < 0) {
> +            VHOST_OPS_DEBUG("vhost VQ %u ring restore failed: %d", idx, r);
> +        }
> +    }
> +
> +    if (!dev->sw_lm_enabled || r < 0) {

This condition should actually test for `dev->sw_lm_enabled`, not for its
negation.

> +        /* Connection to the backend is unusable, so let's sync internal
>           * last avail idx to the device used idx.
>           */
>          virtio_queue_restore_last_avail_idx(vdev, idx);
> --
> 2.18.4
>
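
The corrected condition from the inline comment above can be modelled in
isolation. This is only a sketch under stated assumptions: `struct stop_ctx`
and `virtqueue_stop_sketch` are hypothetical stand-ins for the QEMU state
and for vhost_virtqueue_stop(), and the backend call is replaced by a
preset return value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state vhost_virtqueue_stop() touches. */
struct stop_ctx {
    bool sw_lm_enabled;
    int get_vring_base_ret;   /* what the backend call would return */
    bool backend_queried;     /* set if the backend was asked */
    bool avail_idx_restored;  /* set if last avail idx was re-synced */
};

static void virtqueue_stop_sketch(struct stop_ctx *c)
{
    int r = 0;

    if (!c->sw_lm_enabled) {
        /* Models dev->vhost_ops->vhost_get_vring_base(dev, &state). */
        c->backend_queried = true;
        r = c->get_vring_base_ret;
    }

    if (c->sw_lm_enabled || r < 0) {
        /* Models virtio_queue_restore_last_avail_idx(vdev, idx). */
        c->avail_idx_restored = true;
    }
}
```

With SW LM enabled the backend is never queried and the last avail idx is
always re-synced. Without it, the re-sync only happens when the backend
call fails.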



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (28 preceding siblings ...)
  2020-11-20 19:30 ` no-reply
@ 2020-11-25  7:08 ` Jason Wang
  2020-11-25 12:03   ` Eugenio Perez Martin
  2020-11-27 15:44 ` Stefano Garzarella
  2020-12-08  9:37 ` Stefan Hajnoczi
  31 siblings, 1 reply; 81+ messages in thread
From: Jason Wang @ 2020-11-25  7:08 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: kvm, Michael S. Tsirkin, Daniel Daly, virtualization, Liran Alon,
	Eli Cohen, Nitin Shrivastav, Alex Barba, Christophe Fontaine,
	Juan Quintela, Lee Ballard, Lars Ganrot, Rob Miller,
	Stefano Garzarella, Howard Cai, Parav Pandit, vm, Salil Mehta,
	Stephen Finucane, Xiao W Wang, Sean Mooney, Stefan Hajnoczi,
	Jim Harford, Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand,
	Michael Lilja, Max Gurtovoy


On 2020/11/21 上午2:50, Eugenio Pérez wrote:
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net devices, which already
> have dirty page logging capabilities. None of the changes have actual
> effect with current devices until last two patches (26 and 27) are
> applied, but they can be rebased on top of any other. These checks the
> device to meet all requirements, and disable vhost-net devices logging
> so migration goes through SW LM. This last patch is not meant to be
> applied in the final revision, it is in the series just for testing
> purposes.
>
> For use SW assisted LM these vhost-net devices need to be instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)


So a question is at what level we want to implement qemu assisted
live migration. To me it could be done at two levels:

1) a generic vhost level, which makes it work for vhost-net/vhost-user
and vhost-vDPA alike
2) a specific type of vhost

To me, a generic one looks better, but it would be much more
complicated. What I read from this series is that it is a vhost kernel
specific software assisted live migration, which is a good start.
Actually it may even have a real use case, e.g. it can save dirty bitmaps
for guests with large memory. But we need to address the above
limitations first.

So I would like to know the reason for mandating iommu platform
and ats. And I think we need to fix the case of event idx support.


>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 9, and starting migration. Partial applies
> between 13 and 24 will not work while migrating on source, and patch
> 25 is needed for the destination to resume network activity.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of


Actually we're better than that, since there's no need for tricks like a
hardcoded IOVA for the mediated (shadow) virtqueue.


> DPDK's https://patchwork.dpdk.org/cover/48370/ .


I notice that you do GPA->VA translations and try to establish a VA->VA
(use VA as IOVA) mapping via device IOTLB. This shortcut should work for
vhost-kernel/user but not vhost-vDPA. The reason is that there's no
guarantee that the whole 64-bit address range can be used as IOVA. One
example is a hardware IOMMU like Intel's, which usually has 47 or 52
bits of address width.

So we probably need an IOVA allocator that makes sure the IOVAs do not
overlap and fit the range from [1]. We can probably build the IOVA for
guest VA via memory listeners. Then we have

1) IOVA for GPA
2) IOVA for shadow VQ

And advertise the IOVA to VA mapping to vhost.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
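
To make the allocator idea above concrete, here is a minimal sketch, not
QEMU code. It is a simple bump allocator that hands out non-overlapping
ranges inside a usable IOVA window. All names (`struct iova_allocator`,
`iova_allocator_init`, `iova_alloc`) are hypothetical, and the window is
assumed to be an inclusive first/last pair obtained from the device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocator state: usable window, both ends inclusive. */
struct iova_allocator {
    uint64_t first;
    uint64_t last;
    uint64_t next;   /* lowest IOVA not handed out yet */
    bool full;       /* set once the window is exhausted */
};

static void iova_allocator_init(struct iova_allocator *a,
                                uint64_t first, uint64_t last)
{
    a->first = first;
    a->last = last;
    a->next = first;
    a->full = false;
}

/*
 * Reserve `size` bytes aligned to `align` (a power of two).
 * Returns true and stores the start in *iova on success, or false if
 * the request does not fit in what is left of the window.
 */
static bool iova_alloc(struct iova_allocator *a, uint64_t size,
                       uint64_t align, uint64_t *iova)
{
    uint64_t start, end;

    if (a->full || size == 0 || align == 0 || (align & (align - 1))) {
        return false;
    }

    start = (a->next + align - 1) & ~(align - 1);
    if (start < a->next) {
        return false;               /* alignment wrapped around */
    }

    end = start + (size - 1);       /* inclusive end of the range */
    if (end < start || end > a->last) {
        return false;               /* overflow or outside the window */
    }

    *iova = start;
    if (end == a->last) {
        a->full = true;             /* avoids overflowing next */
    } else {
        a->next = end + 1;
    }
    return true;
}
```

Both the guest memory regions and the shadow VQ would take their IOVAs
from the same allocator, so the two sets of mappings cannot overlap. A
real implementation would also need to free and reuse ranges, which a
bump allocator does not do.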


>
> Comments are welcome.
>
> Thanks!
>
> Eugenio Pérez (27):
>    vhost: Add vhost_dev_can_log
>    vhost: Add device callback in vhost_migration_log
>    vhost: Move log resize/put to vhost_dev_set_log
>    vhost: add vhost_kernel_set_vring_enable
>    vhost: Add hdev->dev.sw_lm_vq_handler
>    virtio: Add virtio_queue_get_used_notify_split
>    vhost: Route guest->host notification through qemu
>    vhost: Add a flag for software assisted Live Migration
>    vhost: Route host->guest notification through qemu
>    vhost: Allocate shadow vring
>    virtio: const-ify all virtio_tswap* functions
>    virtio: Add virtio_queue_full
>    vhost: Send buffers to device
>    virtio: Remove virtio_queue_get_used_notify_split
>    vhost: Do not invalidate signalled used
>    virtio: Expose virtqueue_alloc_element
>    vhost: add vhost_vring_set_notification_rcu
>    vhost: add vhost_vring_poll_rcu
>    vhost: add vhost_vring_get_buf_rcu
>    vhost: Return used buffers
>    vhost: Add vhost_virtqueue_memory_unmap
>    vhost: Add vhost_virtqueue_memory_map
>    vhost: unmap qemu's shadow virtqueues on sw live migration
>    vhost: iommu changes
>    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>    vhost: Add vhost_hdev_can_sw_lm
>    vhost: forbid vhost devices logging
>
>   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>   include/hw/virtio/vhost.h         |   5 +
>   include/hw/virtio/virtio-access.h |   8 +-
>   include/hw/virtio/virtio.h        |   4 +
>   hw/net/virtio-net.c               |  39 ++-
>   hw/virtio/vhost-backend.c         |  29 ++
>   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>   hw/virtio/virtio.c                |  18 +-
>   hw/virtio/meson.build             |   2 +-
>   10 files changed, 758 insertions(+), 85 deletions(-)
>   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>   create mode 100644 hw/virtio/vhost-sw-lm-ring.c


So this looks like a pretty huge patchset, and I'm trying to think of
ways to split it. An idea is to do this in two steps:

1) implement a shadow virtqueue mode for vhost first (w/o live
migration). Then we can test descriptor relay, IOVA allocation, etc.
2) add live migration support on top

And it looks to me that it's better to split the shadow virtqueue (virtio
driver part) into an independent file, and to use a generic name (w/o
"shadow") so it can be reused by other use cases as well.

Thoughts?



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-25  7:08 ` Jason Wang
@ 2020-11-25 12:03   ` Eugenio Perez Martin
  2020-11-25 12:14     ` Eugenio Perez Martin
  2020-11-26  3:07     ` Jason Wang
  0 siblings, 2 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-11-25 12:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm list, Michael S. Tsirkin, qemu-level, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/11/21 上午2:50, Eugenio Pérez wrote:
> > This series enable vDPA software assisted live migration for vhost-net
> > devices. This is a new method of vhost devices migration: Instead of
> > relay on vDPA device's dirty logging capability, SW assisted LM
> > intercepts dataplane, forwarding the descriptors between VM and device.
> >
> > In this migration mode, qemu offers a new vring to the device to
> > read and write into, and disable vhost notifiers, processing guest and
> > vhost notifications in qemu. On used buffer relay, qemu will mark the
> > dirty memory as with plain virtio-net devices. This way, devices does
> > not need to have dirty page logging capability.
> >
> > This series is a POC doing SW LM for vhost-net devices, which already
> > have dirty page logging capabilities. None of the changes have actual
> > effect with current devices until last two patches (26 and 27) are
> > applied, but they can be rebased on top of any other. These checks the
> > device to meet all requirements, and disable vhost-net devices logging
> > so migration goes through SW LM. This last patch is not meant to be
> > applied in the final revision, it is in the series just for testing
> > purposes.
> >
> > For use SW assisted LM these vhost-net devices need to be instantiated:
> > * With IOMMU (iommu_platform=on,ats=on)
> > * Without event_idx (event_idx=off)
>
>
> So a question is at what level do we want to implement qemu assisted
> live migration. To me it could be done at two levels:
>
> 1) generic vhost level which makes it work for both vhost-net/vhost-user
> and vhost-vDPA
> 2) a specific type of vhost
>
> To me, having a generic one looks better but it would be much more
> complicated. So what I read from this series is it was a vhost kernel
> specific software assisted live migration which is a good start.
> Actually it may even have real use case, e.g it can save dirty bitmaps
> for guest with large memory. But we need to address the above
> limitations first.
>
> So I would like to know what's the reason for mandating iommu platform
> and ats? And I think we need to fix case of event idx support.
>

There is no specific reason for mandating iommu & ats; it just
started that way.

I will extend the patch to support those cases too.

>
> >
> > Just the notification forwarding (with no descriptor relay) can be
> > achieved with patches 7 and 9, and starting migration. Partial applies
> > between 13 and 24 will not work while migrating on source, and patch
> > 25 is needed for the destination to resume network activity.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
>
>
> Actually we're better than that since there's no need the trick like
> hardcoded IOVA for mediated(shadow) virtqueue.
>
>
> > DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
>
> I notice that you do GPA->VA translations and try to establish a VA->VA
> (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
> vhost-kernel/user but not vhost-vDPA. The reason is that there's no
> guarantee that the whole 64bit address range could be used as IOVA. One
> example is that for hardware IOMMU like intel, it usually has 47 or 52
> bits of address width.
>
> So we probably need an IOVA allocator that can make sure the IOVA is not
> overlapped and fit for [1]. We can probably build the IOVA for guest VA
> via memory listeners. Then we have
>
> 1) IOVA for GPA
> 2) IOVA for shadow VQ
>
> And advertise IOVA to VA mapping to vhost.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
>

Got it, I will take that into account too.

Maybe for vhost-net we could directly send iotlb miss for [0,~0ULL].

>
> >
> > Comments are welcome.
> >
> > Thanks!
> >
> > Eugenio Pérez (27):
> >    vhost: Add vhost_dev_can_log
> >    vhost: Add device callback in vhost_migration_log
> >    vhost: Move log resize/put to vhost_dev_set_log
> >    vhost: add vhost_kernel_set_vring_enable
> >    vhost: Add hdev->dev.sw_lm_vq_handler
> >    virtio: Add virtio_queue_get_used_notify_split
> >    vhost: Route guest->host notification through qemu
> >    vhost: Add a flag for software assisted Live Migration
> >    vhost: Route host->guest notification through qemu
> >    vhost: Allocate shadow vring
> >    virtio: const-ify all virtio_tswap* functions
> >    virtio: Add virtio_queue_full
> >    vhost: Send buffers to device
> >    virtio: Remove virtio_queue_get_used_notify_split
> >    vhost: Do not invalidate signalled used
> >    virtio: Expose virtqueue_alloc_element
> >    vhost: add vhost_vring_set_notification_rcu
> >    vhost: add vhost_vring_poll_rcu
> >    vhost: add vhost_vring_get_buf_rcu
> >    vhost: Return used buffers
> >    vhost: Add vhost_virtqueue_memory_unmap
> >    vhost: Add vhost_virtqueue_memory_map
> >    vhost: unmap qemu's shadow virtqueues on sw live migration
> >    vhost: iommu changes
> >    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
> >    vhost: Add vhost_hdev_can_sw_lm
> >    vhost: forbid vhost devices logging
> >
> >   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
> >   include/hw/virtio/vhost.h         |   5 +
> >   include/hw/virtio/virtio-access.h |   8 +-
> >   include/hw/virtio/virtio.h        |   4 +
> >   hw/net/virtio-net.c               |  39 ++-
> >   hw/virtio/vhost-backend.c         |  29 ++
> >   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
> >   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
> >   hw/virtio/virtio.c                |  18 +-
> >   hw/virtio/meson.build             |   2 +-
> >   10 files changed, 758 insertions(+), 85 deletions(-)
> >   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
> >   create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>
>
> So this looks like a pretty huge patchset which I'm trying to think of
> ways to split. An idea is to do this is two steps
>
> 1) implement a shadow virtqueue mode for vhost first (w/o live
> migration). Then we can test descriptors relay, IOVA allocating, etc.

How would that mode be activated if it is not tied to live migration?
New backend/command line switch?

Maybe it is better to also start with no iommu & ats support and add it on top.

> 2) add live migration support on top
>
> And it looks to me it's better to split the shadow virtqueue (virtio
> driver part) into an independent file. And use generic name (w/o
> "shadow") in order to be reused by other use cases as well.
>

I think the same.

Thanks!

> Thoughts?
>




* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-25 12:03   ` Eugenio Perez Martin
@ 2020-11-25 12:14     ` Eugenio Perez Martin
  2020-11-26  3:07     ` Jason Wang
  1 sibling, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-11-25 12:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm list, Michael S. Tsirkin, qemu-level, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Wed, Nov 25, 2020 at 1:03 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
> > > This series enable vDPA software assisted live migration for vhost-net
> > > devices. This is a new method of vhost devices migration: Instead of
> > > relay on vDPA device's dirty logging capability, SW assisted LM
> > > intercepts dataplane, forwarding the descriptors between VM and device.
> > >
> > > In this migration mode, qemu offers a new vring to the device to
> > > read and write into, and disable vhost notifiers, processing guest and
> > > vhost notifications in qemu. On used buffer relay, qemu will mark the
> > > dirty memory as with plain virtio-net devices. This way, devices does
> > > not need to have dirty page logging capability.
> > >
> > > This series is a POC doing SW LM for vhost-net devices, which already
> > > have dirty page logging capabilities. None of the changes have actual
> > > effect with current devices until last two patches (26 and 27) are
> > > applied, but they can be rebased on top of any other. These checks the
> > > device to meet all requirements, and disable vhost-net devices logging
> > > so migration goes through SW LM. This last patch is not meant to be
> > > applied in the final revision, it is in the series just for testing
> > > purposes.
> > >
> > > For use SW assisted LM these vhost-net devices need to be instantiated:
> > > * With IOMMU (iommu_platform=on,ats=on)
> > > * Without event_idx (event_idx=off)
> >
> >
> > So a question is at what level do we want to implement qemu assisted
> > live migration. To me it could be done at two levels:
> >
> > 1) generic vhost level which makes it work for both vhost-net/vhost-user
> > and vhost-vDPA
> > 2) a specific type of vhost
> >
> > To me, having a generic one looks better but it would be much more
> > complicated. So what I read from this series is it was a vhost kernel
> > specific software assisted live migration which is a good start.
> > Actually it may even have real use case, e.g it can save dirty bitmaps
> > for guest with large memory. But we need to address the above
> > limitations first.
> >
> > So I would like to know what's the reason for mandating iommu platform
> > and ats? And I think we need to fix case of event idx support.
> >
>
> There is no specific reason for mandating iommu & ats, it was just
> started that way.
>
> I will extend the patch to support those cases too.
>
> >
> > >
> > > Just the notification forwarding (with no descriptor relay) can be
> > > achieved with patches 7 and 9, and starting migration. Partial applies
> > > between 13 and 24 will not work while migrating on source, and patch
> > > 25 is needed for the destination to resume network activity.
> > >
> > > It is based on the ideas of DPDK SW assisted LM, in the series of
> >
> >
> > Actually we're better than that since there's no need the trick like
> > hardcoded IOVA for mediated(shadow) virtqueue.
> >
> >
> > > DPDK's https://patchwork.dpdk.org/cover/48370/ .
> >
> >
> > I notice that you do GPA->VA translations and try to establish a VA->VA
> > (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
> > vhost-kernel/user but not vhost-vDPA. The reason is that there's no
> > guarantee that the whole 64bit address range could be used as IOVA. One
> > example is that for hardware IOMMU like intel, it usually has 47 or 52
> > bits of address width.
> >
> > So we probably need an IOVA allocator that can make sure the IOVA is not
> > overlapped and fit for [1]. We can probably build the IOVA for guest VA
> > via memory listeners. Then we have
> >
> > 1) IOVA for GPA
> > 2) IOVA for shadow VQ
> >
> > And advertise IOVA to VA mapping to vhost.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
> >
>
> Got it, will control it too.
>
> Maybe for vhost-net we could directly send iotlb miss for [0,~0ULL].
>

Sorry, this was intended to be a question :).

Given the usable IOVA range of a vhost-* device, is it ok to expose all
of the qemu VA that overlaps it? Through iotlb misses, for example. Would
that be acceptable from a security point of view? The device would have
access to all of qemu's VA, but on the other hand devices like vhost-net
already have it.

> >
> > >
> > > Comments are welcome.
> > >
> > > Thanks!
> > >
> > > Eugenio Pérez (27):
> > >    vhost: Add vhost_dev_can_log
> > >    vhost: Add device callback in vhost_migration_log
> > >    vhost: Move log resize/put to vhost_dev_set_log
> > >    vhost: add vhost_kernel_set_vring_enable
> > >    vhost: Add hdev->dev.sw_lm_vq_handler
> > >    virtio: Add virtio_queue_get_used_notify_split
> > >    vhost: Route guest->host notification through qemu
> > >    vhost: Add a flag for software assisted Live Migration
> > >    vhost: Route host->guest notification through qemu
> > >    vhost: Allocate shadow vring
> > >    virtio: const-ify all virtio_tswap* functions
> > >    virtio: Add virtio_queue_full
> > >    vhost: Send buffers to device
> > >    virtio: Remove virtio_queue_get_used_notify_split
> > >    vhost: Do not invalidate signalled used
> > >    virtio: Expose virtqueue_alloc_element
> > >    vhost: add vhost_vring_set_notification_rcu
> > >    vhost: add vhost_vring_poll_rcu
> > >    vhost: add vhost_vring_get_buf_rcu
> > >    vhost: Return used buffers
> > >    vhost: Add vhost_virtqueue_memory_unmap
> > >    vhost: Add vhost_virtqueue_memory_map
> > >    vhost: unmap qemu's shadow virtqueues on sw live migration
> > >    vhost: iommu changes
> > >    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
> > >    vhost: Add vhost_hdev_can_sw_lm
> > >    vhost: forbid vhost devices logging
> > >
> > >   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
> > >   include/hw/virtio/vhost.h         |   5 +
> > >   include/hw/virtio/virtio-access.h |   8 +-
> > >   include/hw/virtio/virtio.h        |   4 +
> > >   hw/net/virtio-net.c               |  39 ++-
> > >   hw/virtio/vhost-backend.c         |  29 ++
> > >   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
> > >   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
> > >   hw/virtio/virtio.c                |  18 +-
> > >   hw/virtio/meson.build             |   2 +-
> > >   10 files changed, 758 insertions(+), 85 deletions(-)
> > >   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
> > >   create mode 100644 hw/virtio/vhost-sw-lm-ring.c
> >
> >
> > So this looks like a pretty huge patchset which I'm trying to think of
> > ways to split. An idea is to do this is two steps
> >
> > 1) implement a shadow virtqueue mode for vhost first (w/o live
> > migration). Then we can test descriptors relay, IOVA allocating, etc.
>
> How would that mode be activated if it is not tied to live migration?
> New backend/command line switch?
>
> Maybe it is better to also start with no iommu & ats support and add it on top.
>
> > 2) add live migration support on top
> >
> > And it looks to me it's better to split the shadow virtqueue (virtio
> > driver part) into an independent file. And use generic name (w/o
> > "shadow") in order to be reused by other use cases as well.
> >
>
> I think the same.
>
> Thanks!
>
> > Thoughts?
> >




* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-25 12:03   ` Eugenio Perez Martin
  2020-11-25 12:14     ` Eugenio Perez Martin
@ 2020-11-26  3:07     ` Jason Wang
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Wang @ 2020-11-26  3:07 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, qemu-level, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Rob Miller, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Alex Barba, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Jim Harford, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Stephen Finucane, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy


On 2020/11/25 8:03 PM, Eugenio Perez Martin wrote:
> On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
>>> This series enable vDPA software assisted live migration for vhost-net
>>> devices. This is a new method of vhost devices migration: Instead of
>>> relay on vDPA device's dirty logging capability, SW assisted LM
>>> intercepts dataplane, forwarding the descriptors between VM and device.
>>>
>>> In this migration mode, qemu offers a new vring to the device to
>>> read and write into, and disable vhost notifiers, processing guest and
>>> vhost notifications in qemu. On used buffer relay, qemu will mark the
>>> dirty memory as with plain virtio-net devices. This way, devices does
>>> not need to have dirty page logging capability.
>>>
>>> This series is a POC doing SW LM for vhost-net devices, which already
>>> have dirty page logging capabilities. None of the changes have actual
>>> effect with current devices until last two patches (26 and 27) are
>>> applied, but they can be rebased on top of any other. These checks the
>>> device to meet all requirements, and disable vhost-net devices logging
>>> so migration goes through SW LM. This last patch is not meant to be
>>> applied in the final revision, it is in the series just for testing
>>> purposes.
>>>
>>> For use SW assisted LM these vhost-net devices need to be instantiated:
>>> * With IOMMU (iommu_platform=on,ats=on)
>>> * Without event_idx (event_idx=off)
>>
>> So a question is at what level do we want to implement qemu assisted
>> live migration. To me it could be done at two levels:
>>
>> 1) generic vhost level which makes it work for both vhost-net/vhost-user
>> and vhost-vDPA
>> 2) a specific type of vhost
>>
>> To me, having a generic one looks better but it would be much more
>> complicated. So what I read from this series is it was a vhost kernel
>> specific software assisted live migration which is a good start.
>> Actually it may even have real use case, e.g it can save dirty bitmaps
>> for guest with large memory. But we need to address the above
>> limitations first.
>>
>> So I would like to know what's the reason for mandating iommu platform
>> and ats? And I think we need to fix case of event idx support.
>>
> There is no specific reason for mandating iommu & ats, it was just
> started that way.
>
> I will extend the patch to support those cases too.
>
>>> Just the notification forwarding (with no descriptor relay) can be
>>> achieved with patches 7 and 9, and starting migration. Partial applies
>>> between 13 and 24 will not work while migrating on source, and patch
>>> 25 is needed for the destination to resume network activity.
>>>
>>> It is based on the ideas of DPDK SW assisted LM, in the series of
>>
>> Actually we're better than that since there's no need the trick like
>> hardcoded IOVA for mediated(shadow) virtqueue.
>>
>>
>>> DPDK's https://patchwork.dpdk.org/cover/48370/ .
>>
>> I notice that you do GPA->VA translations and try to establish a VA->VA
>> (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
>> vhost-kernel/user but not vhost-vDPA. The reason is that there's no
>> guarantee that the whole 64bit address range could be used as IOVA. One
>> example is that for hardware IOMMU like intel, it usually has 47 or 52
>> bits of address width.
>>
>> So we probably need an IOVA allocator that can make sure the IOVA is not
>> overlapped and fit for [1]. We can probably build the IOVA for guest VA
>> via memory listeners. Then we have
>>
>> 1) IOVA for GPA
>> 2) IOVA for shadow VQ
>>
>> And advertise IOVA to VA mapping to vhost.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
>>
> Got it, will control it too.
>
> Maybe for vhost-net we could directly send iotlb miss for [0,~0ULL].


It works but it means vhost-net needs some special care. To me a generic 
IOVA allocator looks better.
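
To make the idea concrete, here is a minimal sketch of such an allocator.
Everything in it is an assumption for illustration: the IovaAllocator type
and the iova_* functions are hypothetical names, not existing QEMU or
kernel APIs, and a simple bump allocator stands in for whatever range
tracking a real implementation would need (this one never frees). It only
shows the two properties discussed above: allocations never overlap, and
they stay inside the address width the IOMMU can actually translate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bump allocator over the IOVA window a device can use. */
typedef struct IovaAllocator {
    uint64_t next; /* first IOVA that has not been handed out yet */
    uint64_t last; /* last usable IOVA, inclusive */
} IovaAllocator;

/*
 * Limit the window to [first, 2^addr_bits - 1]. addr_bits is the IOMMU
 * address width (e.g. 47); this sketch accepts 1..63 so that the
 * arithmetic below cannot overflow 64 bits.
 */
bool iova_allocator_init(IovaAllocator *a, uint64_t first, unsigned addr_bits)
{
    if (addr_bits == 0 || addr_bits > 63) {
        return false;
    }
    uint64_t last = (UINT64_C(1) << addr_bits) - 1;
    if (first > last) {
        return false;
    }
    a->next = first;
    a->last = last;
    return true;
}

/*
 * Reserve size bytes aligned to align (a power of two). On success the
 * start of the range is stored in *iova; ranges never overlap because
 * next only moves forward.
 */
bool iova_alloc(IovaAllocator *a, uint64_t size, uint64_t align,
                uint64_t *iova)
{
    if (size == 0 || align == 0 || (align & (align - 1)) != 0) {
        return false;
    }
    uint64_t start = (a->next + align - 1) & ~(align - 1);
    if (start > a->last || size - 1 > a->last - start) {
        return false; /* does not fit in the usable window */
    }
    *iova = start;
    a->next = start + size;
    return true;
}
```

With something like this, the guest memory regions reported by the memory
listeners and the shadow VQ rings would each get their own range from the
same allocator, and only the resulting IOVA to VA mappings would be sent
to vhost.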


>
>>> Comments are welcome.
>>>
>>> Thanks!
>>>
>>> Eugenio Pérez (27):
>>>     vhost: Add vhost_dev_can_log
>>>     vhost: Add device callback in vhost_migration_log
>>>     vhost: Move log resize/put to vhost_dev_set_log
>>>     vhost: add vhost_kernel_set_vring_enable
>>>     vhost: Add hdev->dev.sw_lm_vq_handler
>>>     virtio: Add virtio_queue_get_used_notify_split
>>>     vhost: Route guest->host notification through qemu
>>>     vhost: Add a flag for software assisted Live Migration
>>>     vhost: Route host->guest notification through qemu
>>>     vhost: Allocate shadow vring
>>>     virtio: const-ify all virtio_tswap* functions
>>>     virtio: Add virtio_queue_full
>>>     vhost: Send buffers to device
>>>     virtio: Remove virtio_queue_get_used_notify_split
>>>     vhost: Do not invalidate signalled used
>>>     virtio: Expose virtqueue_alloc_element
>>>     vhost: add vhost_vring_set_notification_rcu
>>>     vhost: add vhost_vring_poll_rcu
>>>     vhost: add vhost_vring_get_buf_rcu
>>>     vhost: Return used buffers
>>>     vhost: Add vhost_virtqueue_memory_unmap
>>>     vhost: Add vhost_virtqueue_memory_map
>>>     vhost: unmap qemu's shadow virtqueues on sw live migration
>>>     vhost: iommu changes
>>>     vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>>>     vhost: Add vhost_hdev_can_sw_lm
>>>     vhost: forbid vhost devices logging
>>>
>>>    hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>>>    include/hw/virtio/vhost.h         |   5 +
>>>    include/hw/virtio/virtio-access.h |   8 +-
>>>    include/hw/virtio/virtio.h        |   4 +
>>>    hw/net/virtio-net.c               |  39 ++-
>>>    hw/virtio/vhost-backend.c         |  29 ++
>>>    hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>>>    hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>>>    hw/virtio/virtio.c                |  18 +-
>>>    hw/virtio/meson.build             |   2 +-
>>>    10 files changed, 758 insertions(+), 85 deletions(-)
>>>    create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>>>    create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>>
>> So this looks like a pretty huge patchset which I'm trying to think of
>> ways to split. An idea is to do this is two steps
>>
>> 1) implement a shadow virtqueue mode for vhost first (w/o live
>> migration). Then we can test descriptors relay, IOVA allocating, etc.
> How would that mode be activated if it is not tied to live migration?
> New backend/command line switch?


Either a new cli option or even a qmp command can work.


>
> Maybe it is better to also start with no iommu & ats support and add it on top.


Yes.


>
>> 2) add live migration support on top
>>
>> And it looks to me it's better to split the shadow virtqueue (virtio
>> driver part) into an independent file. And use generic name (w/o
>> "shadow") in order to be reused by other use cases as well.
>>
> I think the same.
>
> Thanks!
>
>> Thoughts?
>>
>




* Re: [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration
  2020-11-20 18:51 ` [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration Eugenio Pérez
@ 2020-11-27 15:29   ` Stefano Garzarella
  2020-11-30  7:54     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefano Garzarella @ 2020-11-27 15:29 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 20, 2020 at 07:51:01PM +0100, Eugenio Pérez wrote:
>Since vhost does not need to access it, it has no sense to keep it
>mapped.
>
>Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>---
> hw/virtio/vhost.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>index f640d4edf0..eebfac4455 100644
>--- a/hw/virtio/vhost.c
>+++ b/hw/virtio/vhost.c
>@@ -1124,6 +1124,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>
>         dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
>         event_notifier_set_handler(&vq->masked_notifier, vhost_handle_call);
>+        vhost_virtqueue_memory_unmap(dev, &dev->vqs[idx], true);

IIUC vhost_virtqueue_memory_unmap() is already called at the end of 
vhost_virtqueue_stop(), so we can skip this call, right?

>
>         vhost_vring_write_addr(dev->sw_lm_shadow_vq[idx], &addr);
>         r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
>-- 2.18.4
>




* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (29 preceding siblings ...)
  2020-11-25  7:08 ` Jason Wang
@ 2020-11-27 15:44 ` Stefano Garzarella
  2020-12-08  9:37 ` Stefan Hajnoczi
  31 siblings, 0 replies; 81+ messages in thread
From: Stefano Garzarella @ 2020-11-27 15:44 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Howard Cai, Parav Pandit, vm,
	Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
>This series enable vDPA software assisted live migration for vhost-net
>devices. This is a new method of vhost devices migration: Instead of
>relay on vDPA device's dirty logging capability, SW assisted LM
>intercepts dataplane, forwarding the descriptors between VM and device.
>
>In this migration mode, qemu offers a new vring to the device to
>read and write into, and disable vhost notifiers, processing guest and
>vhost notifications in qemu. On used buffer relay, qemu will mark the
>dirty memory as with plain virtio-net devices. This way, devices does
>not need to have dirty page logging capability.
>
>This series is a POC doing SW LM for vhost-net devices, which already
>have dirty page logging capabilities. None of the changes have actual
>effect with current devices until last two patches (26 and 27) are
>applied, but they can be rebased on top of any other. These checks the
>device to meet all requirements, and disable vhost-net devices logging
>so migration goes through SW LM. This last patch is not meant to be
>applied in the final revision, it is in the series just for testing
>purposes.
>
>For use SW assisted LM these vhost-net devices need to be instantiated:
>* With IOMMU (iommu_platform=on,ats=on)
>* Without event_idx (event_idx=off)
>
>Just the notification forwarding (with no descriptor relay) can be
>achieved with patches 7 and 9, and starting migration. Partial applies
>between 13 and 24 will not work while migrating on source, and patch
>25 is needed for the destination to resume network activity.
>
>It is based on the ideas of DPDK SW assisted LM, in the series of
>DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
>Comments are welcome.

Hi Eugenio,
I took a look, and I think the shadow queue idea is the right way to go.
It's very similar to what Stefan and I had in mind for io_uring
passthrough and vdpa-blk.

IIUC, when the migration starts, the notifications from the guest to
vhost are disabled, so QEMU starts to intercept them through the
custom_handler installed in virtio-net (we need to understand how to
generalize this).
At this point QEMU starts to use the shadow queues and exposes them to
vhost.
The opposite is done for vhost to guest notifications, where
vhost_handle_call is installed on masked_notifier to intercept the
notification.
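
The guest to vhost half of that flow can be modelled as a handler swap.
The following is only a toy sketch of my understanding: DemoQueue and the
demo_* functions are invented for illustration and are not QEMU APIs, and
counters stand in for the real work (kicking the device directly, or
copying descriptors into the shadow queue and kicking vhost from QEMU).

```c
#include <stdbool.h>

typedef struct DemoQueue DemoQueue;
typedef void (*DemoKickHandler)(DemoQueue *q);

struct DemoQueue {
    DemoKickHandler on_guest_kick; /* who reacts to a guest notification */
    unsigned device_kicks;         /* kicks that reached the device directly */
    unsigned relayed_kicks;        /* kicks intercepted and relayed by the VMM */
};

/* Normal path: the device is notified without the VMM being involved. */
void demo_direct_kick(DemoQueue *q)
{
    q->device_kicks++;
}

/* Migration path: the VMM sees the kick first and relays the buffers. */
void demo_intercepted_kick(DemoQueue *q)
{
    q->relayed_kicks++;
}

/* Switch between the two paths, e.g. when migration starts or ends. */
void demo_set_interception(DemoQueue *q, bool on)
{
    q->on_guest_kick = on ? demo_intercepted_kick : demo_direct_kick;
}

/* Deliver one guest notification to whichever handler is installed. */
void demo_guest_kick(DemoQueue *q)
{
    q->on_guest_kick(q);
}
```

The vhost to guest direction would be the mirror image, with the call
notification handled in QEMU before the guest is interrupted.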

I hope to give better feedback when I get a complete overview ;-)

Anyway, as Jason suggested, we should split this series, so maybe we can
merge some preparation patches (e.g. 1, 11, 21, 22) regardless of the
other patches.

Thanks,
Stefano




* Re: [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration
  2020-11-27 15:29   ` Stefano Garzarella
@ 2020-11-30  7:54     ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-11-30  7:54 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Howard Cai, Parav Pandit,
	vm, Salil Mehta, Stephen Finucane, Xiao W Wang, Sean Mooney,
	Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev, Siwei Liu,
	Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 27, 2020 at 4:29 PM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:51:01PM +0100, Eugenio Pérez wrote:
> >Since vhost does not need to access it, it has no sense to keep it
> >mapped.
> >
> >Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >---
> > hw/virtio/vhost.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index f640d4edf0..eebfac4455 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -1124,6 +1124,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >
> >         dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
> >         event_notifier_set_handler(&vq->masked_notifier, vhost_handle_call);
> >+        vhost_virtqueue_memory_unmap(dev, &dev->vqs[idx], true);
>
> IIUC vhost_virtqueue_memory_unmap() is already called at the end of
> vhost_virtqueue_stop(), so we can skip this call, right?
>

You are totally right Stefano, thanks for the catch!

> >
> >         vhost_vring_write_addr(dev->sw_lm_shadow_vq[idx], &addr);
> >         r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> >-- 2.18.4
> >
>




* Re: [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log
  2020-11-20 18:50 ` [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log Eugenio Pérez
@ 2020-12-07 16:19   ` Stefan Hajnoczi
  2020-12-09 12:20     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 16:19 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 20, 2020 at 07:50:40PM +0100, Eugenio Pérez wrote:
> This allows code to reuse the logic to not to re-enable or re-disable
> migration mechanisms. Code works the same way as before.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 2bd8cdf893..2adb2718c1 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -862,7 +862,9 @@ err_features:
>      return r;
>  }
>  
> -static int vhost_migration_log(MemoryListener *listener, bool enable)
> +static int vhost_migration_log(MemoryListener *listener,
> +                               bool enable,
> +                               int (*device_cb)(struct vhost_dev *, bool))

Please document the argument. What is the callback function supposed to
do ("device_cb" is not descriptive so I'm not sure)?


* Re: [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable
  2020-11-20 18:50 ` [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
@ 2020-12-07 16:43   ` Stefan Hajnoczi
  2020-12-09 12:00     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 16:43 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Fri, Nov 20, 2020 at 07:50:42PM +0100, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> index 222bbcc62d..317f1f96fa 100644
> --- a/hw/virtio/vhost-backend.c
> +++ b/hw/virtio/vhost-backend.c
> @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
>      return idx - dev->vq_index;
>  }
>  
> +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
> +                                      bool enable)
> +{
> +    struct vhost_vring_file file = {
> +        .index = idx,
> +    };
> +
> +    if (!enable) {
> +        file.fd = -1; /* Pass -1 to unbind from file. */
> +    } else {
> +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
> +        file.fd = vn_dev->backend;
> +    }
> +
> +    return vhost_kernel_net_set_backend(dev, &file);

This is vhost-net specific even though the function appears to be
generic. Is there a plan to extend this to all devices?

> +}
> +
> +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < dev->nvqs; ++i) {
> +        vhost_kernel_set_vq_enable(dev, i, enable);
> +    }
> +
> +    return 0;
> +}

I suggest exposing the per-vq interface (vhost_kernel_set_vq_enable())
in VhostOps so it follows the ioctl interface.
vhost_kernel_set_vring_enable() can be moved to vhost.c, where it can
loop over all vqs if callers find that convenient.
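
A minimal sketch of that split, assuming simplified stand-in types: the
names sketch_ops, sketch_dev and SKETCH_MAX_VQS are hypothetical and are
not QEMU's VhostOps or vhost_dev. The backend exposes only a per-vq
operation and the generic layer owns the loop, stopping at the first
error instead of ignoring it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_VQS 8 /* hypothetical bound, only for this sketch */

struct sketch_dev;

/* Hypothetical stand-in for a backend ops table. */
struct sketch_ops {
    /* Per-virtqueue operation, mirroring a per-vq ioctl. */
    int (*set_vq_enable)(struct sketch_dev *dev, unsigned idx, bool enable);
};

/* Hypothetical stand-in for the generic device state. */
struct sketch_dev {
    const struct sketch_ops *ops;
    unsigned nvqs;
    bool enabled[SKETCH_MAX_VQS];
};

/* Generic helper: loop over all vqs and return the first error. */
int sketch_dev_set_vring_enable(struct sketch_dev *dev, bool enable)
{
    for (unsigned i = 0; i < dev->nvqs; i++) {
        int r = dev->ops->set_vq_enable(dev, i, enable);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

/* Example backend implementation of the per-vq operation. */
static int sketch_backend_set_vq_enable(struct sketch_dev *dev, unsigned idx,
                                        bool enable)
{
    if (idx >= dev->nvqs || idx >= SKETCH_MAX_VQS) {
        return -1;
    }
    dev->enabled[idx] = enable;
    return 0;
}

static const struct sketch_ops sketch_backend_ops = {
    .set_vq_enable = sketch_backend_set_vq_enable,
};
```

With this shape a backend only implements the per-vq callback, and
callers that want the whole device toggled use the generic helper.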

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler
  2020-11-20 18:50 ` [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler Eugenio Pérez
@ 2020-12-07 16:52   ` Stefan Hajnoczi
  2020-12-09 15:02     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 16:52 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1710 bytes --]

On Fri, Nov 20, 2020 at 07:50:43PM +0100, Eugenio Pérez wrote:
> Only virtio-net honors it.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  include/hw/virtio/vhost.h |  1 +
>  hw/net/virtio-net.c       | 39 ++++++++++++++++++++++++++++-----------
>  2 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 4a8bc75415..b5b7496537 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -83,6 +83,7 @@ struct vhost_dev {
>      bool started;
>      bool log_enabled;
>      uint64_t log_size;
> +    VirtIOHandleOutput sw_lm_vq_handler;

sw == software?
lm == live migration?

Maybe there is a name that is clearer. What are these virtqueues called?
Shadow vqs? Logged vqs?

Live migration is a feature that uses dirty memory logging, but other
features may use dirty memory logging too. The name should probably not
be associated with live migration.

>      Error *migration_blocker;
>      const VhostOps *vhost_ops;
>      void *opaque;
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 9179013ac4..9a69ae3598 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2628,24 +2628,32 @@ static void virtio_net_tx_bh(void *opaque)
>      }
>  }
>  
> -static void virtio_net_add_queue(VirtIONet *n, int index)
> +static void virtio_net_add_queue(VirtIONet *n, int index,
> +                                 VirtIOHandleOutput custom_handler)
>  {

We talked about the possibility of moving this into the generic vhost
code so that devices don't need to be modified. It would be nice to hide
this feature inside vhost.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2020-11-20 18:50 ` [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split Eugenio Pérez
@ 2020-12-07 16:58   ` Stefan Hajnoczi
  2021-01-12 18:21     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 16:58 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1313 bytes --]

On Fri, Nov 20, 2020 at 07:50:44PM +0100, Eugenio Pérez wrote:
> This function is just used for a few commits, so SW LM is developed
> incrementally, and it is deleted after it is useful.
> 
> For a few commits, only the events (irqfd, eventfd) are forwarded.

s/eventfd/ioeventfd/ (irqfd is also an eventfd)

> +bool virtio_queue_get_used_notify_split(VirtQueue *vq)
> +{
> +    VRingMemoryRegionCaches *caches;
> +    hwaddr pa = offsetof(VRingUsed, flags);
> +    uint16_t flags;
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    caches = vring_get_region_caches(vq);
> +    assert(caches);
> +    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
> +    return !(VRING_USED_F_NO_NOTIFY & flags);
> +}

QEMU stores the notification status:

void virtio_queue_set_notification(VirtQueue *vq, int enable)
{
    vq->notification = enable; <---- here

    if (!vq->vring.desc) {
        return;
    }

    if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
        virtio_queue_packed_set_notification(vq, enable);
    } else {
        virtio_queue_split_set_notification(vq, enable);

I'm wondering why it's necessary to fetch from guest RAM instead of
using vq->notification? It also works for both split and packed
queues so the code would be simpler.
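
A small sketch of the idea, assuming a split ring without event index
and using hypothetical stand-in names (sketch_vq, SKETCH_USED_F_NO_NOTIFY)
rather than the real VirtQueue: as long as every update goes through the
setter, the cached flag and the flag stored in the ring agree, so the
getter does not need to touch guest memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_USED_F_NO_NOTIFY 1u /* stand-in for VRING_USED_F_NO_NOTIFY */

/* Hypothetical stand-in for the relevant VirtQueue state. */
struct sketch_vq {
    bool notification;   /* value cached on the QEMU side */
    uint16_t used_flags; /* stand-in for the used ring flags in guest RAM */
};

/* Update the cache and the ring flags together. */
void sketch_set_notification(struct sketch_vq *vq, bool enable)
{
    vq->notification = enable;
    if (enable) {
        vq->used_flags = (uint16_t)(vq->used_flags & ~SKETCH_USED_F_NO_NOTIFY);
    } else {
        vq->used_flags = (uint16_t)(vq->used_flags | SKETCH_USED_F_NO_NOTIFY);
    }
}

/* Cheap getter: no guest memory access, no layout-specific code. */
bool sketch_get_notification(const struct sketch_vq *vq)
{
    return vq->notification;
}

/* What the posted helper does instead: decode the flag from the ring. */
bool sketch_get_notification_from_ring(const struct sketch_vq *vq)
{
    return !(vq->used_flags & SKETCH_USED_F_NO_NOTIFY);
}
```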

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 07/27] vhost: Route guest->host notification through qemu
  2020-11-20 18:50 ` [RFC PATCH 07/27] vhost: Route guest->host notification through qemu Eugenio Pérez
@ 2020-12-07 17:42   ` Stefan Hajnoczi
  2020-12-09 17:08     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 17:42 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 11448 bytes --]

On Fri, Nov 20, 2020 at 07:50:45PM +0100, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-sw-lm-ring.h |  26 +++++++++
>  include/hw/virtio/vhost.h    |   3 ++
>  hw/virtio/vhost-sw-lm-ring.c |  60 +++++++++++++++++++++
>  hw/virtio/vhost.c            | 100 +++++++++++++++++++++++++++++++++--
>  hw/virtio/meson.build        |   2 +-
>  5 files changed, 187 insertions(+), 4 deletions(-)
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.c
> 
> diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
> new file mode 100644
> index 0000000000..86dc081b93
> --- /dev/null
> +++ b/hw/virtio/vhost-sw-lm-ring.h
> @@ -0,0 +1,26 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2020
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef VHOST_SW_LM_RING_H
> +#define VHOST_SW_LM_RING_H
> +
> +#include "qemu/osdep.h"
> +
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/vhost.h"
> +
> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;

Here it's called a shadow virtqueue while the file calls it a
sw-lm-ring. Please use a single name.

> +
> +bool vhost_vring_kick(VhostShadowVirtqueue *vq);

vhost_shadow_vq_kick()?

> +
> +VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx);

vhost_dev_get_shadow_vq()? This could be in include/hw/virtio/vhost.h
with the other vhost_dev_*() functions.

> +
> +void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq);

Hmm...now I wonder what the lifecycle is. Does vhost_sw_lm_shadow_vq()
allocate it?

Please add doc comments explaining these functions either in this header
file or in vhost-sw-lm-ring.c.

> +
> +#endif
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index b5b7496537..93cc3f1ae3 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -54,6 +54,8 @@ struct vhost_iommu {
>      QLIST_ENTRY(vhost_iommu) iommu_next;
>  };
>  
> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> +
>  typedef struct VhostDevConfigOps {
>      /* Vhost device config space changed callback
>       */
> @@ -83,6 +85,7 @@ struct vhost_dev {
>      bool started;
>      bool log_enabled;
>      uint64_t log_size;
> +    VhostShadowVirtqueue *sw_lm_shadow_vq[2];

The hardcoded 2 is probably for single-queue virtio-net? I guess this
will eventually become VhostShadowVirtqueue *shadow_vqs or
VhostShadowVirtqueue **shadow_vqs, depending on whether each one should
be allocated individually.
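
A sketch of the individually allocated variant, sized by nvqs instead
of a hardcoded 2. All names here (SketchShadowVq, sketch_dev and the
start/stop helpers) are hypothetical, and plain calloc()/free() stand in
for whatever allocator the real code would use:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VhostShadowVirtqueue. */
typedef struct SketchShadowVq {
    int idx;
} SketchShadowVq;

/* Hypothetical stand-in for the relevant vhost_dev fields. */
struct sketch_dev {
    unsigned nvqs;
    SketchShadowVq **shadow_vqs; /* nvqs entries, each allocated on its own */
};

/* Free every shadow vq and the pointer array; safe to call twice. */
void sketch_shadow_vqs_stop(struct sketch_dev *dev)
{
    if (!dev->shadow_vqs) {
        return;
    }
    for (unsigned i = 0; i < dev->nvqs; i++) {
        free(dev->shadow_vqs[i]); /* free(NULL) is a no-op */
    }
    free(dev->shadow_vqs);
    dev->shadow_vqs = NULL;
}

/* Allocate one shadow vq per device vq; returns 0 or -1 on failure. */
int sketch_shadow_vqs_start(struct sketch_dev *dev)
{
    if (dev->nvqs == 0) {
        dev->shadow_vqs = NULL;
        return 0;
    }
    dev->shadow_vqs = calloc(dev->nvqs, sizeof(*dev->shadow_vqs));
    if (!dev->shadow_vqs) {
        return -1;
    }
    for (unsigned i = 0; i < dev->nvqs; i++) {
        dev->shadow_vqs[i] = calloc(1, sizeof(**dev->shadow_vqs));
        if (!dev->shadow_vqs[i]) {
            sketch_shadow_vqs_stop(dev);
            return -1;
        }
        dev->shadow_vqs[i]->idx = (int)i;
    }
    return 0;
}
```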

>      VirtIOHandleOutput sw_lm_vq_handler;
>      Error *migration_blocker;
>      const VhostOps *vhost_ops;
> diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> new file mode 100644
> index 0000000000..0192e77831
> --- /dev/null
> +++ b/hw/virtio/vhost-sw-lm-ring.c
> @@ -0,0 +1,60 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2020
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "hw/virtio/vhost-sw-lm-ring.h"
> +#include "hw/virtio/vhost.h"
> +
> +#include "standard-headers/linux/vhost_types.h"
> +#include "standard-headers/linux/virtio_ring.h"
> +
> +#include "qemu/event_notifier.h"
> +
> +typedef struct VhostShadowVirtqueue {
> +    EventNotifier hdev_notifier;
> +    VirtQueue *vq;
> +} VhostShadowVirtqueue;
> +
> +static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
> +{
> +    return virtio_queue_get_used_notify_split(vq->vq);
> +}
> +
> +bool vhost_vring_kick(VhostShadowVirtqueue *vq)
> +{
> +    return vhost_vring_should_kick(vq) ? event_notifier_set(&vq->hdev_notifier)
> +                                       : true;
> +}

How is the return value used? event_notifier_set() returns -errno so
this function returns false on success, and true when notifications are
disabled or event_notifier_set() failed. I'm not sure this return value
can be used for anything.
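
To make the inversion concrete, here is a self-contained sketch.
sketch_notifier_set() is a hypothetical stand-in that only mimics the
"0 on success, -errno on failure" convention of event_notifier_set();
the second function shows one possible alternative, not necessarily the
right fix for this series:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for event_notifier_set(): 0 on success, -errno on failure. */
static int sketch_notifier_set(bool fail)
{
    return fail ? -EBADF : 0;
}

/*
 * Same shape as the posted vhost_vring_kick(): the int result is
 * implicitly converted to bool, so success (0) becomes false and any
 * error (non-zero) becomes true.
 */
bool sketch_kick_as_posted(bool should_kick, bool fail)
{
    return should_kick ? sketch_notifier_set(fail) : true;
}

/*
 * One possible alternative: keep the 0 / -errno convention and treat
 * "notifications disabled" as success.
 */
int sketch_kick(bool should_kick, bool fail)
{
    if (!should_kick) {
        return 0;
    }
    return sketch_notifier_set(fail);
}
```

With the posted shape a caller cannot tell "kick suppressed" apart from
"kick failed", because both come back as true.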

> +
> +VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)

I see now that this function allocates the VhostShadowVirtqueue. Maybe
adding _new() to the name would make that clear?

> +{
> +    struct vhost_vring_file file = {
> +        .index = idx
> +    };
> +    VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
> +    VhostShadowVirtqueue *svq;
> +    int r;
> +
> +    svq = g_new0(VhostShadowVirtqueue, 1);
> +    svq->vq = vq;
> +
> +    r = event_notifier_init(&svq->hdev_notifier, 0);
> +    assert(r == 0);
> +
> +    file.fd = event_notifier_get_fd(&svq->hdev_notifier);
> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> +    assert(r == 0);
> +
> +    return svq;
> +}

I guess there are assumptions about the status of the device? Does the
virtqueue need to be disabled when this function is called?

> +
> +void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq)
> +{
> +    event_notifier_cleanup(&vq->hdev_notifier);
> +    g_free(vq);
> +}
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 9cbd52a7f1..a55b684b5f 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -13,6 +13,8 @@
>   * GNU GPL, version 2 or (at your option) any later version.
>   */
>  
> +#include "hw/virtio/vhost-sw-lm-ring.h"
> +
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
>  #include "hw/virtio/vhost.h"
> @@ -61,6 +63,20 @@ bool vhost_has_free_slot(void)
>      return slots_limit > used_memslots;
>  }
>  
> +static struct vhost_dev *vhost_dev_from_virtio(const VirtIODevice *vdev)
> +{
> +    struct vhost_dev *hdev;
> +
> +    QLIST_FOREACH(hdev, &vhost_devices, entry) {
> +        if (hdev->vdev == vdev) {
> +            return hdev;
> +        }
> +    }
> +
> +    assert(hdev);
> +    return NULL;
> +}
> +
>  static bool vhost_dev_can_log(const struct vhost_dev *hdev)
>  {
>      return hdev->features & (0x1ULL << VHOST_F_LOG_ALL);
> @@ -148,6 +164,12 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
>      return 0;
>  }
>  
> +static void vhost_log_sync_nop(MemoryListener *listener,
> +                               MemoryRegionSection *section)
> +{
> +    return;
> +}
> +
>  static void vhost_log_sync(MemoryListener *listener,
>                            MemoryRegionSection *section)
>  {
> @@ -928,6 +950,71 @@ static void vhost_log_global_stop(MemoryListener *listener)
>      }
>  }
>  
> +static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    struct vhost_dev *hdev = vhost_dev_from_virtio(vdev);

If this lookup becomes a performance bottleneck there are other options
for determining the vhost_dev. For example VirtIODevice could have a
field for stashing the vhost_dev pointer.

> +    uint16_t idx = virtio_get_queue_index(vq);
> +
> +    VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
> +
> +    vhost_vring_kick(svq);
> +}

I'm confused. Do we need to pop elements from vq and push equivalent
elements onto svq before kicking? Either a todo comment is missing or I
misunderstand how this works.

> +
> +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> +{
> +    int idx;
> +
> +    vhost_dev_enable_notifiers(dev, dev->vdev);
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> +{
> +    int idx;
> +
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
> +    }
> +
> +    vhost_dev_disable_notifiers(dev, dev->vdev);

There is a race condition if the guest kicks the vq while this is
happening. The shadow vq hdev_notifier needs to be set so the vhost
device checks the virtqueue for requests that slipped in during the
race window.

> +
> +    return 0;
> +}
> +
> +static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
> +                                          bool enable_lm)
> +{
> +    if (enable_lm) {
> +        return vhost_sw_live_migration_start(dev);
> +    } else {
> +        return vhost_sw_live_migration_stop(dev);
> +    }
> +}
> +
> +static void vhost_sw_lm_global_start(MemoryListener *listener)
> +{
> +    int r;
> +
> +    r = vhost_migration_log(listener, true, vhost_sw_live_migration_enable);
> +    if (r < 0) {
> +        abort();
> +    }
> +}
> +
> +static void vhost_sw_lm_global_stop(MemoryListener *listener)
> +{
> +    int r;
> +
> +    r = vhost_migration_log(listener, false, vhost_sw_live_migration_enable);
> +    if (r < 0) {
> +        abort();
> +    }
> +}
> +
>  static void vhost_log_start(MemoryListener *listener,
>                              MemoryRegionSection *section,
>                              int old, int new)
> @@ -1334,9 +1421,14 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>          .region_nop = vhost_region_addnop,
>          .log_start = vhost_log_start,
>          .log_stop = vhost_log_stop,
> -        .log_sync = vhost_log_sync,
> -        .log_global_start = vhost_log_global_start,
> -        .log_global_stop = vhost_log_global_stop,
> +        .log_sync = !vhost_dev_can_log(hdev) ?
> +                    vhost_log_sync_nop :
> +                    vhost_log_sync,

Why is this change necessary now? It's not clear to me why it was
previously okay to call vhost_log_sync().

> +        .log_global_start = !vhost_dev_can_log(hdev) ?
> +                            vhost_sw_lm_global_start :
> +                            vhost_log_global_start,
> +        .log_global_stop = !vhost_dev_can_log(hdev) ? vhost_sw_lm_global_stop :
> +                                                      vhost_log_global_stop,
>          .eventfd_add = vhost_eventfd_add,
>          .eventfd_del = vhost_eventfd_del,
>          .priority = 10
> @@ -1364,6 +1456,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>              error_free(hdev->migration_blocker);
>              goto fail_busyloop;
>          }
> +    } else {
> +        hdev->sw_lm_vq_handler = handle_sw_lm_vq;
>      }
>  
>      hdev->mem = g_malloc0(offsetof(struct vhost_memory, regions));
> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> index fbff9bc9d4..17419cb13e 100644
> --- a/hw/virtio/meson.build
> +++ b/hw/virtio/meson.build
> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>  
>  virtio_ss = ss.source_set()
>  virtio_ss.add(files('virtio.c'))
> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-sw-lm-ring.c'))
>  virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>  virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>  virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> -- 
> 2.18.4
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration
  2020-11-20 18:50 ` [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration Eugenio Pérez
@ 2020-12-08  7:20   ` Stefan Hajnoczi
  2020-12-09 17:57     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  7:20 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 660 bytes --]

On Fri, Nov 20, 2020 at 07:50:46PM +0100, Eugenio Pérez wrote:
> @@ -1571,6 +1577,13 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
>      BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
>      int i, r;
>  
> +    if (hdev->sw_lm_enabled) {
> +        /* We've been called after migration is completed, so no need to
> +           disable it again
> +        */
> +        return;
> +    }
> +
>      for (i = 0; i < hdev->nvqs; ++i) {
>          r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
>                                           false);

What is the purpose of this?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 09/27] vhost: Route host->guest notification through qemu
  2020-11-20 18:50 ` [RFC PATCH 09/27] vhost: Route host->guest notification through qemu Eugenio Pérez
@ 2020-12-08  7:34   ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  7:34 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 2197 bytes --]

On Fri, Nov 20, 2020 at 07:50:47PM +0100, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-sw-lm-ring.c |  3 +++
>  hw/virtio/vhost.c            | 20 ++++++++++++++++++++
>  2 files changed, 23 insertions(+)

I'm not sure I understand what is going on here. The guest notifier masking
feature exists to support MSI masking semantics. It looks like this
patch repurposes the notifier to decouple the vhost hdev from the virtio
device's irqfd? But this breaks MSI masking. I think you need to set up
your own eventfd and assign it to the vhost hdev's call fd instead of
using the mask notifier.

> 
> diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> index 0192e77831..cbf53965cd 100644
> --- a/hw/virtio/vhost-sw-lm-ring.c
> +++ b/hw/virtio/vhost-sw-lm-ring.c
> @@ -50,6 +50,9 @@ VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
>      r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>      assert(r == 0);
>  
> +    vhost_virtqueue_mask(dev, dev->vdev, idx, true);
> +    vhost_virtqueue_pending(dev, idx);

Why is the mask notifier cleared? Could we lose a guest notification
here?

> +
>      return svq;
>  }
>  
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 1d55e26d45..9352c56bfa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -960,12 +960,29 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
>      vhost_vring_kick(svq);
>  }
>  
> +static void vhost_handle_call(EventNotifier *n)
> +{
> +    struct vhost_virtqueue *hvq = container_of(n,
> +                                              struct vhost_virtqueue,
> +                                              masked_notifier);
> +    struct vhost_dev *vdev = hvq->dev;
> +    int idx = vdev->vq_index + (hvq == &vdev->vqs[0] ? 0 : 1);

vhost-net-specific hack
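
A generic alternative would be to derive the index from the position of
the vhost_virtqueue inside the device's array. A minimal sketch with
hypothetical stand-in types (sketch_vq, sketch_dev), assuming the
virtqueues live in one contiguous array as dev->vqs does:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vhost_virtqueue. */
struct sketch_vq {
    int dummy;
};

/* Hypothetical stand-in for the relevant struct vhost_dev fields. */
struct sketch_dev {
    struct sketch_vq *vqs; /* contiguous array of nvqs entries */
    int nvqs;
    int vq_index;          /* index of vqs[0] within the virtio device */
};

/* Works for any number of virtqueues, not only an rx/tx pair. */
int sketch_vq_to_index(const struct sketch_dev *dev,
                       const struct sketch_vq *vq)
{
    ptrdiff_t off = vq - dev->vqs;

    assert(off >= 0 && off < dev->nvqs);
    return dev->vq_index + (int)off;
}
```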

> +    VirtQueue *vq = virtio_get_queue(vdev->vdev, idx);
> +
> +    if (event_notifier_test_and_clear(n)) {
> +        virtio_queue_invalidate_signalled_used(vdev->vdev, idx);
> +        virtio_notify_irqfd(vdev->vdev, vq);

/* TODO push used elements into vq? */

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/27] vhost: Allocate shadow vring
  2020-11-20 18:50 ` [RFC PATCH 10/27] vhost: Allocate shadow vring Eugenio Pérez
@ 2020-12-08  7:49   ` Stefan Hajnoczi
  2020-12-08  8:17   ` Stefan Hajnoczi
  1 sibling, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  7:49 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Fri, Nov 20, 2020 at 07:50:48PM +0100, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-sw-lm-ring.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> index cbf53965cd..cd7b5ba772 100644
> --- a/hw/virtio/vhost-sw-lm-ring.c
> +++ b/hw/virtio/vhost-sw-lm-ring.c
> @@ -16,8 +16,11 @@
>  #include "qemu/event_notifier.h"
>  
>  typedef struct VhostShadowVirtqueue {
> +    struct vring vring;
>      EventNotifier hdev_notifier;
>      VirtQueue *vq;
> +
> +    vring_desc_t descs[];
>  } VhostShadowVirtqueue;

VhostShadowVirtqueue is starting to look like VirtQueue. Can the shadow
vq code simply use the VirtIODevice's VirtQueues instead of duplicating
this?

What I mean is:

1. Disable the vhost hdev vq and sync the avail index back to the
   VirtQueue.
2. Move the irq fd to the VirtQueue as its guest notifier.
3. Install the shadow_vq_handler() as the VirtQueue's handle_output
   function.
4. Move the call fd to the VirtQueue as its host notifier.

Now we can process requests from the VirtIODevice's VirtQueue using
virtqueue_pop() and friends. We're also in sync and ready for vmstate
save/load.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2020-11-20 18:50 ` [RFC PATCH 13/27] vhost: Send buffers to device Eugenio Pérez
@ 2020-12-08  8:16   ` Stefan Hajnoczi
  2020-12-09 18:41     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  8:16 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 2332 bytes --]

On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> -static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
> +static bool vhost_vring_should_kick_rcu(VhostShadowVirtqueue *vq)

"vhost_vring_" is a confusing name. This is not related to
vhost_virtqueue or the vhost_vring_* structs.

vhost_shadow_vq_should_kick_rcu()?

>  {
> -    return virtio_queue_get_used_notify_split(vq->vq);
> +    VirtIODevice *vdev = vq->vdev;
> +    vq->num_added = 0;

I'm surprised that a bool function modifies state. Is this assignment
intentional?

> +/* virtqueue_add:
> + * @vq: The #VirtQueue
> + * @elem: The #VirtQueueElement

The copy-pasted doc comment doesn't match this function.

> +int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem)
> +{
> +    int host_head = vhost_vring_add_split(vq, elem);
> +    if (vq->ring_id_maps[host_head]) {
> +        g_free(vq->ring_id_maps[host_head]);
> +    }

VirtQueueElement is freed lazily? Is there a reason for this approach? I
would have expected it to be freed when the used element is processed by
the kick fd handler.

> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 9352c56bfa..304e0baa61 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -956,8 +956,34 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
>      uint16_t idx = virtio_get_queue_index(vq);
>  
>      VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
> +    VirtQueueElement *elem;
>  
> -    vhost_vring_kick(svq);
> +    /*
> +     * Make available all buffers as possible.
> +     */
> +    do {
> +        if (virtio_queue_get_notification(vq)) {
> +            virtio_queue_set_notification(vq, false);
> +        }
> +
> +        while (true) {
> +            int r;
> +            if (virtio_queue_full(vq)) {
> +                break;
> +            }

Why is this check necessary? The guest cannot provide more descriptors
than there is ring space. If that happens somehow then it's a driver
error that is already reported in virtqueue_pop() below.

I wonder if you added this because the vring implementation above
doesn't support indirect descriptors? It's easy to exhaust the vhost
hdev vring while there is still room in the VirtIODevice's VirtQueue
vring.
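
The arithmetic behind that guess, as a sketch under the assumption that
the shadow vring uses only direct descriptors (one ring slot per buffer
segment) while the guest may use indirect descriptors (one ring slot per
request). The function name and parameters are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of ring slots consumed by n requests, where segments[i] is the
 * number of buffer segments of request i. With indirect descriptors each
 * request takes a single slot; without them it takes one per segment.
 */
unsigned sketch_slots_needed(const unsigned *segments, size_t n, int indirect)
{
    unsigned slots = 0;

    assert(segments != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        slots += indirect ? 1u : segments[i];
    }
    return slots;
}
```

For example, four requests of three segments each need 4 slots in a
guest ring using indirect descriptors but 12 slots in a direct-only
ring, so a shadow ring of the same size (say 8) can fill up while the
guest ring still has room.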

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/27] vhost: Allocate shadow vring
  2020-11-20 18:50 ` [RFC PATCH 10/27] vhost: Allocate shadow vring Eugenio Pérez
  2020-12-08  7:49   ` Stefan Hajnoczi
@ 2020-12-08  8:17   ` Stefan Hajnoczi
  2020-12-09 18:15     ` Eugenio Perez Martin
  1 sibling, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  8:17 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Fri, Nov 20, 2020 at 07:50:48PM +0100, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-sw-lm-ring.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> index cbf53965cd..cd7b5ba772 100644
> --- a/hw/virtio/vhost-sw-lm-ring.c
> +++ b/hw/virtio/vhost-sw-lm-ring.c
> @@ -16,8 +16,11 @@
>  #include "qemu/event_notifier.h"
>  
>  typedef struct VhostShadowVirtqueue {
> +    struct vring vring;
>      EventNotifier hdev_notifier;
>      VirtQueue *vq;
> +
> +    vring_desc_t descs[];
>  } VhostShadowVirtqueue;

Looking at later patches I see that this is the vhost hdev vring state,
not the VirtIODevice vring state. That makes more sense.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element
  2020-11-20 18:50 ` [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element Eugenio Pérez
@ 2020-12-08  8:25   ` Stefan Hajnoczi
  2020-12-09 18:46     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  8:25 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 666 bytes --]

On Fri, Nov 20, 2020 at 07:50:54PM +0100, Eugenio Pérez wrote:
> Specify VirtQueueElement * as return type makes no harm at this moment.

The reason for the void * return type is that C implicitly converts void
pointers to pointers of any type. The function takes a size_t sz
argument so it can allocate an object of user-defined size. The idea is
that the user's struct embeds a VirtQueueElement field. Changing the
return type to VirtQueueElement * means that callers may need to
explicitly cast to the user's struct type.

It's a question of coding style but I think the void * return type
communicates what is going on better than VirtQueueElement *.
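
A self-contained sketch of the pattern, with hypothetical stand-ins
(SketchElement, sketch_alloc_element(), SketchRequest) and calloc() in
place of QEMU's allocator. The allocator returns void *, and the caller
assigns the result directly to a pointer to its own struct, which embeds
the element as its first member:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for VirtQueueElement. */
typedef struct SketchElement {
    unsigned index;
    unsigned out_num;
    unsigned in_num;
} SketchElement;

/*
 * Allocate sz bytes (at least sizeof(SketchElement)) and initialise the
 * leading element. Returning void * lets callers assign the result to a
 * pointer to their own, larger struct without a cast.
 */
void *sketch_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
{
    SketchElement *elem;

    assert(sz >= sizeof(SketchElement));
    elem = calloc(1, sz);
    if (!elem) {
        return NULL;
    }
    elem->out_num = out_num;
    elem->in_num = in_num;
    return elem;
}

/* A user-defined request that embeds the element as its first member. */
typedef struct SketchRequest {
    SketchElement elem;
    int status;
} SketchRequest;

SketchRequest *sketch_request_new(unsigned out_num, unsigned in_num)
{
    /* No cast needed: void * converts implicitly in C. */
    SketchRequest *req = sketch_alloc_element(sizeof(*req), out_num, in_num);

    return req;
}
```

If the allocator returned SketchElement * instead, the assignment in
sketch_request_new() would need an explicit cast to SketchRequest *.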

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu
  2020-11-20 18:50 ` [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu Eugenio Pérez
@ 2020-12-08  8:41   ` Stefan Hajnoczi
  2020-12-09 18:48     ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  8:41 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 299 bytes --]

On Fri, Nov 20, 2020 at 07:50:56PM +0100, Eugenio Pérez wrote:
> @@ -83,6 +89,18 @@ void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable)
>      smp_mb();
>  }
>  
> +bool vhost_vring_poll_rcu(VhostShadowVirtqueue *vq)

A name like "more_used" is clearer than "poll".
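
To illustrate why "more_used" describes the check, here is a rough
sketch of what such a predicate typically does on a split ring: it
compares the last used index consumed locally with the index published
by the device. The UsedRing and ShadowVq structs below are simplified
assumptions, not the structures from the patch, and the memory barriers
a real implementation needs are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the header of a split-ring used ring. */
typedef struct UsedRing {
    uint16_t flags;
    uint16_t idx;              /* advanced by the device */
} UsedRing;

/* Hypothetical shadow virtqueue state (not the struct from the patch). */
typedef struct ShadowVq {
    const volatile UsedRing *used;
    uint16_t last_used_idx;    /* next used entry this side will consume */
} ShadowVq;

/* True if the device has published used entries not yet consumed. */
static bool shadow_vq_more_used(const ShadowVq *vq)
{
    assert(vq->used != NULL);
    return vq->last_used_idx != vq->used->idx;
}
```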

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 20/27] vhost: Return used buffers
  2020-11-20 18:50 ` [RFC PATCH 20/27] vhost: Return used buffers Eugenio Pérez
@ 2020-12-08  8:50   ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  8:50 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1069 bytes --]

On Fri, Nov 20, 2020 at 07:50:58PM +0100, Eugenio Pérez wrote:
> @@ -1028,6 +1061,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>  
>      for (idx = 0; idx < dev->nvqs; ++idx) {
>          struct vhost_virtqueue *vq = &dev->vqs[idx];
> +        unsigned num = virtio_queue_get_num(dev->vdev, idx);
>          struct vhost_vring_addr addr = {
>              .index = idx,
>          };
> @@ -1044,6 +1078,12 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>          r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
>          assert(r == 0);
>  
> +        r = vhost_backend_update_device_iotlb(dev, addr.used_user_addr,
> +                                              addr.used_user_addr,
> +                                              sizeof(vring_used_elem_t) * num,
> +                                              IOMMU_RW);

I don't remember seeing iotlb setup for the rest of the vring or guest
memory. Maybe this should go into a single patch so it's easy to review
the iova space layout.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 24/27] vhost: iommu changes
  2020-11-20 18:51 ` [RFC PATCH 24/27] vhost: iommu changes Eugenio Pérez
@ 2020-12-08  9:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  9:02 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Fri, Nov 20, 2020 at 07:51:02PM +0100, Eugenio Pérez wrote:
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index eebfac4455..cb44b9997f 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1109,6 +1109,10 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>  
>      assert(dev->vhost_ops->vhost_set_vring_enable);
>      dev->vhost_ops->vhost_set_vring_enable(dev, false);
> +    if (vhost_dev_has_iommu(dev)) {
> +        r = vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL);
> +        assert(r == 0);
> +    }
>  
>      for (idx = 0; idx < dev->nvqs; ++idx) {
>          struct vhost_virtqueue *vq = &dev->vqs[idx];
> @@ -1269,6 +1273,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
>  
>      trace_vhost_iotlb_miss(dev, 1);
>  
> +    if (dev->sw_lm_enabled) {
> +        uaddr = iova;
> +        len = 4096;
> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> +                                                IOMMU_RW);

It would be nice to look up the available memory so
vhost_backend_update_device_iotlb() can be called with a much bigger
[uaddr, uaddr+len) range. This will reduce the number of iotlb misses.
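
As a sketch of that idea, assuming the mapped memory can be described
as a small table of contiguous ranges: on a miss, look up the range
that contains the faulting iova and report the whole range rather than
a fixed 4096 bytes. MemRange and iotlb_miss_range() are hypothetical
names; how QEMU would actually obtain the ranges is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous, mapped memory range. */
typedef struct MemRange {
    uint64_t start;
    uint64_t len;
} MemRange;

/*
 * Find the range containing iova and return its full extent through
 * *start and *len, so one iotlb update can cover the whole range.
 * Returns false if no range contains iova.
 */
static bool iotlb_miss_range(const MemRange *ranges, size_t n, uint64_t iova,
                             uint64_t *start, uint64_t *len)
{
    assert(start != NULL && len != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction to avoid overflow in start + len. */
        if (iova >= ranges[i].start &&
            iova - ranges[i].start < ranges[i].len) {
            *start = ranges[i].start;
            *len = ranges[i].len;
            return true;
        }
    }
    return false;
}
```

The caller would then pass the returned start and len to the iotlb
update instead of (iova, 4096), falling back to the single-page update
when the lookup fails.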

Will vIOMMU be required for this feature? If not, then the vring needs
to be added to the vhost memory regions because vhost will not send QEMU
iotlb misses.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
                   ` (30 preceding siblings ...)
  2020-11-27 15:44 ` Stefano Garzarella
@ 2020-12-08  9:37 ` Stefan Hajnoczi
  2020-12-09  9:26   ` Jason Wang
  31 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-08  9:37 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: kvm, Michael S. Tsirkin, Jason Wang, qemu-devel, Daniel Daly,
	virtualization, Liran Alon, Eli Cohen, Nitin Shrivastav,
	Alex Barba, Christophe Fontaine, Juan Quintela, Lee Ballard,
	Lars Ganrot, Rob Miller, Stefano Garzarella, Howard Cai,
	Parav Pandit, vm, Salil Mehta, Stephen Finucane, Xiao W Wang,
	Sean Mooney, Stefan Hajnoczi, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]

On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.

Pros:
+ vhost/vDPA devices don't need to implement dirty memory logging
+ Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends

Cons:
- Not generic, relies on vhost-net-specific ioctls
- Doesn't support VIRTIO Shared Memory Regions
  https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
- Performance (see below)

I think performance will be significantly lower when the shadow vq is
enabled. Imagine a vDPA device with hardware vq doorbell registers
mapped into the guest so the guest driver can directly kick the device.
When the shadow vq is enabled a vmexit is needed to write to the shadow
vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
to read the ioeventfd, the descriptors are translated, QEMU writes to
the vhost hdev kick fd, the host kernel scheduler switches to the vhost
worker thread, vhost/vDPA notifies the virtqueue, and finally the
vDPA driver writes to the hardware vq doorbell register. That is a lot
of overhead compared to writing to an exitless MMIO register!
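
To make one hop of that path concrete, here is a Linux-only sketch of
the userspace relay step: read the guest's kick from one eventfd and
forward it by writing to another. relay_kick() and its parameter names
are hypothetical; the descriptor translation that would happen between
the two calls is only marked by a comment.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Consume a pending kick from guest_kick_fd and forward a single kick
 * to device_kick_fd. Returns 0 on success, -1 on error. Each call costs
 * two system calls plus the wakeup of whichever thread polls the fds.
 */
static int relay_kick(int guest_kick_fd, int device_kick_fd)
{
    uint64_t count;
    const uint64_t one = 1;

    /* Reading an eventfd returns its counter and resets it to zero. */
    if (read(guest_kick_fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }

    /* A real relay would translate and forward descriptors here. */

    if (write(device_kick_fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        return -1;
    }
    return 0;
}
```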

If the shadow vq was implemented in drivers/vhost/ and QEMU used the
existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
dirty memory logging happens asynchronously and isn't in the dataplane.

In addition, hardware that supports dirty memory logging as well as
software vDPA devices could completely eliminate the shadow vq for even
better performance.

But performance is a question of "is it good enough?". Maybe this
approach is okay and users don't expect good performance while dirty
memory logging is enabled. I just wanted to share the idea of moving the
shadow vq into the kernel in case you like that approach better.

Stefan

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-12-08  9:37 ` Stefan Hajnoczi
@ 2020-12-09  9:26   ` Jason Wang
  2020-12-09 15:57     ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Wang @ 2020-12-09  9:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, qemu-devel, Daniel Daly, virtualization,
	Liran Alon, Eli Cohen, Nitin Shrivastav, Alex Barba,
	Christophe Fontaine, Juan Quintela, Lee Ballard,
	Eugenio Pérez, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

----- Original Message -----
> On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> > This series enable vDPA software assisted live migration for vhost-net
> > devices. This is a new method of vhost devices migration: Instead of
> > relay on vDPA device's dirty logging capability, SW assisted LM
> > intercepts dataplane, forwarding the descriptors between VM and device.
> 
> Pros:
> + vhost/vDPA devices don't need to implement dirty memory logging
> + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
> 
> Cons:
> - Not generic, relies on vhost-net-specific ioctls
> - Doesn't support VIRTIO Shared Memory Regions
>   https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex

I may be missing something, but my understanding is that it's the
responsibility of the device to migrate this part?

> - Performance (see below)
> 
> I think performance will be significantly lower when the shadow vq is
> enabled. Imagine a vDPA device with hardware vq doorbell registers
> mapped into the guest so the guest driver can directly kick the device.
> When the shadow vq is enabled a vmexit is needed to write to the shadow
> vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
> to read the ioeventfd, the descriptors are translated, QEMU writes to
> the vhost hdev kick fd, the host kernel scheduler switches to the vhost
> worker thread, vhost/vDPA notifies the virtqueue, and finally the
> vDPA driver writes to the hardware vq doorbell register. That is a lot
> of overhead compared to writing to an exitless MMIO register!

I think it's a balance. E.g. we can poll the virtqueue to have an
exitless doorbell.

> 
> If the shadow vq was implemented in drivers/vhost/ and QEMU used the
> existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
> reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
> dirty memory logging happens asynchronously and isn't in the dataplane.
> 
> In addition, hardware that supports dirty memory logging as well as
> software vDPA devices could completely eliminate the shadow vq for even
> better performance.

Yes. That's our plan. But the interface might require more thought.

E.g. is the bitmap a good approach? To me, reporting dirty pages via a
virtqueue is better since it has a smaller footprint and is self-throttled.

And we need an address space other than the one used by the guest for
either the bitmap or the virtqueue.
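
For reference, the bitmap approach being compared can be sketched as
one bit per guest page, which is why its footprint scales with guest
memory size rather than with the number of dirty pages. This is an
illustrative model only: the 4 KiB page size, the DIRTY_PAGE_SHIFT
macro and both helper names are assumptions, not the vhost log format.

```c
#include <limits.h>
#include <stdint.h>

#define DIRTY_PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define DIRTY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Mark every page overlapped by [addr, addr + len) as dirty. */
static void dirty_bitmap_mark(unsigned long *bitmap, uint64_t addr,
                              uint64_t len)
{
    if (len == 0) {
        return;
    }

    uint64_t first = addr >> DIRTY_PAGE_SHIFT;
    uint64_t last = (addr + len - 1) >> DIRTY_PAGE_SHIFT;

    for (uint64_t page = first; page <= last; page++) {
        bitmap[page / DIRTY_BITS_PER_WORD] |=
            1UL << (page % DIRTY_BITS_PER_WORD);
    }
}

/* Return 1 if the given page number is marked dirty, 0 otherwise. */
static int dirty_bitmap_test(const unsigned long *bitmap, uint64_t page)
{
    return (int)((bitmap[page / DIRTY_BITS_PER_WORD] >>
                  (page % DIRTY_BITS_PER_WORD)) & 1UL);
}
```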

> 
> But performance is a question of "is it good enough?". Maybe this
> approach is okay and users don't expect good performance while dirty
> memory logging is enabled.

Yes, and actually such a slowdown may help the migration converge.

Note that the whole idea is to try to have a generic solution for all
types of devices. It's good to consider performance, but for the
first stage it should be sufficient to make it work and then
optimize on top.

> I just wanted to share the idea of moving the
> shadow vq into the kernel in case you like that approach better.

My understanding is that we should keep the kernel as simple as possible
and leave policy to userspace as much as possible. E.g. it requires us to
disable doorbell mapping and irq offloading, all of which are under
the control of userspace.

Thanks

> 
> Stefan
>



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable
  2020-12-07 16:43   ` Stefan Hajnoczi
@ 2020-12-09 12:00     ` Eugenio Perez Martin
  2020-12-09 16:08       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 12:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Mon, Dec 7, 2020 at 5:43 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:42PM +0100, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index 222bbcc62d..317f1f96fa 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
> >      return idx - dev->vq_index;
> >  }
> >
> > +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
> > +                                      bool enable)
> > +{
> > +    struct vhost_vring_file file = {
> > +        .index = idx,
> > +    };
> > +
> > +    if (!enable) {
> > +        file.fd = -1; /* Pass -1 to unbind from file. */
> > +    } else {
> > +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
> > +        file.fd = vn_dev->backend;
> > +    }
> > +
> > +    return vhost_kernel_net_set_backend(dev, &file);
>
> This is vhost-net specific even though the function appears to be
> generic. Is there a plan to extend this to all devices?
>

I expected each vhost backend to enable/disable on its own terms, but
I think it could be 100% virtio-device generic with something like the
device state capability:
https://lists.oasis-open.org/archives/virtio-comment/202012/msg00005.html

> > +}
> > +
> > +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < dev->nvqs; ++i) {
> > +        vhost_kernel_set_vq_enable(dev, i, enable);
> > +    }
> > +
> > +    return 0;
> > +}
>
> I suggest exposing the per-vq interface (vhost_kernel_set_vq_enable())
> in VhostOps so it follows the ioctl interface.

That was actually the initial plan; I left it as all-or-nothing to make fewer changes.

> vhost_kernel_set_vring_enable() can be moved to vhost.c can loop over
> all vqs if callers find it convenient to loop over all vqs.

I'm ok with it. Thinking out loud, I don't know if it is easier for
some devices to enable/disable all of it (fewer syscalls? less downtime
somehow?) but I find the per-virtqueue approach more generic and
useful.
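
A rough sketch of the split being discussed, with made-up names: the
backend ops table exposes only a per-virtqueue enable callback, and a
generic helper loops over all virtqueues for callers that want the
all-or-nothing behaviour. DemoDev, DemoOps and the toy backend are
hypothetical and exist only to make the example self-contained. Unlike
the loop in the patch, this helper stops and reports the first error.

```c
#include <stdbool.h>

typedef struct DemoDev DemoDev;

/* Hypothetical ops table exposing a per-virtqueue enable callback. */
typedef struct DemoOps {
    int (*set_vq_enable)(DemoDev *dev, unsigned idx, bool enable);
} DemoOps;

struct DemoDev {
    const DemoOps *ops;
    unsigned nvqs;
    unsigned enabled_mask;    /* toy state: bit i set if vq i is enabled */
};

/* Generic helper: apply the per-vq op to every vq, stop at first error. */
static int demo_dev_set_all_vqs_enable(DemoDev *dev, bool enable)
{
    for (unsigned i = 0; i < dev->nvqs; i++) {
        int r = dev->ops->set_vq_enable(dev, i, enable);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

/* Toy backend used only to exercise the helper: supports vqs 0 and 1. */
static int demo_backend_set_vq_enable(DemoDev *dev, unsigned idx, bool enable)
{
    if (idx >= 2) {
        return -1;
    }
    if (enable) {
        dev->enabled_mask |= 1u << idx;
    } else {
        dev->enabled_mask &= ~(1u << idx);
    }
    return 0;
}

static const DemoOps demo_ops = {
    .set_vq_enable = demo_backend_set_vq_enable,
};
```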

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log
  2020-12-07 16:19   ` Stefan Hajnoczi
@ 2020-12-09 12:20     ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 12:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Mon, Dec 7, 2020 at 5:19 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:40PM +0100, Eugenio Pérez wrote:
> > This allows code to reuse the logic to not to re-enable or re-disable
> > migration mechanisms. Code works the same way as before.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/virtio/vhost.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 2bd8cdf893..2adb2718c1 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -862,7 +862,9 @@ err_features:
> >      return r;
> >  }
> >
> > -static int vhost_migration_log(MemoryListener *listener, bool enable)
> > +static int vhost_migration_log(MemoryListener *listener,
> > +                               bool enable,
> > +                               int (*device_cb)(struct vhost_dev *, bool))
>
> Please document the argument. What is the callback function supposed to
> do ("device_cb" is not descriptive so I'm not sure)?

Sure, I will expand documentation if we stick with this approach to
enable/disable the shadow virtqueue (I hope we agree on a better one
anyway).

Just for completeness, it was meant for vhost_dev_set_log, so vhost_dev *
is the device on which to enable/disable migration, and the second
argument (bool) says whether to enable or disable it.
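
As an illustration of the kind of documentation asked for, here is a
simplified model of a function that takes such a callback and only
invokes it when the state really changes. DemoLogDev,
demo_migration_log() and demo_count_cb() are hypothetical stand-ins,
not the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical device state; "calls" only counts callback invocations. */
typedef struct DemoLogDev {
    bool log_enabled;
    int calls;
} DemoLogDev;

/**
 * demo_migration_log:
 * @dev: device whose migration mechanism is switched
 * @enable: requested state
 * @device_cb: invoked as device_cb(@dev, @enable) only when the state
 *             actually changes; returns 0 on success, negative on error
 *
 * Returns 0 on success (including the no-op case) or the callback's
 * error code, in which case the recorded state is left unchanged.
 */
static int demo_migration_log(DemoLogDev *dev, bool enable,
                              int (*device_cb)(DemoLogDev *, bool))
{
    if (enable == dev->log_enabled) {
        return 0;    /* do not re-enable or re-disable */
    }

    int r = device_cb(dev, enable);
    if (r < 0) {
        return r;
    }
    dev->log_enabled = enable;
    return 0;
}

/* Toy callback that records how many times it was called. */
static int demo_count_cb(DemoLogDev *dev, bool enable)
{
    (void)enable;
    dev->calls++;
    return 0;
}
```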

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler
  2020-12-07 16:52   ` Stefan Hajnoczi
@ 2020-12-09 15:02     ` Eugenio Perez Martin
  2020-12-10 11:30       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 15:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Mon, Dec 7, 2020 at 5:52 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:43PM +0100, Eugenio Pérez wrote:
> > Only virtio-net honors it.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  include/hw/virtio/vhost.h |  1 +
> >  hw/net/virtio-net.c       | 39 ++++++++++++++++++++++++++++-----------
> >  2 files changed, 29 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > index 4a8bc75415..b5b7496537 100644
> > --- a/include/hw/virtio/vhost.h
> > +++ b/include/hw/virtio/vhost.h
> > @@ -83,6 +83,7 @@ struct vhost_dev {
> >      bool started;
> >      bool log_enabled;
> >      uint64_t log_size;
> > +    VirtIOHandleOutput sw_lm_vq_handler;
>
> sw == software?
> lm == live migration?
>
> Maybe there is a name that is clearer. What are these virtqueues called?
> Shadow vqs? Logged vqs?
>
> Live migration is a feature that uses dirty memory logging, but other
> features may use dirty memory logging too. The name should probably not
> be associated with live migration.
>

I totally agree, I find shadow_vq a better name for it.

> >      Error *migration_blocker;
> >      const VhostOps *vhost_ops;
> >      void *opaque;
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index 9179013ac4..9a69ae3598 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -2628,24 +2628,32 @@ static void virtio_net_tx_bh(void *opaque)
> >      }
> >  }
> >
> > -static void virtio_net_add_queue(VirtIONet *n, int index)
> > +static void virtio_net_add_queue(VirtIONet *n, int index,
> > +                                 VirtIOHandleOutput custom_handler)
> >  {
>
> We talked about the possibility of moving this into the generic vhost
> code so that devices don't need to be modified. It would be nice to hide
> this feature inside vhost.

I'm thinking of tying it to VirtQueue, allowing the caller to override
the handler when it knows the handler is not going to be called (I mean,
without offering protection against race conditions, e.g. before
starting to process notifications in qemu by calling
vhost_dev_disable_notifiers).

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-12-09  9:26   ` Jason Wang
@ 2020-12-09 15:57     ` Stefan Hajnoczi
  2020-12-10  9:12       ` Jason Wang
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-09 15:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, Stefan Hajnoczi, qemu-devel,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Eugenio Pérez, Lars Ganrot, Rob Miller,
	Stefano Garzarella, Howard Cai, Parav Pandit, vm, Salil Mehta,
	Stephen Finucane, Xiao W Wang, Sean Mooney, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]

On Wed, Dec 09, 2020 at 04:26:50AM -0500, Jason Wang wrote:
> ----- Original Message -----
> > On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> > > This series enable vDPA software assisted live migration for vhost-net
> > > devices. This is a new method of vhost devices migration: Instead of
> > > relay on vDPA device's dirty logging capability, SW assisted LM
> > > intercepts dataplane, forwarding the descriptors between VM and device.
> > 
> > Pros:
> > + vhost/vDPA devices don't need to implement dirty memory logging
> > + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
> > 
> > Cons:
> > - Not generic, relies on vhost-net-specific ioctls
> > - Doesn't support VIRTIO Shared Memory Regions
> >   https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
> 
> I may miss something but my understanding is that it's the
> responsiblity of device to migrate this part?

Good point. You're right.

> > - Performance (see below)
> > 
> > I think performance will be significantly lower when the shadow vq is
> > enabled. Imagine a vDPA device with hardware vq doorbell registers
> > mapped into the guest so the guest driver can directly kick the device.
> > When the shadow vq is enabled a vmexit is needed to write to the shadow
> > vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
> > to read the ioeventfd, the descriptors are translated, QEMU writes to
> > the vhost hdev kick fd, the host kernel scheduler switches to the vhost
> > worker thread, vhost/vDPA notifies the virtqueue, and finally the
> > vDPA driver writes to the hardware vq doorbell register. That is a lot
> > of overhead compared to writing to an exitless MMIO register!
> 
> I think it's a balance. E.g we can poll the virtqueue to have an
> exitless doorbell.
> 
> > 
> > If the shadow vq was implemented in drivers/vhost/ and QEMU used the
> > existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
> > reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
> > dirty memory logging happens asynchronously and isn't in the dataplane.
> > 
> > In addition, hardware that supports dirty memory logging as well as
> > software vDPA devices could completely eliminate the shadow vq for even
> > better performance.
> 
> Yes. That's our plan. But the interface might require more thought.
> 
> E.g is the bitmap a good approach? To me reporting dirty pages via
> virqueue is better since it get less footprint and is self throttled.
> 
> And we need an address space other than the one used by guest for
> either bitmap for virtqueue.
> 
> > 
> > But performance is a question of "is it good enough?". Maybe this
> > approach is okay and users don't expect good performance while dirty
> > memory logging is enabled.
> 
> Yes, and actually such slow down may help for the converge of the
> migration.
> 
> Note that the whole idea is try to have a generic solution for all
> types of devices. It's good to consider the performance but for the
> first stage, it should be sufficient to make it work and consider to
> optimize on top.

Moving the shadow vq to the kernel later would be quite a big change
requiring rewriting much of the code. That's why I mentioned this now
before a lot of effort is invested in a QEMU implementation.

> > I just wanted to share the idea of moving the
> > shadow vq into the kernel in case you like that approach better.
> 
> My understanding is to keep kernel as simple as possible and leave the
> polices to userspace as much as possible. E.g it requires us to
> disable doorbell mapping and irq offloading, all of which were under
> the control of userspace.

If the performance is acceptable with the QEMU approach then I think
that's the best place to implement it. It looks high-overhead though so
maybe one of the first things to do is to run benchmarks to collect data
on how it performs?

Stefan

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable
  2020-12-09 12:00     ` Eugenio Perez Martin
@ 2020-12-09 16:08       ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-09 16:08 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 3529 bytes --]

On Wed, Dec 09, 2020 at 01:00:19PM +0100, Eugenio Perez Martin wrote:
> On Mon, Dec 7, 2020 at 5:43 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Fri, Nov 20, 2020 at 07:50:42PM +0100, Eugenio Pérez wrote:
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
> > >  1 file changed, 29 insertions(+)
> > >
> > > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > > index 222bbcc62d..317f1f96fa 100644
> > > --- a/hw/virtio/vhost-backend.c
> > > +++ b/hw/virtio/vhost-backend.c
> > > @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
> > >      return idx - dev->vq_index;
> > >  }
> > >
> > > +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
> > > +                                      bool enable)
> > > +{
> > > +    struct vhost_vring_file file = {
> > > +        .index = idx,
> > > +    };
> > > +
> > > +    if (!enable) {
> > > +        file.fd = -1; /* Pass -1 to unbind from file. */
> > > +    } else {
> > > +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
> > > +        file.fd = vn_dev->backend;
> > > +    }
> > > +
> > > +    return vhost_kernel_net_set_backend(dev, &file);
> >
> > This is vhost-net specific even though the function appears to be
> > generic. Is there a plan to extend this to all devices?
> >
> 
> I expected each vhost backend to enable-disable in its own terms, but
> I think it could be 100% virtio-device generic with something like the
> device state capability:
> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00005.html
> .

Great, thanks for the link!

> > > +}
> > > +
> > > +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> > > +{
> > > +    int i;
> > > +
> > > +    for (i = 0; i < dev->nvqs; ++i) {
> > > +        vhost_kernel_set_vq_enable(dev, i, enable);
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> >
> > I suggest exposing the per-vq interface (vhost_kernel_set_vq_enable())
> > in VhostOps so it follows the ioctl interface.
> 
> It was actually the initial plan, I left as all-or-nothing to make less changes.
> 
> > vhost_kernel_set_vring_enable() can be moved to vhost.c can loop over
> > all vqs if callers find it convenient to loop over all vqs.
> 
> I'm ok with it. Thinking out loud, I don't know if it is easier for
> some devices to enable/disable all of it (less syscalls? less downtime
> somehow?) but I find more generic and useful the per-virtqueue
> approach.

That's an interesting question, the ability to enable/disable specific
virtqueues seems like it could be useful. For example, guests with vCPU
hotplug may want to enable/disable virtqueues so that multi-queue
adapts as the number of vCPUs changes. A per-vq interface is needed for
that.

I'm a little worried that some device types might not cope well with
quiescing individual vqs. Here "quiesce" means to complete in flight
requests. This would be where two or more vqs have a relationship and
disabling one vq could cause a deadlock when trying to disable the other
one. However, I can't think of a case where this happens.

virtio-vsock is the closest example but luckily we don't need to complete
in-flight requests, we can just stop the vq immediately. So although
there is a dependency on the other vq it won't deadlock in this case.

Stefan

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 07/27] vhost: Route guest->host notification through qemu
  2020-12-07 17:42   ` Stefan Hajnoczi
@ 2020-12-09 17:08     ` Eugenio Perez Martin
  2020-12-10 11:50       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 17:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Mon, Dec 7, 2020 at 6:42 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:45PM +0100, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/virtio/vhost-sw-lm-ring.h |  26 +++++++++
> >  include/hw/virtio/vhost.h    |   3 ++
> >  hw/virtio/vhost-sw-lm-ring.c |  60 +++++++++++++++++++++
> >  hw/virtio/vhost.c            | 100 +++++++++++++++++++++++++++++++++--
> >  hw/virtio/meson.build        |   2 +-
> >  5 files changed, 187 insertions(+), 4 deletions(-)
> >  create mode 100644 hw/virtio/vhost-sw-lm-ring.h
> >  create mode 100644 hw/virtio/vhost-sw-lm-ring.c
> >
> > diff --git a/hw/virtio/vhost-sw-lm-ring.h b/hw/virtio/vhost-sw-lm-ring.h
> > new file mode 100644
> > index 0000000000..86dc081b93
> > --- /dev/null
> > +++ b/hw/virtio/vhost-sw-lm-ring.h
> > @@ -0,0 +1,26 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2020
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#ifndef VHOST_SW_LM_RING_H
> > +#define VHOST_SW_LM_RING_H
> > +
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/virtio/virtio.h"
> > +#include "hw/virtio/vhost.h"
> > +
> > +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>
> Here it's called a shadow virtqueue while the file calls it a
> sw-lm-ring. Please use a single name.
>

I will switch to shadow virtqueue.

> > +
> > +bool vhost_vring_kick(VhostShadowVirtqueue *vq);
>
> vhost_shadow_vq_kick()?
>
> > +
> > +VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx);
>
> vhost_dev_get_shadow_vq()? This could be in include/hw/virtio/vhost.h
> with the other vhost_dev_*() functions.
>

I agree, that is a better place.

> > +
> > +void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq);
>
> Hmm...now I wonder what the lifecycle is. Does vhost_sw_lm_shadow_vq()
> allocate it?
>
> Please add doc comments explaining these functions either in this header
> file or in vhost-sw-lm-ring.c.
>

Will document.

> > +
> > +#endif
> > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > index b5b7496537..93cc3f1ae3 100644
> > --- a/include/hw/virtio/vhost.h
> > +++ b/include/hw/virtio/vhost.h
> > @@ -54,6 +54,8 @@ struct vhost_iommu {
> >      QLIST_ENTRY(vhost_iommu) iommu_next;
> >  };
> >
> > +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > +
> >  typedef struct VhostDevConfigOps {
> >      /* Vhost device config space changed callback
> >       */
> > @@ -83,6 +85,7 @@ struct vhost_dev {
> >      bool started;
> >      bool log_enabled;
> >      uint64_t log_size;
> > +    VhostShadowVirtqueue *sw_lm_shadow_vq[2];
>
> The hardcoded 2 is probably for single-queue virtio-net? I guess this
> will eventually become VhostShadowVirtqueue *shadow_vqs or
> VhostShadowVirtqueue **shadow_vqs, depending on whether each one should
> be allocated individually.
>

Yes, I will switch to one way or another for the next series.
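
A rough sketch of the second option (an array sized from nvqs, with each shadow virtqueue allocated individually) could look like the following. ShadowVq, Dev and the dev_*() helpers are hypothetical stand-ins for illustration only, not the QEMU structures, and the sketch assumes nvqs > 0:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for VhostShadowVirtqueue and struct vhost_dev. */
typedef struct ShadowVq {
    int idx;
} ShadowVq;

typedef struct Dev {
    int nvqs;
    ShadowVq **shadow_vqs;   /* nvqs entries, each allocated on its own */
} Dev;

/* Free every shadow vq allocated so far; safe on a partially built array. */
static void dev_free_shadow_vqs(Dev *dev)
{
    int i;

    if (!dev->shadow_vqs) {
        return;
    }
    for (i = 0; i < dev->nvqs; ++i) {
        free(dev->shadow_vqs[i]);   /* free(NULL) is a no-op */
    }
    free(dev->shadow_vqs);
    dev->shadow_vqs = NULL;
}

/* Allocate one shadow vq per device vq. Returns 0 on success, -1 on failure. */
static int dev_alloc_shadow_vqs(Dev *dev)
{
    int i;

    dev->shadow_vqs = calloc((size_t)dev->nvqs, sizeof(*dev->shadow_vqs));
    if (!dev->shadow_vqs) {
        return -1;
    }
    for (i = 0; i < dev->nvqs; ++i) {
        dev->shadow_vqs[i] = calloc(1, sizeof(ShadowVq));
        if (!dev->shadow_vqs[i]) {
            dev_free_shadow_vqs(dev);
            return -1;
        }
        dev->shadow_vqs[i]->idx = i;
    }
    return 0;
}
```

Sizing the array from nvqs removes the single-queue assumption, and the free helper tolerates a partially filled array so the error path stays short.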

> >      VirtIOHandleOutput sw_lm_vq_handler;
> >      Error *migration_blocker;
> >      const VhostOps *vhost_ops;
> > diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> > new file mode 100644
> > index 0000000000..0192e77831
> > --- /dev/null
> > +++ b/hw/virtio/vhost-sw-lm-ring.c
> > @@ -0,0 +1,60 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2020
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "hw/virtio/vhost-sw-lm-ring.h"
> > +#include "hw/virtio/vhost.h"
> > +
> > +#include "standard-headers/linux/vhost_types.h"
> > +#include "standard-headers/linux/virtio_ring.h"
> > +
> > +#include "qemu/event_notifier.h"
> > +
> > +typedef struct VhostShadowVirtqueue {
> > +    EventNotifier hdev_notifier;
> > +    VirtQueue *vq;
> > +} VhostShadowVirtqueue;
> > +
> > +static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
> > +{
> > +    return virtio_queue_get_used_notify_split(vq->vq);
> > +}
> > +
> > +bool vhost_vring_kick(VhostShadowVirtqueue *vq)
> > +{
> > +    return vhost_vring_should_kick(vq) ? event_notifier_set(&vq->hdev_notifier)
> > +                                       : true;
> > +}
>
> How is the return value used? event_notifier_set() returns -errno so
> this function returns false on success, and true when notifications are
> disabled or event_notifier_set() failed. I'm not sure this return value
> can be used for anything.
>

I think you are right, this is bad. It could be used for retry, but
the failure is unlikely and the fail path is easy to add in the future
if needed.

It will be void.
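
To make the problem concrete, here is a minimal standalone sketch of the conversion issue. fake_notifier_set() is a hypothetical stand-in that follows the 0 / -errno convention described above; it is not the QEMU function:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in: returns 0 on success, -errno on failure. */
static int fake_notifier_set(bool fail)
{
    return fail ? -EAGAIN : 0;
}

/*
 * Same shape as the posted helper: the int result is implicitly converted
 * to bool, so success (0) becomes false and any failure becomes true.
 */
static bool kick_as_posted(bool should_kick, bool fail)
{
    return should_kick ? fake_notifier_set(fail) : true;
}

/* Void variant: the caller gets no value that could be misread. */
static void kick_void(bool should_kick, bool fail)
{
    if (should_kick) {
        (void)fake_notifier_set(fail);
    }
}
```

With this shape, a successful kick returns false, while both a failed kick and a suppressed kick return true, so the caller cannot tell the cases apart.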

> > +
> > +VhostShadowVirtqueue *vhost_sw_lm_shadow_vq(struct vhost_dev *dev, int idx)
>
> I see now that this function allocates the VhostShadowVirtqueue. Maybe
> adding _new() to the name would make that clear?
>

Yes, I will rename.

> > +{
> > +    struct vhost_vring_file file = {
> > +        .index = idx
> > +    };
> > +    VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
> > +    VhostShadowVirtqueue *svq;
> > +    int r;
> > +
> > +    svq = g_new0(VhostShadowVirtqueue, 1);
> > +    svq->vq = vq;
> > +
> > +    r = event_notifier_init(&svq->hdev_notifier, 0);
> > +    assert(r == 0);
> > +
> > +    file.fd = event_notifier_get_fd(&svq->hdev_notifier);
> > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > +    assert(r == 0);
> > +
> > +    return svq;
> > +}
>
> I guess there are assumptions about the status of the device? Does the
> virtqueue need to be disabled when this function is called?
>

Yes. Maybe an assertion checking the notification state?

> > +
> > +void vhost_sw_lm_shadow_vq_free(VhostShadowVirtqueue *vq)
> > +{
> > +    event_notifier_cleanup(&vq->hdev_notifier);
> > +    g_free(vq);
> > +}
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 9cbd52a7f1..a55b684b5f 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -13,6 +13,8 @@
> >   * GNU GPL, version 2 or (at your option) any later version.
> >   */
> >
> > +#include "hw/virtio/vhost-sw-lm-ring.h"
> > +
> >  #include "qemu/osdep.h"
> >  #include "qapi/error.h"
> >  #include "hw/virtio/vhost.h"
> > @@ -61,6 +63,20 @@ bool vhost_has_free_slot(void)
> >      return slots_limit > used_memslots;
> >  }
> >
> > +static struct vhost_dev *vhost_dev_from_virtio(const VirtIODevice *vdev)
> > +{
> > +    struct vhost_dev *hdev;
> > +
> > +    QLIST_FOREACH(hdev, &vhost_devices, entry) {
> > +        if (hdev->vdev == vdev) {
> > +            return hdev;
> > +        }
> > +    }
> > +
> > +    assert(hdev);
> > +    return NULL;
> > +}
> > +
> >  static bool vhost_dev_can_log(const struct vhost_dev *hdev)
> >  {
> >      return hdev->features & (0x1ULL << VHOST_F_LOG_ALL);
> > @@ -148,6 +164,12 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
> >      return 0;
> >  }
> >
> > +static void vhost_log_sync_nop(MemoryListener *listener,
> > +                               MemoryRegionSection *section)
> > +{
> > +    return;
> > +}
> > +
> >  static void vhost_log_sync(MemoryListener *listener,
> >                            MemoryRegionSection *section)
> >  {
> > @@ -928,6 +950,71 @@ static void vhost_log_global_stop(MemoryListener *listener)
> >      }
> >  }
> >
> > +static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > +    struct vhost_dev *hdev = vhost_dev_from_virtio(vdev);
>
> If this lookup becomes a performance bottleneck there are other options
> for determining the vhost_dev. For example VirtIODevice could have a
> field for stashing the vhost_dev pointer.
>

I would like to have something like that for the final patch
series, yes. I would rather not increase virtio's knowledge of
vhost, but it seems the most straightforward change for it.

> > +    uint16_t idx = virtio_get_queue_index(vq);
> > +
> > +    VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
> > +
> > +    vhost_vring_kick(svq);
> > +}
>
> I'm a bit confused. Do we need to pop elements from vq and push equivalent
> elements onto svq before kicking? Either a todo comment is missing or I
> misunderstand how this works.
>

At this commit only notifications are forwarded; buffers are still
fetched directly from the guest. A TODO comment would have been
helpful, yes :).

> > +
> > +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> > +{
> > +    int idx;
> > +
> > +    vhost_dev_enable_notifiers(dev, dev->vdev);
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> > +{
> > +    int idx;
> > +
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
> > +    }
> > +
> > +    vhost_dev_disable_notifiers(dev, dev->vdev);
>
> There is a race condition if the guest kicks the vq while this is
> happening. The shadow vq hdev_notifier needs to be set so the vhost
> device checks the virtqueue for requests that slipped in during the
> race window.
>

I'm not sure if I follow you. If I understand correctly,
vhost_dev_disable_notifiers calls virtio_bus_cleanup_host_notifier,
and the latter calls virtio_queue_host_notifier_read. That's why the
documentation says "This might actually run the qemu handlers right
away, so virtio in qemu must be completely setup when this is
called.". Am I missing something?

> > +
> > +    return 0;
> > +}
> > +
> > +static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
> > +                                          bool enable_lm)
> > +{
> > +    if (enable_lm) {
> > +        return vhost_sw_live_migration_start(dev);
> > +    } else {
> > +        return vhost_sw_live_migration_stop(dev);
> > +    }
> > +}
> > +
> > +static void vhost_sw_lm_global_start(MemoryListener *listener)
> > +{
> > +    int r;
> > +
> > +    r = vhost_migration_log(listener, true, vhost_sw_live_migration_enable);
> > +    if (r < 0) {
> > +        abort();
> > +    }
> > +}
> > +
> > +static void vhost_sw_lm_global_stop(MemoryListener *listener)
> > +{
> > +    int r;
> > +
> > +    r = vhost_migration_log(listener, false, vhost_sw_live_migration_enable);
> > +    if (r < 0) {
> > +        abort();
> > +    }
> > +}
> > +
> >  static void vhost_log_start(MemoryListener *listener,
> >                              MemoryRegionSection *section,
> >                              int old, int new)
> > @@ -1334,9 +1421,14 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >          .region_nop = vhost_region_addnop,
> >          .log_start = vhost_log_start,
> >          .log_stop = vhost_log_stop,
> > -        .log_sync = vhost_log_sync,
> > -        .log_global_start = vhost_log_global_start,
> > -        .log_global_stop = vhost_log_global_stop,
> > +        .log_sync = !vhost_dev_can_log(hdev) ?
> > +                    vhost_log_sync_nop :
> > +                    vhost_log_sync,
>
> Why is this change necessary now? It's not clear to me why it was
> previously okay to call vhost_log_sync().
>

This is only needed because I'm hijacking the vhost log system to know
when migration has started. Since vhost log is not allocated, the call
to vhost_log_sync() will fail to write in the bitmap.

Likely, this change will be discarded in the final patch series, since
another way of detecting live migration will be used.

> > +        .log_global_start = !vhost_dev_can_log(hdev) ?
> > +                            vhost_sw_lm_global_start :
> > +                            vhost_log_global_start,
> > +        .log_global_stop = !vhost_dev_can_log(hdev) ? vhost_sw_lm_global_stop :
> > +                                                      vhost_log_global_stop,
> >          .eventfd_add = vhost_eventfd_add,
> >          .eventfd_del = vhost_eventfd_del,
> >          .priority = 10
> > @@ -1364,6 +1456,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >              error_free(hdev->migration_blocker);
> >              goto fail_busyloop;
> >          }
> > +    } else {
> > +        hdev->sw_lm_vq_handler = handle_sw_lm_vq;
> >      }
> >
> >      hdev->mem = g_malloc0(offsetof(struct vhost_memory, regions));
> > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > index fbff9bc9d4..17419cb13e 100644
> > --- a/hw/virtio/meson.build
> > +++ b/hw/virtio/meson.build
> > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> >
> >  virtio_ss = ss.source_set()
> >  virtio_ss.add(files('virtio.c'))
> > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
> > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-sw-lm-ring.c'))
> >  virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> >  virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> >  virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> > --
> > 2.18.4
> >




* Re: [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration
  2020-12-08  7:20   ` Stefan Hajnoczi
@ 2020-12-09 17:57     ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 17:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Dec 8, 2020 at 8:21 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:46PM +0100, Eugenio Pérez wrote:
> > @@ -1571,6 +1577,13 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
> >      BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
> >      int i, r;
> >
> > +    if (hdev->sw_lm_enabled) {
> > +        /* We've been called after migration is completed, so no need to
> > +           disable it again
> > +        */
> > +        return;
> > +    }
> > +
> >      for (i = 0; i < hdev->nvqs; ++i) {
> >          r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
> >                                           false);
>
> What is the purpose of this?

It is again a quick hack to get the shadow_vq POC working. Again, it
deserves a better comment :).

If I recall correctly, vhost-net calls vhost_dev_disable_notifiers
again on destruction, and it calls memory_region_del_eventfd, then
virtio_pci_ioeventfd_assign, which is not safe to call again because
of the i != mr->ioeventfd_nb assertion.

The right fix for this should be either in virtio-pci (more generic,
but I am not sure if calling it again is the expected semantics),
in the individual vhost devices (less generic), or where it is at this
moment, but with the right comment.

Thanks!




* Re: [RFC PATCH 10/27] vhost: Allocate shadow vring
  2020-12-08  8:17   ` Stefan Hajnoczi
@ 2020-12-09 18:15     ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 18:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Dec 8, 2020 at 9:18 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:48PM +0100, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/virtio/vhost-sw-lm-ring.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-sw-lm-ring.c b/hw/virtio/vhost-sw-lm-ring.c
> > index cbf53965cd..cd7b5ba772 100644
> > --- a/hw/virtio/vhost-sw-lm-ring.c
> > +++ b/hw/virtio/vhost-sw-lm-ring.c
> > @@ -16,8 +16,11 @@
> >  #include "qemu/event_notifier.h"
> >
> >  typedef struct VhostShadowVirtqueue {
> > +    struct vring vring;
> >      EventNotifier hdev_notifier;
> >      VirtQueue *vq;
> > +
> > +    vring_desc_t descs[];
> >  } VhostShadowVirtqueue;
>
> Looking at later patches I see that this is the vhost hdev vring state,
> not the VirtIODevice vring state. That makes more sense.

I will add a comment here too.




* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2020-12-08  8:16   ` Stefan Hajnoczi
@ 2020-12-09 18:41     ` Eugenio Perez Martin
  2020-12-10 11:55       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 18:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > -static inline bool vhost_vring_should_kick(VhostShadowVirtqueue *vq)
> > +static bool vhost_vring_should_kick_rcu(VhostShadowVirtqueue *vq)
>
> "vhost_vring_" is a confusing name. This is not related to
> vhost_virtqueue or the vhost_vring_* structs.
>
> vhost_shadow_vq_should_kick_rcu()?
>
> >  {
> > -    return virtio_queue_get_used_notify_split(vq->vq);
> > +    VirtIODevice *vdev = vq->vdev;
> > +    vq->num_added = 0;
>
> I'm surprised that a bool function modifies state. Is this assignment
> intentional?
>

It comes from the kernel code, the virtqueue_kick_prepare_split function.
The num_added member is internal (mutable) state that counts the buffers
added in the current batch, so the driver still sends a notification if
the uint16_t index wraps in vhost_vring_add_split with no notification in
between. I don't know whether any real virtio device could be affected by
this, since real vqs are smaller than (uint16_t)-1, so the device should
notice that more buffers have been added anyway.
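
For reference, a minimal sketch of that pattern, modeled on the idea behind virtqueue_kick_prepare_split but not a copy of the kernel code. DriverVq, vq_add() and vq_kick_prepare() are hypothetical names; only the comparison in need_event() mirrors vring_need_event() from the virtio_ring.h header, and the sketch assumes the event index feature is in use:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down driver-side state for one split virtqueue. */
typedef struct DriverVq {
    uint16_t avail_idx;   /* next avail index the driver will publish */
    uint16_t num_added;   /* buffers added since the last kick decision */
} DriverVq;

/* Same comparison as vring_need_event() in virtio_ring.h. */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Account for one buffer made available to the device. */
static void vq_add(DriverVq *vq)
{
    vq->avail_idx++;
    vq->num_added++;
}

/*
 * Decide whether the device must be notified. The batch counter is reset
 * here, which is why a bool-returning function ends up mutating state.
 */
static bool vq_kick_prepare(DriverVq *vq, uint16_t avail_event)
{
    uint16_t new_idx = vq->avail_idx;
    uint16_t old_idx = (uint16_t)(new_idx - vq->num_added);

    vq->num_added = 0;
    return need_event(avail_event, new_idx, old_idx);
}
```

The reset closes the batch: the next decision only considers buffers added after this call.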

> > +/* virtqueue_add:
> > + * @vq: The #VirtQueue
> > + * @elem: The #VirtQueueElement
>
> The copy-pasted doc comment doesn't match this function.
>

Right, I will fix it.

> > +int vhost_vring_add(VhostShadowVirtqueue *vq, VirtQueueElement *elem)
> > +{
> > +    int host_head = vhost_vring_add_split(vq, elem);
> > +    if (vq->ring_id_maps[host_head]) {
> > +        g_free(vq->ring_id_maps[host_head]);
> > +    }
>
> VirtQueueElement is freed lazily? Is there a reason for this approach? I
> would have expected it to be freed when the used element is process by
> the kick fd handler.
>

Maybe it makes more sense to free immediately in this commit and
introduce ring_id_maps in later commits, yes.

> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 9352c56bfa..304e0baa61 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -956,8 +956,34 @@ static void handle_sw_lm_vq(VirtIODevice *vdev, VirtQueue *vq)
> >      uint16_t idx = virtio_get_queue_index(vq);
> >
> >      VhostShadowVirtqueue *svq = hdev->sw_lm_shadow_vq[idx];
> > +    VirtQueueElement *elem;
> >
> > -    vhost_vring_kick(svq);
> > +    /*
> > +     * Make available all buffers as possible.
> > +     */
> > +    do {
> > +        if (virtio_queue_get_notification(vq)) {
> > +            virtio_queue_set_notification(vq, false);
> > +        }
> > +
> > +        while (true) {
> > +            int r;
> > +            if (virtio_queue_full(vq)) {
> > +                break;
> > +            }
>
> Why is this check necessary? The guest cannot provide more descriptors
> than there is ring space. If that happens somehow then it's a driver
> error that is already reported in virtqueue_pop() below.
>

It is only checked because virtqueue_pop prints an error in that case,
and there is no way to tell that error apart from one with a different
cause. Maybe the right thing to do is simply not to print that error?
The caller should do the error printing in that case. Should we return
an error code?

> I wonder if you added this because the vring implementation above
> doesn't support indirect descriptors? It's easy to exhaust the vhost
> hdev vring while there is still room in the VirtIODevice's VirtQueue
> vring.




* Re: [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element
  2020-12-08  8:25   ` Stefan Hajnoczi
@ 2020-12-09 18:46     ` Eugenio Perez Martin
  2020-12-10 11:57       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 18:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Dec 8, 2020 at 9:26 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:54PM +0100, Eugenio Pérez wrote:
> > Specifying VirtQueueElement * as the return type does no harm at this moment.
>
> The reason for the void * return type is that C implicitly converts void
> pointers to pointers of any type. The function takes a size_t sz
> argument so it can allocate an object of user-defined size. The idea is
> that the user's struct embeds a VirtQueueElement field. Changing the
> return type to VirtQueueElement * means that callers may need to
> explicitly cast to the user's struct type.
>
> It's a question of coding style but I think the void * return type
> communicates what is going on better than VirtQueueElement *.

Right, what I meant is that nobody uses that feature, but I just
re-checked and saw that contrib/vhost-user-blk actually uses it
(I did not check for more uses). I think it is better just to drop
this commit.
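
A minimal sketch of the embedding pattern described above, using hypothetical stand-in types (Element for VirtQueueElement, BlkReq for a device-specific request) and a hypothetical allocator, not the QEMU API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for VirtQueueElement. */
typedef struct Element {
    unsigned index;
    unsigned out_num;
} Element;

/* User-defined request that embeds the element as its first member. */
typedef struct BlkReq {
    Element elem;
    int status;
} BlkReq;

/*
 * Allocate sz zeroed bytes, where sz >= sizeof(Element), and initialize
 * the embedded element. Returning void * lets callers assign the result
 * to their own struct type without a cast.
 */
static void *alloc_element(size_t sz, unsigned out_num)
{
    Element *e;

    if (sz < sizeof(Element)) {
        return NULL;
    }
    e = calloc(1, sz);
    if (e) {
        e->out_num = out_num;
    }
    return e;
}

/* Caller side: no explicit cast is needed thanks to the void * return. */
static BlkReq *blk_req_new(unsigned out_num)
{
    BlkReq *req = alloc_element(sizeof(*req), out_num);

    return req;
}
```

If alloc_element() returned Element * instead, the assignment in blk_req_new() would need an explicit (BlkReq *) cast.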

Thanks!




* Re: [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu
  2020-12-08  8:41   ` Stefan Hajnoczi
@ 2020-12-09 18:48     ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2020-12-09 18:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Dec 8, 2020 at 9:42 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:56PM +0100, Eugenio Pérez wrote:
> > @@ -83,6 +89,18 @@ void vhost_vring_set_notification_rcu(VhostShadowVirtqueue *vq, bool enable)
> >      smp_mb();
> >  }
> >
> > +bool vhost_vring_poll_rcu(VhostShadowVirtqueue *vq)
>
> A name like "more_used" is clearer than "poll".

I agree, I will rename.

Thanks!




* Re: [RFC PATCH 00/27] vDPA software assisted live migration
  2020-12-09 15:57     ` Stefan Hajnoczi
@ 2020-12-10  9:12       ` Jason Wang
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Wang @ 2020-12-10  9:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, Stefan Hajnoczi, qemu-devel,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Eugenio Pérez, Lars Ganrot, Rob Miller,
	Stefano Garzarella, Howard Cai, Parav Pandit, vm, Salil Mehta,
	Stephen Finucane, Xiao W Wang, Sean Mooney, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy


On 2020/12/9 下午11:57, Stefan Hajnoczi wrote:
> On Wed, Dec 09, 2020 at 04:26:50AM -0500, Jason Wang wrote:
>> ----- Original Message -----
>>> On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
>>>> This series enables vDPA software assisted live migration for vhost-net
>>>> devices. This is a new method of vhost device migration: instead of
>>>> relying on the vDPA device's dirty logging capability, SW assisted LM
>>>> intercepts the dataplane, forwarding the descriptors between VM and device.
>>> Pros:
>>> + vhost/vDPA devices don't need to implement dirty memory logging
>>> + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
>>>
>>> Cons:
>>> - Not generic, relies on vhost-net-specific ioctls
>>> - Doesn't support VIRTIO Shared Memory Regions
>>>    https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
>> I may be missing something, but my understanding is that it's the
>> responsibility of the device to migrate this part?
> Good point. You're right.
>
>>> - Performance (see below)
>>>
>>> I think performance will be significantly lower when the shadow vq is
>>> enabled. Imagine a vDPA device with hardware vq doorbell registers
>>> mapped into the guest so the guest driver can directly kick the device.
>>> When the shadow vq is enabled a vmexit is needed to write to the shadow
>>> vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
>>> to read the ioeventfd, the descriptors are translated, QEMU writes to
>>> the vhost hdev kick fd, the host kernel scheduler switches to the vhost
>>> worker thread, vhost/vDPA notifies the virtqueue, and finally the
>>> vDPA driver writes to the hardware vq doorbell register. That is a lot
>>> of overhead compared to writing to an exitless MMIO register!
>> I think it's a balance. E.g we can poll the virtqueue to have an
>> exitless doorbell.
>>
>>> If the shadow vq was implemented in drivers/vhost/ and QEMU used the
>>> existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
>>> reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
>>> dirty memory logging happens asynchronously and isn't in the dataplane.
>>>
>>> In addition, hardware that supports dirty memory logging as well as
>>> software vDPA devices could completely eliminate the shadow vq for even
>>> better performance.
>> Yes. That's our plan. But the interface might require more thought.
>>
>> E.g. is the bitmap a good approach? To me, reporting dirty pages via a
>> virtqueue is better since it has a smaller footprint and is self-throttled.
>>
>> And we need an address space other than the one used by the guest for
>> either the bitmap or the virtqueue.
>>
>>> But performance is a question of "is it good enough?". Maybe this
>>> approach is okay and users don't expect good performance while dirty
>>> memory logging is enabled.
>> Yes, and actually such a slowdown may help the migration converge.
>>
>> Note that the whole idea is to try to have a generic solution for all
>> types of devices. It's good to consider the performance, but for the
>> first stage it should be sufficient to make it work and consider
>> optimizing on top.
> Moving the shadow vq to the kernel later would be quite a big change
> requiring rewriting much of the code. That's why I mentioned this now
> before a lot of effort is invested in a QEMU implementation.


Right.


>
>>> I just wanted to share the idea of moving the
>>> shadow vq into the kernel in case you like that approach better.
>> My understanding is to keep the kernel as simple as possible and leave the
>> policies to userspace as much as possible. E.g. it requires us to
>> disable doorbell mapping and irq offloading, all of which were under
>> the control of userspace.
> If the performance is acceptable with the QEMU approach then I think
> that's the best place to implement it. It looks high-overhead though so
> maybe one of the first things to do is to run benchmarks to collect data
> on how it performs?


Yes, I agree.

Thanks


>
> Stefan




* Re: [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler
  2020-12-09 15:02     ` Eugenio Perez Martin
@ 2020-12-10 11:30       ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-10 11:30 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy


On Wed, Dec 09, 2020 at 04:02:56PM +0100, Eugenio Perez Martin wrote:
> On Mon, Dec 7, 2020 at 5:52 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Nov 20, 2020 at 07:50:43PM +0100, Eugenio Pérez wrote:
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index 9179013ac4..9a69ae3598 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -2628,24 +2628,32 @@ static void virtio_net_tx_bh(void *opaque)
> > >      }
> > >  }
> > >
> > > -static void virtio_net_add_queue(VirtIONet *n, int index)
> > > +static void virtio_net_add_queue(VirtIONet *n, int index,
> > > +                                 VirtIOHandleOutput custom_handler)
> > >  {
> >
> > We talked about the possibility of moving this into the generic vhost
> > code so that devices don't need to be modified. It would be nice to hide
> > this feature inside vhost.
> 
> I'm thinking of tying it to VirtQueue, allowing the caller to override
> the handler when it knows the handler is not going to be called (I mean,
> without offering race condition protection, e.g. before qemu starts
> processing notifications by calling vhost_dev_disable_notifiers).

Yes, I can see how at least part of this belongs to VirtQueue.

Stefan



* Re: [RFC PATCH 07/27] vhost: Route guest->host notification through qemu
  2020-12-09 17:08     ` Eugenio Perez Martin
@ 2020-12-10 11:50       ` Stefan Hajnoczi
  2021-01-21 20:10         ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-10 11:50 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy


On Wed, Dec 09, 2020 at 06:08:14PM +0100, Eugenio Perez Martin wrote:
> On Mon, Dec 7, 2020 at 6:42 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Nov 20, 2020 at 07:50:45PM +0100, Eugenio Pérez wrote:
> > > +{
> > > +    struct vhost_vring_file file = {
> > > +        .index = idx
> > > +    };
> > > +    VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
> > > +    VhostShadowVirtqueue *svq;
> > > +    int r;
> > > +
> > > +    svq = g_new0(VhostShadowVirtqueue, 1);
> > > +    svq->vq = vq;
> > > +
> > > +    r = event_notifier_init(&svq->hdev_notifier, 0);
> > > +    assert(r == 0);
> > > +
> > > +    file.fd = event_notifier_get_fd(&svq->hdev_notifier);
> > > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > > +    assert(r == 0);
> > > +
> > > +    return svq;
> > > +}
> >
> > I guess there are assumptions about the status of the device? Does the
> > virtqueue need to be disabled when this function is called?
> >
> 
> Yes. Maybe an assertion checking the notification state?

Sounds good.

> > > +
> > > +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> > > +{
> > > +    int idx;
> > > +
> > > +    vhost_dev_enable_notifiers(dev, dev->vdev);
> > > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > > +        vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> > > +{
> > > +    int idx;
> > > +
> > > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > > +        dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
> > > +    }
> > > +
> > > +    vhost_dev_disable_notifiers(dev, dev->vdev);
> >
> > There is a race condition if the guest kicks the vq while this is
> > happening. The shadow vq hdev_notifier needs to be set so the vhost
> > device checks the virtqueue for requests that slipped in during the
> > race window.
> >
> 
> I'm not sure if I follow you. If I understand correctly,
> vhost_dev_disable_notifiers calls virtio_bus_cleanup_host_notifier,
> and the latter calls virtio_queue_host_notifier_read. That's why the
> documentation says "This might actually run the qemu handlers right
> away, so virtio in qemu must be completely setup when this is
> called.". Am I missing something?

There are at least two cases:

1. Virtqueue kicks that come before vhost_dev_disable_notifiers().
   vhost_dev_disable_notifiers() notices that and calls
   virtio_queue_notify_vq(). Will handle_sw_lm_vq() be invoked or is the
   device's vq handler function invoked?

2. Virtqueue kicks that come in after vhost_dev_disable_notifiers()
   returns. We hold the QEMU global mutex so the vCPU thread cannot
   perform MMIO/PIO dispatch immediately. The vCPU thread's
   ioctl(KVM_RUN) has already returned and will dispatch the
   MMIO/PIO access inside QEMU as soon as the global mutex is released.
   In other words, we're not taking the kvm.ko ioeventfd path but
   memory_region_dispatch_write_eventfds() should signal the ioeventfd
   that is registered at the time the dispatch occurs. Is that eventfd
   handled by handle_sw_lm_vq()?

Neither of these cases is obvious from the code. At least comments
would help but I suspect restructuring the code so the critical
ioeventfd state changes happen in a sequence would make it even clearer.
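
To illustrate why a kick that lands in the race window does not have to
be lost, here is a minimal sketch using a plain Linux eventfd. It is not
QEMU code: toy_notifier_init(), toy_kick() and toy_drain() are
hypothetical helpers. It only shows that an eventfd keeps a pending
counter, so whichever handler reads the fd after the switch still sees
kicks that were signalled before it was attached.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: create a non-blocking notifier fd. */
static int toy_notifier_init(void)
{
    int fd = eventfd(0, EFD_NONBLOCK);

    assert(fd >= 0);
    return fd;
}

/* Hypothetical helper: signal the notifier, as a guest kick would. */
static int toy_kick(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: consume pending kicks. Returns how many kicks were
 * coalesced in the eventfd counter, or 0 if nothing was pending.
 */
static uint64_t toy_drain(int fd)
{
    uint64_t count = 0;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0; /* read fails with EAGAIN when the counter is zero */
    }
    return count;
}
```

Kicks written before the new handler starts polling are reported by the
first toy_drain() call, so the handler can check the virtqueue once after
the switch instead of relying on the timing of the guest.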

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2020-12-09 18:41     ` Eugenio Perez Martin
@ 2020-12-10 11:55       ` Stefan Hajnoczi
  2021-01-22 18:18         ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-10 11:55 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1333 bytes --]

On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > +        while (true) {
> > > +            int r;
> > > +            if (virtio_queue_full(vq)) {
> > > +                break;
> > > +            }
> >
> > Why is this check necessary? The guest cannot provide more descriptors
> > than there is ring space. If that happens somehow then it's a driver
> > error that is already reported in virtqueue_pop() below.
> >
> 
> It's just checked because virtqueue_pop prints an error on that case,
> and there is no way to tell the difference between a regular error and
> another caused by other causes. Maybe the right thing to do is just to
> not to print that error? Caller should do the error printing in that
> case. Should we return an error code?

The reason an error is printed today is because it's a guest error that
never happens with correct guest drivers. Something is broken and the
user should know about it.

Why is "virtio_queue_full" (I already forgot what that actually means,
it's not clear whether this is referring to avail elements or used
elements) a condition that should be silently ignored in shadow vqs?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element
  2020-12-09 18:46     ` Eugenio Perez Martin
@ 2020-12-10 11:57       ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2020-12-10 11:57 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1476 bytes --]

On Wed, Dec 09, 2020 at 07:46:49PM +0100, Eugenio Perez Martin wrote:
> On Tue, Dec 8, 2020 at 9:26 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Fri, Nov 20, 2020 at 07:50:54PM +0100, Eugenio Pérez wrote:
> > > Specify VirtQueueElement * as return type makes no harm at this moment.
> >
> > The reason for the void * return type is that C implicitly converts void
> > pointers to pointers of any type. The function takes a size_t sz
> > argument so it can allocate a object of user-defined size. The idea is
> > that the user's struct embeds a VirtQueueElement field. Changing the
> > return type to VirtQueueElement * means that callers may need to
> > explicitly cast to the user's struct type.
> >
> > It's a question of coding style but I think the void * return type
> > communicates what is going on better than VirtQueueElement *.
> 
> Right, what I meant with that is that nobody uses that feature, but I
> just re-check and I saw that contrib/vhost-user-blk actually uses it
> (not checked for more uses). I think it is better just to drop this
> commit.

contrib/vhost-user-blk doesn't use hw/virtio/virtio.c. The code is
similar and copy-pasted, but you are free to change this file without
affecting contrib/vhost-user-blk :).

I still think it's clearer to make it obvious that this function
allocates an object of generic type or at least the change is purely a
question of style and probably not worth making.
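
A minimal sketch of the embedding pattern under discussion, using toy
types rather than the real QEMU ones: ToyElement, ToyRequest and
toy_alloc_element() are hypothetical names, and the real
VirtQueueElement has more fields than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for VirtQueueElement. */
typedef struct ToyElement {
    unsigned int index;
    unsigned int out_num;
    unsigned int in_num;
} ToyElement;

/* A caller-defined request that embeds the element as its first member. */
typedef struct ToyRequest {
    ToyElement elem;
    int status;
} ToyRequest;

/*
 * Allocate sz zeroed bytes whose first sizeof(ToyElement) bytes are the
 * element header. Returning void * lets the caller assign the result to
 * its own struct pointer without a cast.
 */
static void *toy_alloc_element(size_t sz, unsigned int out_num,
                               unsigned int in_num)
{
    ToyElement *elem;

    assert(sz >= sizeof(ToyElement));
    elem = calloc(1, sz);
    if (elem) {
        elem->out_num = out_num;
        elem->in_num = in_num;
    }
    return elem;
}
```

With this signature a caller writes
"ToyRequest *req = toy_alloc_element(sizeof(*req), 1, 1);". If the
function returned ToyElement * instead, the same assignment would need an
explicit cast to ToyRequest *.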

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2020-12-07 16:58   ` Stefan Hajnoczi
@ 2021-01-12 18:21     ` Eugenio Perez Martin
  2021-03-02 11:22       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-01-12 18:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Mon, Dec 7, 2020 at 5:58 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 07:50:44PM +0100, Eugenio Pérez wrote:
> > This function is just used for a few commits, so SW LM is developed
> > incrementally, and it is deleted after it is useful.
> >
> > For a few commits, only the events (irqfd, eventfd) are forwarded.
>
> s/eventfd/ioeventfd/ (irqfd is also an eventfd)
>

Oops, will fix, thanks!

> > +bool virtio_queue_get_used_notify_split(VirtQueue *vq)
> > +{
> > +    VRingMemoryRegionCaches *caches;
> > +    hwaddr pa = offsetof(VRingUsed, flags);
> > +    uint16_t flags;
> > +
> > +    RCU_READ_LOCK_GUARD();
> > +
> > +    caches = vring_get_region_caches(vq);
> > +    assert(caches);
> > +    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
> > +    return !(VRING_USED_F_NO_NOTIFY & flags);
> > +}
>
> QEMU stores the notification status:
>
> void virtio_queue_set_notification(VirtQueue *vq, int enable)
> {
>     vq->notification = enable; <---- here
>
>     if (!vq->vring.desc) {
>         return;
>     }
>
>     if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>         virtio_queue_packed_set_notification(vq, enable);
>     } else {
>         virtio_queue_split_set_notification(vq, enable);
>
> I'm wondering why it's necessary to fetch from guest RAM instead of
> using vq->notification? It also works for both split and packed
> queues so the code would be simpler.

To use vq->notification makes sense at the end of the series.

However, at this stage (just routing notifications, not descriptors),
vhost device is the one updating that flag, not qemu. Since we cannot
just migrate used ring memory to qemu without migrating descriptors
ring too, qemu needs to check guest's memory looking for vhost device
updates on that flag.

I can see how that deserves better documentation or even a better
name. Also, this function should be in the shadow vq file, not
virtio.c.
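
For readers following along, the check being discussed boils down to
testing one bit in the used ring header. The sketch below is a toy model
only: struct toy_used_hdr, toy_used_notify_enabled() and
toy_forward_kick() are hypothetical names, and it ignores endianness and
the memory region caches that the real function goes through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value the virtio spec assigns to VRING_USED_F_NO_NOTIFY. */
#define TOY_USED_F_NO_NOTIFY 1

/* Toy stand-in for the header of a split used ring. */
struct toy_used_hdr {
    uint16_t flags;
    uint16_t idx;
};

/* True when the device has not asked the driver to suppress kicks. */
static bool toy_used_notify_enabled(const struct toy_used_hdr *used)
{
    return !(used->flags & TOY_USED_F_NO_NOTIFY);
}

/*
 * Forward a guest kick only if the device still wants notifications.
 * Returns true and bumps *forwarded when the kick is passed on.
 */
static bool toy_forward_kick(const struct toy_used_hdr *used,
                             unsigned int *forwarded)
{
    assert(used && forwarded);
    if (!toy_used_notify_enabled(used)) {
        return false; /* device set NO_NOTIFY: drop the kick */
    }
    (*forwarded)++;
    return true;
}
```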

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 07/27] vhost: Route guest->host notification through qemu
  2020-12-10 11:50       ` Stefan Hajnoczi
@ 2021-01-21 20:10         ` Eugenio Perez Martin
  0 siblings, 0 replies; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-01-21 20:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Thu, Dec 10, 2020 at 12:51 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Dec 09, 2020 at 06:08:14PM +0100, Eugenio Perez Martin wrote:
> > On Mon, Dec 7, 2020 at 6:42 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Fri, Nov 20, 2020 at 07:50:45PM +0100, Eugenio Pérez wrote:
> > > > +{
> > > > +    struct vhost_vring_file file = {
> > > > +        .index = idx
> > > > +    };
> > > > +    VirtQueue *vq = virtio_get_queue(dev->vdev, idx);
> > > > +    VhostShadowVirtqueue *svq;
> > > > +    int r;
> > > > +
> > > > +    svq = g_new0(VhostShadowVirtqueue, 1);
> > > > +    svq->vq = vq;
> > > > +
> > > > +    r = event_notifier_init(&svq->hdev_notifier, 0);
> > > > +    assert(r == 0);
> > > > +
> > > > +    file.fd = event_notifier_get_fd(&svq->hdev_notifier);
> > > > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > > > +    assert(r == 0);
> > > > +
> > > > +    return svq;
> > > > +}
> > >
> > > I guess there are assumptions about the status of the device? Does the
> > > virtqueue need to be disabled when this function is called?
> > >
> >
> > Yes. Maybe an assertion checking the notification state?
>
> Sounds good.
>
> > > > +
> > > > +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> > > > +{
> > > > +    int idx;
> > > > +
> > > > +    vhost_dev_enable_notifiers(dev, dev->vdev);
> > > > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > > > +        vhost_sw_lm_shadow_vq_free(dev->sw_lm_shadow_vq[idx]);
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> > > > +{
> > > > +    int idx;
> > > > +
> > > > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > > > +        dev->sw_lm_shadow_vq[idx] = vhost_sw_lm_shadow_vq(dev, idx);
> > > > +    }
> > > > +
> > > > +    vhost_dev_disable_notifiers(dev, dev->vdev);
> > >
> > > There is a race condition if the guest kicks the vq while this is
> > > happening. The shadow vq hdev_notifier needs to be set so the vhost
> > > device checks the virtqueue for requests that slipped in during the
> > > race window.
> > >
> >
> > I'm not sure if I follow you. If I understand correctly,
> > vhost_dev_disable_notifiers calls virtio_bus_cleanup_host_notifier,
> > and the latter calls virtio_queue_host_notifier_read. That's why the
> > documentation says "This might actually run the qemu handlers right
> > away, so virtio in qemu must be completely setup when this is
> > called.". Am I missing something?
>
> There are at least two cases:
>
> 1. Virtqueue kicks that come before vhost_dev_disable_notifiers().
>    vhost_dev_disable_notifiers() notices that and calls
>    virtio_queue_notify_vq(). Will handle_sw_lm_vq() be invoked or is the
>    device's vq handler function invoked?
>

As I understand both the code and your question, no kick can reach
handle_sw_lm_vq before vhost_dev_disable_notifiers runs (in particular,
before the memory_region_add_eventfd calls made by
virtio_{pci,mmio,ccw}_ioeventfd_assign(true)). So these kicks will be
handled by the device.

> 2. Virtqueue kicks that come in after vhost_dev_disable_notifiers()
>    returns. We hold the QEMU global mutex so the vCPU thread cannot
>    perform MMIO/PIO dispatch immediately. The vCPU thread's
>    ioctl(KVM_RUN) has already returned and will dispatch dispatch the
>    MMIO/PIO access inside QEMU as soon as the global mutex is released.
>    In other words, we're not taking the kvm.ko ioeventfd path but
>    memory_region_dispatch_write_eventfds() should signal the ioeventfd
>    that is registered at the time the dispatch occurs. Is that eventfd
>    handled by handle_sw_lm_vq()?
>

I didn't think of that case, but it is proving very difficult for me to
reproduce that behavior. It should be handled by handle_sw_lm_vq, but
maybe I'm trusting vhost_dev_disable_notifiers too much.

> Neither of these cases are obvious from the code. At least comments
> would help but I suspect restructuring the code so the critical
> ioeventfd state changes happen in a sequence would make it even clearer.

Could you expand on this? That change is managed entirely by
virtio_bus_set_host_notifier, and the virtqueue callback is already
changed before the call to vhost_dev_disable_notifiers(). Did you mean
to restructure virtio_bus_set_host_notifier or
vhost_dev_disable_notifiers maybe?

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2020-12-10 11:55       ` Stefan Hajnoczi
@ 2021-01-22 18:18         ` Eugenio Perez Martin
       [not found]           ` <CAJaqyWdNeaboGaSsXPA8r=mUsbctFLzACFKLX55yRQpTvjqxJw@mail.gmail.com>
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-01-22 18:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > +        while (true) {
> > > > +            int r;
> > > > +            if (virtio_queue_full(vq)) {
> > > > +                break;
> > > > +            }
> > >
> > > Why is this check necessary? The guest cannot provide more descriptors
> > > than there is ring space. If that happens somehow then it's a driver
> > > error that is already reported in virtqueue_pop() below.
> > >
> >
> > It's just checked because virtqueue_pop prints an error on that case,
> > and there is no way to tell the difference between a regular error and
> > another caused by other causes. Maybe the right thing to do is just to
> > not to print that error? Caller should do the error printing in that
> > case. Should we return an error code?
>
> The reason an error is printed today is because it's a guest error that
> never happens with correct guest drivers. Something is broken and the
> user should know about it.
>
> Why is "virtio_queue_full" (I already forgot what that actually means,
> it's not clear whether this is referring to avail elements or used
> elements) a condition that should be silently ignored in shadow vqs?
>

TL;DR: It can be changed to a check of the number of available
descriptors in shadow vq, instead of returning as a regular operation.
However, I think that making it a special return of virtqueue_pop
could help in devices that run to completion, avoiding having to
duplicate the count logic in them.

The function virtio_queue_empty checks whether the vq has no descriptors
left for the device to fetch, so the device has no more work to do until
the driver makes another descriptor available. I can see how it can be a
bad name choice, but virtio_queue_full means the opposite: the device has
pop()'ed every available descriptor and has not returned any, so the
driver cannot progress until the device marks some descriptors as used.

As I understand it, vq->inuse > vq->vring.num would mean we have a bug in
the device vq code, not in the driver. virtio_queue_full could even be
changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse ==
vq->vring.num", as long as vq->inuse is maintained correctly.

If we hit vq->inuse == vq->vring.num in virtqueue_pop, it means the device
tried to pop() one more buffer after every buffer had already been made
available and popped. This would be invalid if the device is correctly
counting the number of in-use descriptors, but then we are duplicating
that logic in the device and the vq.
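
The check proposed above can be sketched on a toy queue as follows.
struct toy_vq and toy_queue_full() are hypothetical names; the two fields
only mirror the in-use counter and the ring size being discussed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy queue: ring size plus the count of popped, not yet returned buffers. */
struct toy_vq {
    unsigned int num;
    unsigned int inuse;
};

/*
 * Full means every slot of the ring is held by the device. Having more
 * buffers in use than slots would be a bug in the device-side accounting,
 * so it is asserted rather than reported as a guest error.
 */
static bool toy_queue_full(const struct toy_vq *vq)
{
    assert(vq->inuse <= vq->num);
    return vq->inuse == vq->num;
}
```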

In shadow vq this situation happens with a correct guest network
driver, since the rx queue is filled for the device to write. Network
devices in qemu fetch descriptors on demand, but shadow vq fetches all
the available ones in a batch. If the driver just happens to fill the
queue of available descriptors, the error will be logged, so we need to
check in handle_sw_lm_vq before calling pop(). Of course the shadow vq
can duplicate guest_vq->inuse instead of checking virtio_queue_full, but
then it needs to check two things for every virtqueue_pop() [1].
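
A toy model of the batching behaviour described above, under one loud
assumption: toy_pop() counts an error whenever it is called while every
slot is already in use. The real virtqueue_pop() may order its checks
differently, and struct toy_batch_vq, toy_pop() and toy_fetch_batch() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_batch_vq {
    unsigned int num;    /* ring size */
    unsigned int inuse;  /* popped but not yet returned as used */
    unsigned int avail;  /* made available by the driver, not yet popped */
    unsigned int errors; /* times pop was called with every slot in use */
};

/* Assumption: popping while the queue is full is reported as an error. */
static bool toy_pop(struct toy_batch_vq *vq)
{
    assert(vq->inuse <= vq->num);
    if (vq->inuse == vq->num) {
        vq->errors++;
        return false;
    }
    if (vq->avail == 0) {
        return false;
    }
    vq->avail--;
    vq->inuse++;
    return true;
}

/* Fetch everything available in one go, optionally checking for full. */
static unsigned int toy_fetch_batch(struct toy_batch_vq *vq, bool check_full)
{
    unsigned int fetched = 0;

    for (;;) {
        if (check_full && vq->inuse == vq->num) {
            break; /* stop quietly instead of provoking the error */
        }
        if (!toy_pop(vq)) {
            break;
        }
        fetched++;
    }
    return fetched;
}
```

With a ring of 4 entries that the driver has completely filled, both
variants fetch 4 buffers, but only the variant without the full check
makes the extra pop call that gets counted as an error.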

Having said that, would you prefer a check of the available slots in
the shadow vq?

Thanks!

[1] if we don't change virtqueue_pop code.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2021-01-12 18:21     ` Eugenio Perez Martin
@ 2021-03-02 11:22       ` Stefan Hajnoczi
  2021-03-02 18:34         ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-03-02 11:22 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 2474 bytes --]

On Tue, Jan 12, 2021 at 07:21:27PM +0100, Eugenio Perez Martin wrote:
> On Mon, Dec 7, 2020 at 5:58 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Fri, Nov 20, 2020 at 07:50:44PM +0100, Eugenio Pérez wrote:
> > > This function is just used for a few commits, so SW LM is developed
> > > incrementally, and it is deleted after it is useful.
> > >
> > > For a few commits, only the events (irqfd, eventfd) are forwarded.
> >
> > s/eventfd/ioeventfd/ (irqfd is also an eventfd)
> >
> 
> Oops, will fix, thanks!
> 
> > > +bool virtio_queue_get_used_notify_split(VirtQueue *vq)
> > > +{
> > > +    VRingMemoryRegionCaches *caches;
> > > +    hwaddr pa = offsetof(VRingUsed, flags);
> > > +    uint16_t flags;
> > > +
> > > +    RCU_READ_LOCK_GUARD();
> > > +
> > > +    caches = vring_get_region_caches(vq);
> > > +    assert(caches);
> > > +    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
> > > +    return !(VRING_USED_F_NO_NOTIFY & flags);
> > > +}
> >
> > QEMU stores the notification status:
> >
> > void virtio_queue_set_notification(VirtQueue *vq, int enable)
> > {
> >     vq->notification = enable; <---- here
> >
> >     if (!vq->vring.desc) {
> >         return;
> >     }
> >
> >     if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >         virtio_queue_packed_set_notification(vq, enable);
> >     } else {
> >         virtio_queue_split_set_notification(vq, enable);
> >
> > I'm wondering why it's necessary to fetch from guest RAM instead of
> > using vq->notification? It also works for both split and packed
> > queues so the code would be simpler.
> 
> To use vq->notification makes sense at the end of the series.
> 
> However, at this stage (just routing notifications, not descriptors),
> vhost device is the one updating that flag, not qemu. Since we cannot
> just migrate used ring memory to qemu without migrating descriptors
> ring too, qemu needs to check guest's memory looking for vhost device
> updates on that flag.
> 
> I can see how that deserves better documentation or even a better
> name. Also, this function should be in the shadow vq file, not
> virtio.c.

I can't think of a reason why QEMU needs to know the flag value that the
vhost device has set. This flag is a hint to the guest driver indicating
whether the device wants to receive notifications.

Can you explain why QEMU needs to look at the value of the flag?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2021-03-02 11:22       ` Stefan Hajnoczi
@ 2021-03-02 18:34         ` Eugenio Perez Martin
  2021-03-08 10:46           ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-03-02 18:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm list, Michael S. Tsirkin, Jason Wang, qemu-level,
	Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Stefan Hajnoczi, Jim Harford,
	Dmytro Kazantsev, Siwei Liu, Harpreet Singh Anand, Michael Lilja,
	Max Gurtovoy

On Tue, Mar 2, 2021 at 12:22 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Tue, Jan 12, 2021 at 07:21:27PM +0100, Eugenio Perez Martin wrote:
> > On Mon, Dec 7, 2020 at 5:58 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Fri, Nov 20, 2020 at 07:50:44PM +0100, Eugenio Pérez wrote:
> > > > This function is just used for a few commits, so SW LM is developed
> > > > incrementally, and it is deleted after it is useful.
> > > >
> > > > For a few commits, only the events (irqfd, eventfd) are forwarded.
> > >
> > > s/eventfd/ioeventfd/ (irqfd is also an eventfd)
> > >
> >
> > Oops, will fix, thanks!
> >
> > > > +bool virtio_queue_get_used_notify_split(VirtQueue *vq)
> > > > +{
> > > > +    VRingMemoryRegionCaches *caches;
> > > > +    hwaddr pa = offsetof(VRingUsed, flags);
> > > > +    uint16_t flags;
> > > > +
> > > > +    RCU_READ_LOCK_GUARD();
> > > > +
> > > > +    caches = vring_get_region_caches(vq);
> > > > +    assert(caches);
> > > > +    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
> > > > +    return !(VRING_USED_F_NO_NOTIFY & flags);
> > > > +}
> > >
> > > QEMU stores the notification status:
> > >
> > > void virtio_queue_set_notification(VirtQueue *vq, int enable)
> > > {
> > >     vq->notification = enable; <---- here
> > >
> > >     if (!vq->vring.desc) {
> > >         return;
> > >     }
> > >
> > >     if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> > >         virtio_queue_packed_set_notification(vq, enable);
> > >     } else {
> > >         virtio_queue_split_set_notification(vq, enable);
> > >
> > > I'm wondering why it's necessary to fetch from guest RAM instead of
> > > using vq->notification? It also works for both split and packed
> > > queues so the code would be simpler.
> >
> > To use vq->notification makes sense at the end of the series.
> >
> > However, at this stage (just routing notifications, not descriptors),
> > vhost device is the one updating that flag, not qemu. Since we cannot
> > just migrate used ring memory to qemu without migrating descriptors
> > ring too, qemu needs to check guest's memory looking for vhost device
> > updates on that flag.
> >
> > I can see how that deserves better documentation or even a better
> > name. Also, this function should be in the shadow vq file, not
> > virtio.c.
>
> I can't think of a reason why QEMU needs to know the flag value that the
> vhost device has set. This flag is a hint to the guest driver indicating
> whether the device wants to receive notifications.
>
> Can you explain why QEMU needs to look at the value of the flag?
>
> Stefan

My bad, "need" is not the right word: SVQ could just forward the
notification at this point without checking the flag. Taking into
account that it's not used in later series, and it's even removed in
patch 14/27 of this series, I can see that it just adds noise to the
entire patchset.

This function just allows svq to re-check the flag after the guest
sends the notification. This way svq is able to drop the kick as a
(premature?) optimization in case the device sets it just after the
guest sends the kick.

Until patch 13/27, only notifications are forwarded, not buffers. VM
guest's drivers and vhost device still read and write at usual
addresses, but ioeventfd and kvmfd are intercepted by qemu. This
allows us to test if the notification forwarding part is doing ok.
From patch 14 of this series, svq offers a new vring to the device in
qemu's VAS, so the former does not need to check the guest's memory
anymore, and this function can be dropped.

Is it clearer now? Please let me know if I should add something else.

Thanks!



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split
  2021-03-02 18:34         ` Eugenio Perez Martin
@ 2021-03-08 10:46           ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-03-08 10:46 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: kvm list, Michael S. Tsirkin, Stefan Hajnoczi, Jason Wang,
	qemu-level, Daniel Daly, virtualization, Liran Alon, Eli Cohen,
	Nitin Shrivastav, Alex Barba, Christophe Fontaine, Juan Quintela,
	Lee Ballard, Lars Ganrot, Rob Miller, Stefano Garzarella,
	Howard Cai, Parav Pandit, vm, Salil Mehta, Stephen Finucane,
	Xiao W Wang, Sean Mooney, Jim Harford, Dmytro Kazantsev,
	Siwei Liu, Harpreet Singh Anand, Michael Lilja, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 4131 bytes --]

On Tue, Mar 02, 2021 at 07:34:20PM +0100, Eugenio Perez Martin wrote:
> On Tue, Mar 2, 2021 at 12:22 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Tue, Jan 12, 2021 at 07:21:27PM +0100, Eugenio Perez Martin wrote:
> > > On Mon, Dec 7, 2020 at 5:58 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 20, 2020 at 07:50:44PM +0100, Eugenio Pérez wrote:
> > > > > This function is just used for a few commits, so SW LM is developed
> > > > > incrementally, and it is deleted after it is useful.
> > > > >
> > > > > For a few commits, only the events (irqfd, eventfd) are forwarded.
> > > >
> > > > s/eventfd/ioeventfd/ (irqfd is also an eventfd)
> > > >
> > >
> > > Oops, will fix, thanks!
> > >
> > > > > +bool virtio_queue_get_used_notify_split(VirtQueue *vq)
> > > > > +{
> > > > > +    VRingMemoryRegionCaches *caches;
> > > > > +    hwaddr pa = offsetof(VRingUsed, flags);
> > > > > +    uint16_t flags;
> > > > > +
> > > > > +    RCU_READ_LOCK_GUARD();
> > > > > +
> > > > > +    caches = vring_get_region_caches(vq);
> > > > > +    assert(caches);
> > > > > +    flags = virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
> > > > > +    return !(VRING_USED_F_NO_NOTIFY & flags);
> > > > > +}
> > > >
> > > > QEMU stores the notification status:
> > > >
> > > > void virtio_queue_set_notification(VirtQueue *vq, int enable)
> > > > {
> > > >     vq->notification = enable; <---- here
> > > >
> > > >     if (!vq->vring.desc) {
> > > >         return;
> > > >     }
> > > >
> > > >     if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> > > >         virtio_queue_packed_set_notification(vq, enable);
> > > >     } else {
> > > >         virtio_queue_split_set_notification(vq, enable);
> > > >
> > > > I'm wondering why it's necessary to fetch from guest RAM instead of
> > > > using vq->notification? It also works for both split and packed
> > > > queues so the code would be simpler.
> > >
> > > To use vq->notification makes sense at the end of the series.
> > >
> > > However, at this stage (just routing notifications, not descriptors),
> > > vhost device is the one updating that flag, not qemu. Since we cannot
> > > just migrate used ring memory to qemu without migrating descriptors
> > > ring too, qemu needs to check guest's memory looking for vhost device
> > > updates on that flag.
> > >
> > > I can see how that deserves better documentation or even a better
> > > name. Also, this function should be in the shadow vq file, not
> > > virtio.c.
> >
> > I can't think of a reason why QEMU needs to know the flag value that the
> > vhost device has set. This flag is a hint to the guest driver indicating
> > whether the device wants to receive notifications.
> >
> > Can you explain why QEMU needs to look at the value of the flag?
> >
> > Stefan
> 
> My bad, "need" is not the right word: SVQ could just forward the
> notification at this point without checking the flag. Taking into
> account that it's not used in later series, and it's even removed in
> patch 14/27 of this series, I can see that it just adds noise to the
> entire patchset
> 
> This function just allows svq to re-check the flag after the guest
> sends the notification. This way svq is able to drop the kick as a
> (premature?) optimization in case the device sets it just after the
> guest sends the kick.
> 
> Until patch 13/27, only notifications are forwarded, not buffers. VM
> guest's drivers and vhost device still read and write at usual
> addresses, but ioeventfd and kvmfd are intercepted by qemu. This
> allows us to test if the notification forwarding part is doing ok.
> From patch 14 of this series, svq offers a new vring to the device in
> qemu's VAS, so the former does not need to check the guest's memory
> anymore, and this function can be dropped.
> 
> Is it clearer now? Please let me know if I should add something else.

Thanks for explaining. You could drop it to simplify the code. If you
leave it in, please include a comment explaining the purpose.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 13/27] vhost: Send buffers to device
       [not found]           ` <CAJaqyWdNeaboGaSsXPA8r=mUsbctFLzACFKLX55yRQpTvjqxJw@mail.gmail.com>
@ 2021-03-22 10:51             ` Stefan Hajnoczi
  2021-03-22 15:55               ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22 10:51 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel

On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote:
> On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > > > +        while (true) {
> > > > > > +            int r;
> > > > > > +            if (virtio_queue_full(vq)) {
> > > > > > +                break;
> > > > > > +            }
> > > > >
> > > > > Why is this check necessary? The guest cannot provide more descriptors
> > > > > than there is ring space. If that happens somehow then it's a driver
> > > > > error that is already reported in virtqueue_pop() below.
> > > > >
> > > >
> > > > It's just checked because virtqueue_pop prints an error on that case,
> > > > and there is no way to tell the difference between a regular error and
> > > > another caused by other causes. Maybe the right thing to do is just to
> > > > not to print that error? Caller should do the error printing in that
> > > > case. Should we return an error code?
> > >
> > > The reason an error is printed today is because it's a guest error that
> > > never happens with correct guest drivers. Something is broken and the
> > > user should know about it.
> > >
> > > Why is "virtio_queue_full" (I already forgot what that actually means,
> > > it's not clear whether this is referring to avail elements or used
> > > elements) a condition that should be silently ignored in shadow vqs?
> > >
> >
> > TL;DR: It can be changed to a check of the number of available
> > descriptors in shadow vq, instead of returning as a regular operation.
> > However, I think that making it a special return of virtqueue_pop
> > could help in devices that run to completion, avoiding having to
> > duplicate the count logic in them.
> >
> > The function virtio_queue_empty checks if the vq has all descriptors
> > available, so the device has no more work to do until the driver makes
> > another descriptor available. I can see how it can be a bad name
> > choice, but virtio_queue_full means the opposite: device has pop()
> > every descriptor available, and it has not returned any, so the driver
> > cannot progress until the device marks some descriptors as used.
> >
> > As I understand, if vq->in_use >vq->num would mean we have a bug in
> > the device vq code, not in the driver. virtio_queue_full could even be
> > changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse ==
> > vq->vring.num", as long as vq->in_use is operated right.
> >
> > If we hit vq->in_use == vq->num in virtqueue_pop it means the device
> > tried to pop() one more buffer after having all of them available and
> > pop'ed. This would be invalid if the device is counting right the
> > number of in_use descriptors, but then we are duplicating that logic
> > in the device and the vq.

Devices call virtqueue_pop() until it returns NULL. They don't need to
count virtqueue buffers explicitly. It returns NULL when vq->num
virtqueue buffers have already been popped (either because
virtio_queue_empty() is true or because an invalid driver state is
detected by checking vq->num in virtqueue_pop()).
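
As a minimal sketch of that pattern (a toy model, not QEMU's VirtQueue
code: every toy_ name below is made up for illustration, and only the
counters relevant to this discussion are modelled), a run-to-completion
handler can look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, NOT QEMU's VirtQueue: only the counters that matter here. */
typedef struct ToyVirtQueue {
    uint16_t num;            /* ring size, assumed to be in 1..256 */
    uint16_t avail_idx;      /* bumped by the driver per exposed buffer */
    uint16_t last_avail_idx; /* next buffer the device will pop */
    unsigned inuse;          /* popped but not yet returned as used */
    unsigned errors;         /* times the "invalid driver state" path hit */
    int elems[256];          /* stand-in for element storage */
} ToyVirtQueue;

static int toy_queue_empty(const ToyVirtQueue *vq)
{
    return vq->avail_idx == vq->last_avail_idx;
}

/* Returns NULL when there is nothing to pop or the state is invalid. */
static int *toy_virtqueue_pop(ToyVirtQueue *vq)
{
    if (toy_queue_empty(vq)) {
        return NULL;            /* normal case: no more work to do */
    }
    if (vq->inuse >= vq->num) {
        vq->errors++;           /* more buffers exposed than ring slots */
        return NULL;
    }
    vq->inuse++;
    return &vq->elems[vq->last_avail_idx++ % vq->num];
}

/* Run-to-completion handler: the device never counts buffers itself. */
static unsigned toy_handle_output(ToyVirtQueue *vq)
{
    unsigned handled = 0;

    while (toy_virtqueue_pop(vq) != NULL) {
        handled++;  /* a real device would process and push the element */
    }
    return handled;
}
```

With a ring of 4 entries and 4 buffers exposed by the driver,
toy_handle_output() pops 4 elements and then stops on the empty check,
without ever touching the error path.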

> > In shadow vq this situation happens with the correct guest network
> > driver, since the rx queue is filled for the device to write. Network
> > device in qemu fetch descriptors on demand, but shadow vq fetch all
> > available in batching. If the driver just happens to fill the queue of
> > available descriptors, the log will raise, so we need to check in
> > handle_sw_lm_vq before calling pop(). Of course the shadow vq can
> > duplicate guest_vq->in_use instead of checking virtio_queue_full, but
> > then it needs to check two things for every virtqueue_pop() [1].

I don't understand this scenario. It sounds like you are saying the
guest and shadow rx vq are not in sync so there is a case where
vq->in_use > vq->num is triggered? I'm not sure how that can happen
since both vqs have equal vq->num. Maybe you can explain the scenario in
more detail?

Stefan


* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2021-03-22 10:51             ` Stefan Hajnoczi
@ 2021-03-22 15:55               ` Eugenio Perez Martin
  2021-03-22 17:40                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-03-22 15:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

On Mon, Mar 22, 2021 at 11:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote:
> > On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > > > > +        while (true) {
> > > > > > > +            int r;
> > > > > > > +            if (virtio_queue_full(vq)) {
> > > > > > > +                break;
> > > > > > > +            }
> > > > > >
> > > > > > Why is this check necessary? The guest cannot provide more descriptors
> > > > > > than there is ring space. If that happens somehow then it's a driver
> > > > > > error that is already reported in virtqueue_pop() below.
> > > > > >
> > > > >
> > > > > It's just checked because virtqueue_pop prints an error on that case,
> > > > > and there is no way to tell the difference between a regular error and
> > > > > another caused by other causes. Maybe the right thing to do is just to
> > > > > not to print that error? Caller should do the error printing in that
> > > > > case. Should we return an error code?
> > > >
> > > > The reason an error is printed today is because it's a guest error that
> > > > never happens with correct guest drivers. Something is broken and the
> > > > user should know about it.
> > > >
> > > > Why is "virtio_queue_full" (I already forgot what that actually means,
> > > > it's not clear whether this is referring to avail elements or used
> > > > elements) a condition that should be silently ignored in shadow vqs?
> > > >
> > >
> > > TL;DR: It can be changed to a check of the number of available
> > > descriptors in shadow vq, instead of returning as a regular operation.
> > > However, I think that making it a special return of virtqueue_pop
> > > could help in devices that run to completion, avoiding having to
> > > duplicate the count logic in them.
> > >
> > > The function virtio_queue_empty checks if the vq has all descriptors
> > > available, so the device has no more work to do until the driver makes
> > > another descriptor available. I can see how it can be a bad name
> > > choice, but virtio_queue_full means the opposite: device has pop()
> > > every descriptor available, and it has not returned any, so the driver
> > > cannot progress until the device marks some descriptors as used.
> > >
> > > As I understand, if vq->in_use >vq->num would mean we have a bug in
> > > the device vq code, not in the driver. virtio_queue_full could even be
> > > changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse ==
> > > vq->vring.num", as long as vq->in_use is operated right.
> > >
> > > If we hit vq->in_use == vq->num in virtqueue_pop it means the device
> > > tried to pop() one more buffer after having all of them available and
> > > pop'ed. This would be invalid if the device is counting right the
> > > number of in_use descriptors, but then we are duplicating that logic
> > > in the device and the vq.
>
> Devices call virtqueue_pop() until it returns NULL. They don't need to
> count virtqueue buffers explicitly. It returns NULL when vq->num
> virtqueue buffers have already been popped (either because
> virtio_queue_empty() is true or because an invalid driver state is
> detected by checking vq->num in virtqueue_pop()).

If I understood you right, virtio_queue_full addresses the reverse
problem: it detects when the virtqueue has run out of buffers to make
available to the device because the latter has not consumed any, not
when the driver does not offer more buffers to the device because it
has no more data to offer.

I find it easier to explain with the virtio-net rx queue (and I think
it's the easiest way to trigger this issue). You are describing its
regular behavior: The guest fills it (let's say 100%), and the device
picks buffers one by one:

virtio_net_receive_rcu:
while (offset < size) {
    elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
    if (!elem) {
        virtio_error("unexpected empty queue");
    }
    /* [1] */
    /* fill elem with rx packet */
    virtqueue_fill(virtqueue, elem);
    ...
    virtqueue_flush(q->rx_vq, i);
}

Every device, as far as I know, does this buffer by buffer: there is
just processing code in [1], and it never tries to pop more than one
buffer/chain of buffers at the same time. In the case of an empty
queue (no more available buffers), we hit an error, because there are
no more buffers to write to. In other devices (or the tx queue), an
empty queue means there is no more work to do, not an error.

In the case of shadow virtqueue, we cannot limit the number of exposed
rx buffers to 1 buffer/chain of buffers in [1], since it would affect
batching. We have the opposite problem: All devices (but the rx queue)
want the queue to be "as empty as possible", or "to mark all available
buffers empty". The net rx queue is ok as long as it has a buffer/buffer
chain big enough to write to, but it will fetch them on demand, so
"queue full" (as in all buffers are available) is not a problem for
the device.

However, the part of the shadow virtqueue that forwards the available
buffers seeks the opposite: It wants as many buffers as possible to be
available. That means that there is no [1] code that fills/reads and
flushes/detaches the buffer immediately: Shadow virtqueue wants to make
available as many buffers as possible, but the device may not use them
until it has more data available. In the extreme case (virtio-net rx
queue full), shadow virtqueue may make all buffers available, so in a
while(true) loop, it will keep trying to make them available until it
finds that all the buffers are already available (vq->in_use == vq->num).

The solution is to check the number of buffers already available
before calling virtqueue_pop(). We could duplicate in_use in shadow
virtqueue, of course, but everything we need is already done in
VirtQueue code, so I think reusing it is a better solution. Another
solution could be to treat vq->in_use == vq->num as a special return
code with no printed error in virtqueue_pop, but exposing whether the
queue is full (as vq->in_use == vq->num) sounds less invasive to me.
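
To make the proposal concrete, here is a rough sketch of the check and
of a forwarding loop that uses it. It is a toy model only: ToyQueue and
the toy_ functions are hypothetical names, not the code in the patch,
and "pending" simply stands for buffers the guest has exposed that SVQ
has not popped yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the guest-facing queue as seen by the forwarding code. */
typedef struct ToyQueue {
    unsigned num;      /* ring size */
    unsigned pending;  /* exposed by the guest, not popped yet */
    unsigned inuse;    /* popped and forwarded, not yet used by device */
} ToyQueue;

/* "Full" here means: every ring slot is already popped and in flight. */
static bool toy_queue_full(const ToyQueue *vq)
{
    assert(vq->inuse <= vq->num);
    return vq->inuse == vq->num;
}

/* Forward as many buffers as possible; returns how many were forwarded. */
static unsigned toy_forward_avail(ToyQueue *vq)
{
    unsigned forwarded = 0;

    while (true) {
        if (toy_queue_full(vq)) {
            break;          /* nothing left that could be popped */
        }
        if (vq->pending == 0) {
            break;          /* guest has not exposed more buffers */
        }
        vq->pending--;      /* stands in for virtqueue_pop() */
        vq->inuse++;
        forwarded++;
    }
    return forwarded;
}

/* The device returned n buffers as used, freeing n ring slots. */
static void toy_device_used(ToyQueue *vq, unsigned n)
{
    assert(n <= vq->inuse);
    vq->inuse -= n;
}
```

In this model the loop stops quietly once all num buffers are in
flight, and resumes only after the device has marked some as used and
the guest has exposed new ones.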

>
> > > In shadow vq this situation happens with the correct guest network
> > > driver, since the rx queue is filled for the device to write. Network
> > > device in qemu fetch descriptors on demand, but shadow vq fetch all
> > > available in batching. If the driver just happens to fill the queue of
> > > available descriptors, the log will raise, so we need to check in
> > > handle_sw_lm_vq before calling pop(). Of course the shadow vq can
> > > duplicate guest_vq->in_use instead of checking virtio_queue_full, but
> > > then it needs to check two things for every virtqueue_pop() [1].
>
> I don't understand this scenario. It sounds like you are saying the
> guest and shadow rx vq are not in sync so there is a case where
> vq->in_use > vq->num is triggered?

Sorry if I explained it badly, what I meant is that there is a case
where SVQ (as device code) will call virtqueue_pop() when vq->in_use ==
vq->num. virtio_queue_full keeps the check as >=, but I think it
should even be safe to code virtio_queue_full as:

assert(vq->inuse <= vq->vring.num);
return vq->inuse == vq->vring.num;

Please let me know if this is not clear enough.

Thanks!

> I'm not sure how that can happen
> since both vqs have equal vq->num. Maybe you can explain the scenario in
> more detail?
>
> Stefan




* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2021-03-22 15:55               ` Eugenio Perez Martin
@ 2021-03-22 17:40                 ` Stefan Hajnoczi
  2021-03-24 19:04                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22 17:40 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel

On Mon, Mar 22, 2021 at 04:55:13PM +0100, Eugenio Perez Martin wrote:
> On Mon, Mar 22, 2021 at 11:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote:
> > > On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > > > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > > > > > +        while (true) {
> > > > > > > > +            int r;
> > > > > > > > +            if (virtio_queue_full(vq)) {
> > > > > > > > +                break;
> > > > > > > > +            }
> > > > > > >
> > > > > > > Why is this check necessary? The guest cannot provide more descriptors
> > > > > > > than there is ring space. If that happens somehow then it's a driver
> > > > > > > error that is already reported in virtqueue_pop() below.
> > > > > > >
> > > > > >
> > > > > > It's just checked because virtqueue_pop prints an error on that case,
> > > > > > and there is no way to tell the difference between a regular error and
> > > > > > another caused by other causes. Maybe the right thing to do is just to
> > > > > > not to print that error? Caller should do the error printing in that
> > > > > > case. Should we return an error code?
> > > > >
> > > > > The reason an error is printed today is because it's a guest error that
> > > > > never happens with correct guest drivers. Something is broken and the
> > > > > user should know about it.
> > > > >
> > > > > Why is "virtio_queue_full" (I already forgot what that actually means,
> > > > > it's not clear whether this is referring to avail elements or used
> > > > > elements) a condition that should be silently ignored in shadow vqs?
> > > > >
> > > >
> > > > TL;DR: It can be changed to a check of the number of available
> > > > descriptors in shadow vq, instead of returning as a regular operation.
> > > > However, I think that making it a special return of virtqueue_pop
> > > > could help in devices that run to completion, avoiding having to
> > > > duplicate the count logic in them.
> > > >
> > > > The function virtio_queue_empty checks if the vq has all descriptors
> > > > available, so the device has no more work to do until the driver makes
> > > > another descriptor available. I can see how it can be a bad name
> > > > choice, but virtio_queue_full means the opposite: device has pop()
> > > > every descriptor available, and it has not returned any, so the driver
> > > > cannot progress until the device marks some descriptors as used.
> > > >
> > > > As I understand, if vq->in_use >vq->num would mean we have a bug in
> > > > the device vq code, not in the driver. virtio_queue_full could even be
> > > > changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse ==
> > > > vq->vring.num", as long as vq->in_use is operated right.
> > > >
> > > > If we hit vq->in_use == vq->num in virtqueue_pop it means the device
> > > > tried to pop() one more buffer after having all of them available and
> > > > pop'ed. This would be invalid if the device is counting right the
> > > > number of in_use descriptors, but then we are duplicating that logic
> > > > in the device and the vq.
> >
> > Devices call virtqueue_pop() until it returns NULL. They don't need to
> > count virtqueue buffers explicitly. It returns NULL when vq->num
> > virtqueue buffers have already been popped (either because
> > virtio_queue_empty() is true or because an invalid driver state is
> > detected by checking vq->num in virtqueue_pop()).
> 
> If I understood you right, the virtio_queue_full addresses the reverse
> problem: it controls when the virtqueue is out of buffers to make
> available for the device because the latter has not consumed any, not
> when the driver does not offer more buffers to the device because it
> has no more data to offer.
> 
> I find it easier to explain with the virtio-net rx queue (and I think
> it's the easier way to trigger this issue). You are describing it's
> regular behavior: The guest fills it (let's say 100%), and the device
> picks buffers one by one:
> 
> virtio_net_receive_rcu:
> while (offset < size) {
>     elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));

The lines before this loop return early when the virtqueue does not have
sufficient buffer space:

  if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
      return 0;
  }

When entering this loop we know that we can pop the buffers needed to
fill one rx packet.

>     if (!elem) {
>         virtio_error("unexpected empty queue");
>     }
>     /* [1] */
>     /* fill elem with rx packet */
>     virtqueue_fill(virtqueue, elem);
>     ...
>     virtqueue_flush(q->rx_vq, i);
> }
> 
> Every device as far as I know does this buffer by buffer, there is
> just processing code in [1], and it never tries to pop more than one
> buffers/chain of buffers at the same time. In the case of a queue
> empty (no more available buffers), we hit an error, because there are
> no more buffers to write.

It's an error because we already checked that the virtqueue has buffer
space. This should never happen.

> In other devices (or tx queue), empty
> buffers means there is no more work to do, not an error.
> 
> In the case of shadow virtqueue, we cannot limit the number of exposed
> rx buffers to 1 buffer/chain of buffers in [1], since it will affect
> batching. We have the opposite problem: All devices (but rx queue)
> want to queue "as empty as possible", or "to mark all available
> buffers empty". Net rx queue is ok as long as it has a buffer/buffer
> chain big enough to write to, but it will fetch them on demand, so
> "queue full" (as in all buffers are available) is not a problem for
> the device.
> 
> However, the part of the shadow virtqueue that forwards the available
> buffer seeks the opposite: It wants as many buffers as possible to be
> available. That means that there is no [1] code that fills/read &
> flush/detach the buffer immediately: Shadow virtqueue wants to make
> available as many buffers as possible, but the device may not use them
> until it has more data available. To the extreme (virtio-net rx queue
> full), shadow virtqueue may make available all buffers, so in a
> while(true) loop, it will try to make them available until it hits
> that all the buffers are already available (vq->in_use == vq->num).
> 
> The solution is to check the number of buffers already available
> before calling virtio_queue_pop(). We could duplicate in_use in shadow
> virtqueue, of course, but everything we need is already done in
> VirtQueue code, so I think to reuse it is a better solution. Another
> solution could be to treat vq->in_use == vq->num as an special return
> code with no printed error in virtqueue_pop, but to expose if the
> queue is full (as vq->in_use == vq->num) sounds less invasive to me.
>
> >
> > > > In shadow vq this situation happens with the correct guest network
> > > > driver, since the rx queue is filled for the device to write. Network
> > > > device in qemu fetch descriptors on demand, but shadow vq fetch all
> > > > available in batching. If the driver just happens to fill the queue of
> > > > available descriptors, the log will raise, so we need to check in
> > > > handle_sw_lm_vq before calling pop(). Of course the shadow vq can
> > > > duplicate guest_vq->in_use instead of checking virtio_queue_full, but
> > > > then it needs to check two things for every virtqueue_pop() [1].
> >
> > I don't understand this scenario. It sounds like you are saying the
> > guest and shadow rx vq are not in sync so there is a case where
> > vq->in_use > vq->num is triggered?
> 
> Sorry if I explain it bad, what I meant is that there is a case where
> SVQ (as device code) will call virtqueue_pop() when vq->in_use ==
> vq->num. virtio_queue_full maintains the check as >=, I think it
> should be safe to even to code virtio_queue_full to:
> 
> assert(vq->inuse <= vq->vring.num);
> return vq->inuse == vq->vring.num;
> 
> Please let me know if this is not clear enough.

I don't get it. When virtqueue_split_pop() has popped all requests
virtio_queue_empty_rcu() should return true and we shouldn't reach if
(vq->inuse >= vq->vring.num). The guest driver cannot submit more
available buffers at this point.
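
The reasoning can be checked with a toy model of the split-ring
counters. This is a sketch under stated assumptions, not QEMU code: the
toy_ names are invented, buffers are assumed to be used in order, and
the driver is assumed to be correct (never more than num buffers
outstanding). Under those assumptions inuse == num implies that
last_avail_idx has caught up with avail_idx, so the empty check always
fires first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ToyRing {
    uint16_t num;            /* ring size */
    uint16_t avail_idx;      /* driver: buffers made available so far */
    uint16_t last_avail_idx; /* device: buffers popped so far */
    uint16_t used_idx;       /* device: buffers returned so far */
} ToyRing;

static unsigned toy_inuse(const ToyRing *r)
{
    return (uint16_t)(r->last_avail_idx - r->used_idx);
}

/* A correct driver never has more than num buffers outstanding. */
static bool toy_driver_add(ToyRing *r)
{
    if ((uint16_t)(r->avail_idx - r->used_idx) >= r->num) {
        return false;
    }
    r->avail_idx++;
    return true;
}

/* Returns 1 if a buffer was popped, 0 if empty, -1 on "size exceeded". */
static int toy_pop(ToyRing *r)
{
    if (r->avail_idx == r->last_avail_idx) {
        return 0;                       /* empty check comes first */
    }
    if (toy_inuse(r) >= r->num) {
        return -1;                      /* invalid driver state */
    }
    r->last_avail_idx++;
    return 1;
}

/* Device returns one buffer as used, if it holds any. */
static bool toy_push(ToyRing *r)
{
    if (toy_inuse(r) == 0) {
        return false;
    }
    r->used_idx++;
    return true;
}

/* Pseudo-random mix of add/pop/push; returns the number of -1 results. */
static unsigned toy_fuzz(uint16_t num, unsigned steps, uint32_t seed)
{
    ToyRing r = { .num = num };
    unsigned errors = 0;

    for (unsigned i = 0; i < steps; i++) {
        seed = seed * 1664525u + 1013904223u;   /* simple LCG */
        switch ((seed >> 16) % 3) {
        case 0:
            toy_driver_add(&r);
            break;
        case 1:
            if (toy_pop(&r) < 0) {
                errors++;
            }
            break;
        default:
            toy_push(&r);
            break;
        }
        assert(toy_inuse(&r) <= r.num);
    }
    return errors;
}
```

toy_fuzz() should report zero errors for any ring size and seed, since
the only way to reach the -1 path is a driver that exposes more buffers
than the ring can hold.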

I only checked split rings, not packed rings.

Can you point to the SVQ code which has this problem? It may be easier
to re-read the code than try to describe it in an email.

Stefan


* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2021-03-22 17:40                 ` Stefan Hajnoczi
@ 2021-03-24 19:04                   ` Eugenio Perez Martin
  2021-03-24 19:56                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Eugenio Perez Martin @ 2021-03-24 19:04 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

On Mon, Mar 22, 2021 at 6:40 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Mar 22, 2021 at 04:55:13PM +0100, Eugenio Perez Martin wrote:
> > On Mon, Mar 22, 2021 at 11:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote:
> > > > On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > > > > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > > > > > > +        while (true) {
> > > > > > > > > +            int r;
> > > > > > > > > +            if (virtio_queue_full(vq)) {
> > > > > > > > > +                break;
> > > > > > > > > +            }
> > > > > > > >
> > > > > > > > Why is this check necessary? The guest cannot provide more descriptors
> > > > > > > > than there is ring space. If that happens somehow then it's a driver
> > > > > > > > error that is already reported in virtqueue_pop() below.
> > > > > > > >
> > > > > > >
> > > > > > > It's just checked because virtqueue_pop prints an error on that case,
> > > > > > > and there is no way to tell the difference between a regular error and
> > > > > > > another caused by other causes. Maybe the right thing to do is just to
> > > > > > > not to print that error? Caller should do the error printing in that
> > > > > > > case. Should we return an error code?
> > > > > >
> > > > > > The reason an error is printed today is because it's a guest error that
> > > > > > never happens with correct guest drivers. Something is broken and the
> > > > > > user should know about it.
> > > > > >
> > > > > > Why is "virtio_queue_full" (I already forgot what that actually means,
> > > > > > it's not clear whether this is referring to avail elements or used
> > > > > > elements) a condition that should be silently ignored in shadow vqs?
> > > > > >
> > > > >
> > > > > TL;DR: It can be changed to a check of the number of available
> > > > > descriptors in shadow vq, instead of returning as a regular operation.
> > > > > However, I think that making it a special return of virtqueue_pop
> > > > > could help in devices that run to completion, avoiding having to
> > > > > duplicate the count logic in them.
> > > > >
> > > > > The function virtio_queue_empty checks if the vq has all descriptors
> > > > > available, so the device has no more work to do until the driver makes
> > > > > another descriptor available. I can see how it can be a bad name
> > > > > choice, but virtio_queue_full means the opposite: device has pop()
> > > > > every descriptor available, and it has not returned any, so the driver
> > > > > cannot progress until the device marks some descriptors as used.
> > > > >
> > > > > As I understand, if vq->in_use >vq->num would mean we have a bug in
> > > > > the device vq code, not in the driver. virtio_queue_full could even be
> > > > > changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse ==
> > > > > vq->vring.num", as long as vq->in_use is operated right.
> > > > >
> > > > > If we hit vq->in_use == vq->num in virtqueue_pop it means the device
> > > > > tried to pop() one more buffer after having all of them available and
> > > > > pop'ed. This would be invalid if the device is counting right the
> > > > > number of in_use descriptors, but then we are duplicating that logic
> > > > > in the device and the vq.
> > >
> > > Devices call virtqueue_pop() until it returns NULL. They don't need to
> > > count virtqueue buffers explicitly. It returns NULL when vq->num
> > > virtqueue buffers have already been popped (either because
> > > virtio_queue_empty() is true or because an invalid driver state is
> > > detected by checking vq->num in virtqueue_pop()).
> >
> > If I understood you right, the virtio_queue_full addresses the reverse
> > problem: it controls when the virtqueue is out of buffers to make
> > available for the device because the latter has not consumed any, not
> > when the driver does not offer more buffers to the device because it
> > has no more data to offer.
> >
> > I find it easier to explain with the virtio-net rx queue (and I think
> > it's the easier way to trigger this issue). You are describing it's
> > regular behavior: The guest fills it (let's say 100%), and the device
> > picks buffers one by one:
> >
> > virtio_net_receive_rcu:
> > while (offset < size) {
> >     elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
>
> The lines before this loop return early when the virtqueue does not have
> sufficient buffer space:
>
>   if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
>       return 0;
>   }
>
> When entering this loop we know that we can pop the buffers needed to
> fill one rx packet.
>
> >     if (!elem) {
> >         virtio_error("unexpected empty queue");
> >     }
> >     /* [1] */
> >     /* fill elem with rx packet */
> >     virtqueue_fill(virtqueue, elem);
> >     ...
> >     virtqueue_flush(q->rx_vq, i);
> > }
> >
> > Every device as far as I know does this buffer by buffer, there is
> > just processing code in [1], and it never tries to pop more than one
> > buffers/chain of buffers at the same time. In the case of a queue
> > empty (no more available buffers), we hit an error, because there are
> > no more buffers to write.
>
> It's an error because we already checked that the virtqueue has buffer
> space. This should never happen.
>
> > In other devices (or tx queue), empty
> > buffers means there is no more work to do, not an error.
> >
> > In the case of shadow virtqueue, we cannot limit the number of exposed
> > rx buffers to 1 buffer/chain of buffers in [1], since it will affect
> > batching. We have the opposite problem: All devices (but rx queue)
> > want to queue "as empty as possible", or "to mark all available
> > buffers empty". Net rx queue is ok as long as it has a buffer/buffer
> > chain big enough to write to, but it will fetch them on demand, so
> > "queue full" (as in all buffers are available) is not a problem for
> > the device.
> >
> > However, the part of the shadow virtqueue that forwards the available
> > buffer seeks the opposite: It wants as many buffers as possible to be
> > available. That means that there is no [1] code that fills/read &
> > flush/detach the buffer immediately: Shadow virtqueue wants to make
> > available as many buffers as possible, but the device may not use them
> > until it has more data available. To the extreme (virtio-net rx queue
> > full), shadow virtqueue may make available all buffers, so in a
> > while(true) loop, it will try to make them available until it hits
> > that all the buffers are already available (vq->in_use == vq->num).
> >
> > The solution is to check the number of buffers already available
> > before calling virtio_queue_pop(). We could duplicate in_use in shadow
> > virtqueue, of course, but everything we need is already done in
> > VirtQueue code, so I think to reuse it is a better solution. Another
> > solution could be to treat vq->in_use == vq->num as an special return
> > code with no printed error in virtqueue_pop, but to expose if the
> > queue is full (as vq->in_use == vq->num) sounds less invasive to me.
> >
> > >
> > > > > In shadow vq this situation happens with the correct guest network
> > > > > driver, since the rx queue is filled for the device to write. Network
> > > > > device in qemu fetch descriptors on demand, but shadow vq fetch all
> > > > > available in batching. If the driver just happens to fill the queue of
> > > > > available descriptors, the log will raise, so we need to check in
> > > > > handle_sw_lm_vq before calling pop(). Of course the shadow vq can
> > > > > duplicate guest_vq->in_use instead of checking virtio_queue_full, but
> > > > > then it needs to check two things for every virtqueue_pop() [1].
> > >
> > > I don't understand this scenario. It sounds like you are saying the
> > > guest and shadow rx vq are not in sync so there is a case where
> > > vq->in_use > vq->num is triggered?
> >
> > Sorry if I explain it bad, what I meant is that there is a case where
> > SVQ (as device code) will call virtqueue_pop() when vq->in_use ==
> > vq->num. virtio_queue_full maintains the check as >=, I think it
> > should be safe to even to code virtio_queue_full to:
> >
> > assert(vq->inuse <= vq->vring.num);
> > return vq->inuse == vq->vring.num;
> >
> > Please let me know if this is not clear enough.
>
> I don't get it. When virtqueue_split_pop() has popped all requests
> virtio_queue_empty_rcu() should return true and we shouldn't reach if
> (vq->inuse >= vq->vring.num). The guest driver cannot submit more
> available buffers at this point.
>

Hi Stefan.

After the meeting, and after reviewing the code carefully, I think you
are right. I'm not sure what I did to reproduce the issue, but I'm not
able to do it anymore, even in the conditions where I thought it was
trivially reproducible. Now I think it was caused in the previous
series by accessing the guest's vring directly.

So I will delete this commit from the series. I still need to test SVQ
with the new additions, so if the bug persists it will reproduce for
sure.

Thank you very much!

> I only checked split rings, not packed rings.
>
> Can you point to the SVQ code which has this problem? It may be easier
> to re-read the code than try to describe it in an email.
>
> Stefan




* Re: [RFC PATCH 13/27] vhost: Send buffers to device
  2021-03-24 19:04                   ` Eugenio Perez Martin
@ 2021-03-24 19:56                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-03-24 19:56 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 10097 bytes --]

On Wed, Mar 24, 2021 at 08:04:07PM +0100, Eugenio Perez Martin wrote:
> On Mon, Mar 22, 2021 at 6:40 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 04:55:13PM +0100, Eugenio Perez Martin wrote:
> > > On Mon, Mar 22, 2021 at 11:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote:
> > > > > On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote:
> > > > > > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote:
> > > > > > > > > > +        while (true) {
> > > > > > > > > > +            int r;
> > > > > > > > > > +            if (virtio_queue_full(vq)) {
> > > > > > > > > > +                break;
> > > > > > > > > > +            }
> > > > > > > > >
> > > > > > > > > Why is this check necessary? The guest cannot provide more descriptors
> > > > > > > > > than there is ring space. If that happens somehow then it's a driver
> > > > > > > > > error that is already reported in virtqueue_pop() below.
> > > > > > > > >
> > > > > > > >
> > > > > > > > > It's just checked because virtqueue_pop prints an error in that case,
> > > > > > > > > and there is no way to tell the difference between a regular error and
> > > > > > > > > one with another cause. Maybe the right thing to do is just not to
> > > > > > > > > print that error? The caller should do the error printing in that
> > > > > > > > > case. Should we return an error code?
> > > > > > >
> > > > > > > The reason an error is printed today is because it's a guest error that
> > > > > > > never happens with correct guest drivers. Something is broken and the
> > > > > > > user should know about it.
> > > > > > >
> > > > > > > Why is "virtio_queue_full" (I already forgot what that actually means,
> > > > > > > it's not clear whether this is referring to avail elements or used
> > > > > > > elements) a condition that should be silently ignored in shadow vqs?
> > > > > > >
> > > > > >
> > > > > > TL;DR: It can be changed to a check of the number of available
> > > > > > descriptors in shadow vq, instead of returning as a regular operation.
> > > > > > However, I think that making it a special return of virtqueue_pop
> > > > > > could help in devices that run to completion, avoiding having to
> > > > > > duplicate the count logic in them.
> > > > > >
> > > > > > The function virtio_queue_empty checks if the vq has all descriptors
> > > > > > available, so the device has no more work to do until the driver makes
> > > > > > another descriptor available. I can see how it can be a bad name
> > > > > > choice, but virtio_queue_full means the opposite: device has pop()
> > > > > > every descriptor available, and it has not returned any, so the driver
> > > > > > cannot progress until the device marks some descriptors as used.
> > > > > >
> > > > > > > As I understand it, vq->inuse > vq->vring.num would mean we have a
> > > > > > > bug in the device vq code, not in the driver. virtio_queue_full could
> > > > > > > even be changed to "assert(vq->inuse <= vq->vring.num); return
> > > > > > > vq->inuse == vq->vring.num", as long as vq->inuse is maintained
> > > > > > > correctly.
> > > > > >
> > > > > > If we hit vq->in_use == vq->num in virtqueue_pop it means the device
> > > > > > tried to pop() one more buffer after having all of them available and
> > > > > > pop'ed. This would be invalid if the device is counting right the
> > > > > > number of in_use descriptors, but then we are duplicating that logic
> > > > > > in the device and the vq.
> > > >
> > > > Devices call virtqueue_pop() until it returns NULL. They don't need to
> > > > count virtqueue buffers explicitly. It returns NULL when vq->num
> > > > virtqueue buffers have already been popped (either because
> > > > virtio_queue_empty() is true or because an invalid driver state is
> > > > detected by checking vq->num in virtqueue_pop()).
> > >
> > > If I understood you right, the virtio_queue_full addresses the reverse
> > > problem: it controls when the virtqueue is out of buffers to make
> > > available for the device because the latter has not consumed any, not
> > > when the driver does not offer more buffers to the device because it
> > > has no more data to offer.
> > >
> > > > I find it easier to explain with the virtio-net rx queue (and I think
> > > > it's the easiest way to trigger this issue). You are describing its
> > > > regular behavior: the guest fills it (let's say 100%), and the device
> > > > picks buffers one by one:
> > >
> > > virtio_net_receive_rcu:
> > > while (offset < size) {
> > >     elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
> >
> > The lines before this loop return early when the virtqueue does not have
> > sufficient buffer space:
> >
> >   if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
> >       return 0;
> >   }
> >
> > When entering this loop we know that we can pop the buffers needed to
> > fill one rx packet.
> >
> > >     if (!elem) {
> > >         virtio_error("unexpected empty queue");
> > >     }
> > >     /* [1] */
> > >     /* fill elem with rx packet */
> > >     virtqueue_fill(virtqueue, elem);
> > >     ...
> > >     virtqueue_flush(q->rx_vq, i);
> > > }
> > >
> > > > As far as I know, every device does this buffer by buffer: there is
> > > > just processing code in [1], and it never tries to pop more than one
> > > > buffer/chain of buffers at the same time. In the case of an empty
> > > > queue (no more available buffers), we hit an error, because there are
> > > > no more buffers to write to.
> >
> > It's an error because we already checked that the virtqueue has buffer
> > space. This should never happen.
> >
> > > In other devices (or tx queue), empty
> > > buffers means there is no more work to do, not an error.
> > >
> > > In the case of shadow virtqueue, we cannot limit the number of exposed
> > > rx buffers to 1 buffer/chain of buffers in [1], since it will affect
> > > batching. We have the opposite problem: All devices (but rx queue)
> > > want to queue "as empty as possible", or "to mark all available
> > > buffers empty". Net rx queue is ok as long as it has a buffer/buffer
> > > chain big enough to write to, but it will fetch them on demand, so
> > > "queue full" (as in all buffers are available) is not a problem for
> > > the device.
> > >
> > > However, the part of the shadow virtqueue that forwards the available
> > > buffer seeks the opposite: It wants as many buffers as possible to be
> > > available. That means that there is no [1] code that fills/read &
> > > flush/detach the buffer immediately: Shadow virtqueue wants to make
> > > available as many buffers as possible, but the device may not use them
> > > until it has more data available. To the extreme (virtio-net rx queue
> > > full), shadow virtqueue may make available all buffers, so in a
> > > while(true) loop, it will try to make them available until it hits
> > > that all the buffers are already available (vq->in_use == vq->num).
> > >
> > > > The solution is to check the number of buffers already available
> > > > before calling virtqueue_pop(). We could duplicate in_use in shadow
> > > > virtqueue, of course, but everything we need is already done in
> > > > VirtQueue code, so I think reusing it is a better solution. Another
> > > > solution could be to treat vq->in_use == vq->num as a special return
> > > > code with no printed error in virtqueue_pop, but exposing whether the
> > > > queue is full (as vq->in_use == vq->num) sounds less invasive to me.
> > >
> > > >
> > > > > > > In shadow vq this situation happens with a correct guest network
> > > > > > > driver, since the rx queue is filled for the device to write. The
> > > > > > > network device in qemu fetches descriptors on demand, but shadow vq
> > > > > > > fetches all available descriptors in a batch. If the driver just
> > > > > > > happens to fill the queue of available descriptors, the error will be
> > > > > > > logged, so we need to check in handle_sw_lm_vq before calling pop().
> > > > > > > Of course the shadow vq can duplicate guest_vq->in_use instead of
> > > > > > > checking virtio_queue_full, but then it needs to check two things for
> > > > > > > every virtqueue_pop() [1].
> > > >
> > > > I don't understand this scenario. It sounds like you are saying the
> > > > guest and shadow rx vq are not in sync so there is a case where
> > > > vq->in_use > vq->num is triggered?
> > >
> > > > Sorry if I explained it badly. What I meant is that there is a case
> > > > where SVQ (as device code) will call virtqueue_pop() when vq->in_use ==
> > > > vq->num. virtio_queue_full keeps the check as >=, but I think it
> > > > should even be safe to code virtio_queue_full as:
> > > >
> > > > assert(vq->inuse <= vq->vring.num);
> > > > return vq->inuse == vq->vring.num;
> > >
> > > Please let me know if this is not clear enough.
> >
> > I don't get it. When virtqueue_split_pop() has popped all requests
> > virtio_queue_empty_rcu() should return true and we shouldn't reach if
> > (vq->inuse >= vq->vring.num). The guest driver cannot submit more
> > available buffers at this point.
> >
> 
> Hi Stefan.
> 
> After the meeting, and after reviewing the code carefully, I think you
> are right. I'm not sure what I did to reproduce the issue, but I'm not
> able to reproduce it anymore, even under the conditions where I thought
> it was trivially reproducible. I now think it was caused in the previous
> series by accessing the guest's vring directly.
> 
> So I will delete this commit from the series. I still need to test SVQ
> with the new additions, so if the bug persists it will reproduce for
> sure.

Okay, thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


end of thread, other threads:[~2021-03-24 19:58 UTC | newest]

Thread overview: 81+ messages
2020-11-20 18:50 [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 01/27] vhost: Add vhost_dev_can_log Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 02/27] vhost: Add device callback in vhost_migration_log Eugenio Pérez
2020-12-07 16:19   ` Stefan Hajnoczi
2020-12-09 12:20     ` Eugenio Perez Martin
2020-11-20 18:50 ` [RFC PATCH 03/27] vhost: Move log resize/put to vhost_dev_set_log Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 04/27] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
2020-12-07 16:43   ` Stefan Hajnoczi
2020-12-09 12:00     ` Eugenio Perez Martin
2020-12-09 16:08       ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 05/27] vhost: Add hdev->dev.sw_lm_vq_handler Eugenio Pérez
2020-12-07 16:52   ` Stefan Hajnoczi
2020-12-09 15:02     ` Eugenio Perez Martin
2020-12-10 11:30       ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 06/27] virtio: Add virtio_queue_get_used_notify_split Eugenio Pérez
2020-12-07 16:58   ` Stefan Hajnoczi
2021-01-12 18:21     ` Eugenio Perez Martin
2021-03-02 11:22       ` Stefan Hajnoczi
2021-03-02 18:34         ` Eugenio Perez Martin
2021-03-08 10:46           ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 07/27] vhost: Route guest->host notification through qemu Eugenio Pérez
2020-12-07 17:42   ` Stefan Hajnoczi
2020-12-09 17:08     ` Eugenio Perez Martin
2020-12-10 11:50       ` Stefan Hajnoczi
2021-01-21 20:10         ` Eugenio Perez Martin
2020-11-20 18:50 ` [RFC PATCH 08/27] vhost: Add a flag for software assisted Live Migration Eugenio Pérez
2020-12-08  7:20   ` Stefan Hajnoczi
2020-12-09 17:57     ` Eugenio Perez Martin
2020-11-20 18:50 ` [RFC PATCH 09/27] vhost: Route host->guest notification through qemu Eugenio Pérez
2020-12-08  7:34   ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 10/27] vhost: Allocate shadow vring Eugenio Pérez
2020-12-08  7:49   ` Stefan Hajnoczi
2020-12-08  8:17   ` Stefan Hajnoczi
2020-12-09 18:15     ` Eugenio Perez Martin
2020-11-20 18:50 ` [RFC PATCH 11/27] virtio: const-ify all virtio_tswap* functions Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 12/27] virtio: Add virtio_queue_full Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 13/27] vhost: Send buffers to device Eugenio Pérez
2020-12-08  8:16   ` Stefan Hajnoczi
2020-12-09 18:41     ` Eugenio Perez Martin
2020-12-10 11:55       ` Stefan Hajnoczi
2021-01-22 18:18         ` Eugenio Perez Martin
     [not found]           ` <CAJaqyWdNeaboGaSsXPA8r=mUsbctFLzACFKLX55yRQpTvjqxJw@mail.gmail.com>
2021-03-22 10:51             ` Stefan Hajnoczi
2021-03-22 15:55               ` Eugenio Perez Martin
2021-03-22 17:40                 ` Stefan Hajnoczi
2021-03-24 19:04                   ` Eugenio Perez Martin
2021-03-24 19:56                     ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 14/27] virtio: Remove virtio_queue_get_used_notify_split Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 15/27] vhost: Do not invalidate signalled used Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 16/27] virtio: Expose virtqueue_alloc_element Eugenio Pérez
2020-12-08  8:25   ` Stefan Hajnoczi
2020-12-09 18:46     ` Eugenio Perez Martin
2020-12-10 11:57       ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 17/27] vhost: add vhost_vring_set_notification_rcu Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 18/27] vhost: add vhost_vring_poll_rcu Eugenio Pérez
2020-12-08  8:41   ` Stefan Hajnoczi
2020-12-09 18:48     ` Eugenio Perez Martin
2020-11-20 18:50 ` [RFC PATCH 19/27] vhost: add vhost_vring_get_buf_rcu Eugenio Pérez
2020-11-20 18:50 ` [RFC PATCH 20/27] vhost: Return used buffers Eugenio Pérez
2020-12-08  8:50   ` Stefan Hajnoczi
2020-11-20 18:50 ` [RFC PATCH 21/27] vhost: Add vhost_virtqueue_memory_unmap Eugenio Pérez
2020-11-20 18:51 ` [RFC PATCH 22/27] vhost: Add vhost_virtqueue_memory_map Eugenio Pérez
2020-11-20 18:51 ` [RFC PATCH 23/27] vhost: unmap qemu's shadow virtqueues on sw live migration Eugenio Pérez
2020-11-27 15:29   ` Stefano Garzarella
2020-11-30  7:54     ` Eugenio Perez Martin
2020-11-20 18:51 ` [RFC PATCH 24/27] vhost: iommu changes Eugenio Pérez
2020-12-08  9:02   ` Stefan Hajnoczi
2020-11-20 18:51 ` [RFC PATCH 25/27] vhost: Do not commit vhost used idx on vhost_virtqueue_stop Eugenio Pérez
2020-11-20 19:35   ` Eugenio Perez Martin
2020-11-20 18:51 ` [RFC PATCH 26/27] vhost: Add vhost_hdev_can_sw_lm Eugenio Pérez
2020-11-20 18:51 ` [RFC PATCH 27/27] vhost: forbid vhost devices logging Eugenio Pérez
2020-11-20 19:03 ` [RFC PATCH 00/27] vDPA software assisted live migration Eugenio Perez Martin
2020-11-20 19:30 ` no-reply
2020-11-25  7:08 ` Jason Wang
2020-11-25 12:03   ` Eugenio Perez Martin
2020-11-25 12:14     ` Eugenio Perez Martin
2020-11-26  3:07     ` Jason Wang
2020-11-27 15:44 ` Stefano Garzarella
2020-12-08  9:37 ` Stefan Hajnoczi
2020-12-09  9:26   ` Jason Wang
2020-12-09 15:57     ` Stefan Hajnoczi
2020-12-10  9:12       ` Jason Wang
