qemu-devel.nongnu.org archive mirror
* [PATCH 0/6] Add debug interface to kick/call on purpose
@ 2021-03-26  5:44 Dongli Zhang
  2021-03-26  5:44 ` [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event Dongli Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
the loss of doorbell kick, e.g.,

https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html

... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").

This patchset introduces a new debug interface 'DeviceEvent' in DeviceClass
to help narrow down whether such an issue is due to the loss of an irq/kick.
So far the new interface handles only two events: 'call' and 'kick'. Any
device (e.g., virtio/vhost or VFIO) may implement the interface (e.g., via
eventfd, MSI-X or legacy IRQ).

The 'call' event lets the admin inject an IRQ for a specific device (e.g.,
vhost-scsi) from the QEMU/host side into the VM on purpose, while the 'kick'
event lets the admin ring the doorbell for a specific device from the
QEMU/host side on purpose.


This interface can also be used as a workaround when a call/kick is lost due
to a virtualization software (e.g., kernel or QEMU) issue.

We may also implement the interface for VFIO PCI, e.g., writing to
VFIOPCIDevice->msi_vectors[i].interrupt would inject an IRQ into the VM on
purpose. This is considered future work once the virtio part is done.


Below is from a live crash analysis. Initially, queue=2 has count=15 in its
'kick' eventfd_ctx. Suppose there is data in the vring avail ring while no
used entries are available; we suspect this is because vhost-scsi was not
notified by the VM. To narrow down and analyze the issue, we use live crash
to dump the current eventfd counter for queue=2.

crash> eventfd_ctx ffff8f67f6bbe700
struct eventfd_ctx {
  kref = {
    refcount = {
      refs = {
        counter = 4
      }
    }
  }, 
  wqh = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    }, 
    head = {
      next = 0xffff8f841dc08e18, 
      prev = 0xffff8f841dc08e18
    }
  }, 
  count = 15, ---> eventfd is 15 !!!
  flags = 526336
}

Now we kick the doorbell of vhost-scsi queue=2 on purpose via this interface
to diagnose the issue.

{ "execute": "x-debug-device-event",
  "arguments": { "dev": "/machine/peripheral/vscsi0",
                 "event": "kick", "queue": 2 } }

The counter is increased to 16. If the hang issue is then resolved, it
indicates that something bad in the software caused the 'kick' to be lost.

crash> eventfd_ctx ffff8f67f6bbe700
struct eventfd_ctx {
  kref = {
    refcount = {
      refs = {
        counter = 4
      }
    }
  },
  wqh = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    head = {
      next = 0xffff8f841dc08e18,
      prev = 0xffff8f841dc08e18
    }
  },
  count = 16, ---> eventfd incremented to 16 !!!
  flags = 526336
}


Original RFC link:

https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html

Changed since RFC:
  - add support for more virtio/vhost pci devices
  - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this 
    mischievous command had been used
  - fix grammar error (s/lost/loss/)
  - change version to 6.1
  - fix incorrect example in qapi/qdev.json
  - manage event types with enum/array, instead of hard coding


Dongli Zhang (6):
   qdev: introduce qapi/hmp command for kick/call event
   virtio: introduce helper function for kick/call device event
   virtio-blk-pci: implement device event interface for kick/call
   virtio-scsi-pci: implement device event interface for kick/call
   vhost-scsi-pci: implement device event interface for kick/call
   virtio-net-pci: implement device event interface for kick/call

 hmp-commands.hx                 | 14 ++++++++
 hw/block/virtio-blk.c           |  9 +++++
 hw/net/virtio-net.c             |  9 +++++
 hw/scsi/vhost-scsi.c            |  6 ++++
 hw/scsi/virtio-scsi.c           |  9 +++++
 hw/virtio/vhost-scsi-pci.c      | 10 ++++++
 hw/virtio/virtio-blk-pci.c      | 10 ++++++
 hw/virtio/virtio-net-pci.c      | 10 ++++++
 hw/virtio/virtio-scsi-pci.c     | 10 ++++++
 hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
 include/hw/qdev-core.h          |  9 +++++
 include/hw/virtio/vhost-scsi.h  |  3 ++
 include/hw/virtio/virtio-blk.h  |  2 ++
 include/hw/virtio/virtio-net.h  |  3 ++
 include/hw/virtio/virtio-scsi.h |  3 ++
 include/hw/virtio/virtio.h      |  3 ++
 include/monitor/hmp.h           |  1 +
 qapi/qdev.json                  | 30 +++++++++++++++++
 softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
 19 files changed, 261 insertions(+)


I did tests with the cases below.

- virtio-blk-pci (ioeventfd on/off, iothread, live migration)
- virtio-scsi-pci (ioeventfd on/off)
- vhost-scsi-pci
- virtio-net-pci (ioeventfd on/off, vhost)

Thank you very much!

Dongli Zhang





* [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-04-07 13:40   ` Eduardo Habkost
  2021-03-26  5:44 ` [PATCH 2/6] virtio: introduce helper function for kick/call device event Dongli Zhang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
the loss of doorbell kick, e.g.,

https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html

... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").

This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
to help narrow down whether an issue is due to the loss of an irq/kick. So
far the new interface handles only two events: 'call' and 'kick'. Any
device (e.g., virtio/vhost or VFIO) may implement the interface (e.g., via
eventfd, MSI-X or legacy IRQ).

The 'call' event lets the admin inject an IRQ for a specific device (e.g.,
vhost-scsi) from the QEMU/host side into the VM on purpose, while the 'kick'
event lets the admin ring the doorbell for a specific device from the
QEMU/host side on purpose.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hmp-commands.hx        | 14 +++++++++++
 include/hw/qdev-core.h |  9 +++++++
 include/monitor/hmp.h  |  1 +
 qapi/qdev.json         | 30 ++++++++++++++++++++++
 softmmu/qdev-monitor.c | 56 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 110 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 435c591a1c..d74b895fff 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1725,3 +1725,17 @@ ERST
         .flags      = "p",
     },
 
+    {
+        .name       = "x-debug-device-event",
+        .args_type  = "dev:s,event:s,queue:l",
+        .params     = "dev event queue",
+        .help       = "generate device event for a specific device queue",
+        .cmd        = hmp_x_debug_device_event,
+        .flags      = "p",
+    },
+
+SRST
+``x-debug-device-event`` *dev* *event* *queue*
+  Generate device event *event* for a specific *queue* of *dev*
+ERST
+
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index bafc311bfa..1ea8bf23b9 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -29,9 +29,17 @@ typedef enum DeviceCategory {
     DEVICE_CATEGORY_MAX
 } DeviceCategory;
 
+enum {
+    DEVICE_EVENT_CALL,
+    DEVICE_EVENT_KICK,
+    DEVICE_EVENT_MAX
+};
+
 typedef void (*DeviceRealize)(DeviceState *dev, Error **errp);
 typedef void (*DeviceUnrealize)(DeviceState *dev);
 typedef void (*DeviceReset)(DeviceState *dev);
+typedef void (*DeviceEvent)(DeviceState *dev, int event, int queue,
+                            Error **errp);
 typedef void (*BusRealize)(BusState *bus, Error **errp);
 typedef void (*BusUnrealize)(BusState *bus);
 
@@ -132,6 +140,7 @@ struct DeviceClass {
     DeviceReset reset;
     DeviceRealize realize;
     DeviceUnrealize unrealize;
+    DeviceEvent event;
 
     /* device state */
     const VMStateDescription *vmsd;
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 605d57287a..c7795d4ba5 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
 void hmp_replay_break(Monitor *mon, const QDict *qdict);
 void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
 void hmp_replay_seek(Monitor *mon, const QDict *qdict);
+void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
 
 #endif
diff --git a/qapi/qdev.json b/qapi/qdev.json
index b83178220b..711c4a297a 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -124,3 +124,33 @@
 ##
 { 'event': 'DEVICE_DELETED',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @x-debug-device-event:
+#
+# Generate device event for a specific device queue
+#
+# @dev: device path
+#
+# @event: event (e.g., kick or call) to trigger
+#
+# @queue: queue id
+#
+# Returns: Nothing on success
+#
+# Since: 6.1
+#
+# Notes: This is used to debug VM driver hang issues. The 'kick' event
+#        sends a notification to QEMU/vhost, while the 'call' event
+#        interrupts the VM on purpose.
+#
+# Example:
+#
+# -> { "execute": "x-debug-device_event",
+#      "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
+#                     "queue": 1 } }
+# <- { "return": {} }
+#
+##
+{ 'command': 'x-debug-device-event',
+  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index a9955b97a0..bca53111fb 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -924,6 +924,62 @@ void hmp_device_del(Monitor *mon, const QDict *qdict)
     hmp_handle_error(mon, err);
 }
 
+static const char * const device_events[DEVICE_EVENT_MAX] = {
+    [DEVICE_EVENT_KICK] = "kick",
+    [DEVICE_EVENT_CALL] = "call"
+};
+
+static int get_device_event(const char *event)
+{
+    int evt;
+
+    for (evt = 0; evt < ARRAY_SIZE(device_events); evt++) {
+        if (!strcmp(device_events[evt], event)) {
+            return evt;
+        }
+    }
+
+    return -ENOENT;
+}
+
+void qmp_x_debug_device_event(const char *dev, const char *event,
+                              int64_t queue, Error **errp)
+{
+    DeviceState *device = find_device_state(dev, NULL);
+    DeviceClass *dc;
+    int evt;
+
+    if (!device) {
+        error_setg(errp, "Device %s not found", dev);
+        return;
+    }
+
+    dc = DEVICE_GET_CLASS(device);
+    if (!dc->event) {
+        error_setg(errp, "device_event is not supported");
+        return;
+    }
+
+    evt = get_device_event(event);
+    if (evt < 0) {
+        error_setg(errp, "Unsupported event %s", event);
+        return;
+    }
+
+    dc->event(device, evt, queue, errp);
+}
+
+void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict)
+{
+    const char *dev = qdict_get_str(qdict, "dev");
+    const char *event = qdict_get_str(qdict, "event");
+    int queue = qdict_get_try_int(qdict, "queue", -1);
+    Error *err = NULL;
+
+    qmp_x_debug_device_event(dev, event, queue, &err);
+    hmp_handle_error(mon, err);
+}
+
 BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
 {
     DeviceState *dev;
-- 
2.17.1




* [PATCH 2/6] virtio: introduce helper function for kick/call device event
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
  2021-03-26  5:44 ` [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-03-26  5:44 ` [PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call Dongli Zhang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

This introduces a helper function for a virtio device to kick or call.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/virtio/virtio.c         | 64 ++++++++++++++++++++++++++++++++++++++
 include/hw/virtio/virtio.h |  3 ++
 2 files changed, 67 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 07f4e60b30..e081041a75 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -30,6 +30,8 @@
 #include "sysemu/runstate.h"
 #include "standard-headers/linux/virtio_ids.h"
 
+/* #define DEBUG_VIRTIO_EVENT */
+
 /*
  * The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. This is the default, used by transports like PCI
@@ -2572,6 +2574,68 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
     virtio_irq(vq);
 }
 
+static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
+                                     Error **errp)
+{
+#ifdef DEBUG_VIRTIO_EVENT
+    printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
+           object_get_canonical_path(OBJECT(vq->vdev)),
+           vq->queue_index, eventfd);
+#endif
+
+    if (eventfd) {
+        virtio_set_isr(vq->vdev, 0x1);
+        event_notifier_set(&vq->guest_notifier);
+    } else {
+        virtio_irq(vq);
+    }
+}
+
+static void virtio_device_event_kick(VirtQueue *vq, bool eventfd,
+                                     Error **errp)
+{
+#ifdef DEBUG_VIRTIO_EVENT
+    printf("The 'kick' event is triggered for path=%s, queue=%d.\n",
+           object_get_canonical_path(OBJECT(vq->vdev)), vq->queue_index);
+#endif
+
+    virtio_queue_notify(vq->vdev, virtio_get_queue_index(vq));
+}
+
+typedef void (VirtIOEvent)(VirtQueue *vq, bool eventfd, Error **errp);
+
+static VirtIOEvent *virtio_event_funcs[DEVICE_EVENT_MAX] = {
+    [DEVICE_EVENT_CALL] = virtio_device_event_call,
+    [DEVICE_EVENT_KICK] = virtio_device_event_kick
+};
+
+void virtio_device_event(DeviceState *dev, int event, int queue,
+                         bool eventfd, Error **errp)
+{
+    struct VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    int num = virtio_get_num_queues(vdev);
+    VirtQueue *vq;
+
+    assert(event < DEVICE_EVENT_MAX);
+
+    if (vdev->broken) {
+        error_setg(errp, "Broken device");
+        return;
+    }
+
+    if (queue < 0 || queue >= num) {
+        error_setg(errp, "Invalid queue %d", queue);
+        return;
+    }
+
+    vq = &vdev->vq[queue];
+
+    if (virtio_event_funcs[event])
+        virtio_event_funcs[event](vq, eventfd, errp);
+    else
+        error_setg(errp, "The event is not supported");
+}
+
 void virtio_notify_config(VirtIODevice *vdev)
 {
     if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK))
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b7ece7a6a8..21bb13ffa6 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -210,6 +210,9 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
 void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq);
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);
 
+void virtio_device_event(DeviceState *dev, int event, int queue,
+                         bool eventfd, Error **errp);
+
 int virtio_save(VirtIODevice *vdev, QEMUFile *f);
 
 extern const VMStateInfo virtio_vmstate_info;
-- 
2.17.1




* [PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
  2021-03-26  5:44 ` [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event Dongli Zhang
  2021-03-26  5:44 ` [PATCH 2/6] virtio: introduce helper function for kick/call device event Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-03-26  5:44 ` [PATCH 4/6] virtio-scsi-pci: " Dongli Zhang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

This implements the device event interface for virtio-blk-pci.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/block/virtio-blk.c          |  9 +++++++++
 hw/virtio/virtio-blk-pci.c     | 10 ++++++++++
 include/hw/virtio/virtio-blk.h |  2 ++
 3 files changed, 21 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d28979efb8..2b3583a913 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1118,6 +1118,15 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     return 0;
 }
 
+void virtio_blk_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp)
+{
+    VirtIOBlock *s = VIRTIO_BLK(dev);
+    bool irqfd = s->dataplane_started && !s->dataplane_disabled;
+
+    virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 static void virtio_resize_cb(void *opaque)
 {
     VirtIODevice *vdev = opaque;
diff --git a/hw/virtio/virtio-blk-pci.c b/hw/virtio/virtio-blk-pci.c
index 9d5795810c..f1fc72e7f1 100644
--- a/hw/virtio/virtio-blk-pci.c
+++ b/hw/virtio/virtio-blk-pci.c
@@ -47,6 +47,15 @@ static Property virtio_blk_pci_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_blk_pci_event(DeviceState *dev, int event, int queue,
+                                 Error **errp)
+{
+    VirtIOBlkPCI *vblk = VIRTIO_BLK_PCI(dev);
+    DeviceState *vdev = DEVICE(&vblk->vdev);
+
+    virtio_blk_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
     VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(vpci_dev);
@@ -72,6 +81,7 @@ static void virtio_blk_pci_class_init(ObjectClass *klass, void *data)
 
     set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
     device_class_set_props(dc, virtio_blk_pci_properties);
+    dc->event = virtio_blk_pci_event;
     k->realize = virtio_blk_pci_realize;
     pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
     pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_BLOCK;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 29655a406d..500be01dff 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -92,5 +92,7 @@ typedef struct MultiReqBuffer {
 
 bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh);
+void virtio_blk_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp);
 
 #endif
-- 
2.17.1




* [PATCH 4/6] virtio-scsi-pci: implement device event interface for kick/call
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
                   ` (2 preceding siblings ...)
  2021-03-26  5:44 ` [PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-03-26  5:44 ` [PATCH 5/6] vhost-scsi-pci: " Dongli Zhang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

This implements the device event interface for virtio-scsi-pci.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/scsi/virtio-scsi.c           |  9 +++++++++
 hw/virtio/virtio-scsi-pci.c     | 10 ++++++++++
 include/hw/virtio/virtio-scsi.h |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 6d80730287..f437ed1a81 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -962,6 +962,15 @@ static struct SCSIBusInfo virtio_scsi_scsi_info = {
     .load_request = virtio_scsi_load_request,
 };
 
+void virtio_scsi_device_event(DeviceState *dev, int event, int queue,
+                              Error **errp)
+{
+    VirtIOSCSI *s = VIRTIO_SCSI(dev);
+    bool irqfd = s->dataplane_started && !s->dataplane_fenced;
+
+    virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 void virtio_scsi_common_realize(DeviceState *dev,
                                 VirtIOHandleOutput ctrl,
                                 VirtIOHandleOutput evt,
diff --git a/hw/virtio/virtio-scsi-pci.c b/hw/virtio/virtio-scsi-pci.c
index 97fab74236..d5db743692 100644
--- a/hw/virtio/virtio-scsi-pci.c
+++ b/hw/virtio/virtio-scsi-pci.c
@@ -43,6 +43,15 @@ static Property virtio_scsi_pci_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_scsi_pci_event(DeviceState *dev, int event, int queue,
+                                  Error **errp)
+{
+    VirtIOSCSIPCI *vscsi = VIRTIO_SCSI_PCI(dev);
+    DeviceState *vdev = DEVICE(&vscsi->vdev);
+
+    virtio_scsi_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
     VirtIOSCSIPCI *dev = VIRTIO_SCSI_PCI(vpci_dev);
@@ -82,6 +91,7 @@ static void virtio_scsi_pci_class_init(ObjectClass *klass, void *data)
     k->realize = virtio_scsi_pci_realize;
     set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
     device_class_set_props(dc, virtio_scsi_pci_properties);
+    dc->event = virtio_scsi_pci_event;
     pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
     pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI;
     pcidev_k->revision = 0x00;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 543681bc18..d1fff0eeac 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -163,4 +163,7 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp);
 int virtio_scsi_dataplane_start(VirtIODevice *s);
 void virtio_scsi_dataplane_stop(VirtIODevice *s);
 
+void virtio_scsi_device_event(DeviceState *dev, int event, int queue,
+                              Error **errp);
+
 #endif /* QEMU_VIRTIO_SCSI_H */
-- 
2.17.1




* [PATCH 5/6] vhost-scsi-pci: implement device event interface for kick/call
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
                   ` (3 preceding siblings ...)
  2021-03-26  5:44 ` [PATCH 4/6] virtio-scsi-pci: " Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-03-26  5:44 ` [PATCH 6/6] virtio-net-pci: " Dongli Zhang
  2021-03-26  7:24 ` [PATCH 0/6] Add debug interface to kick/call on purpose Jason Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

This implements the device event interface for vhost-scsi-pci.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/scsi/vhost-scsi.c           |  6 ++++++
 hw/virtio/vhost-scsi-pci.c     | 10 ++++++++++
 include/hw/virtio/vhost-scsi.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 4d70fa036b..11dd94ff92 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -163,6 +163,12 @@ static const VMStateDescription vmstate_virtio_vhost_scsi = {
     .pre_save = vhost_scsi_pre_save,
 };
 
+void vhost_scsi_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp)
+{
+    virtio_device_event(dev, event, queue, true, errp);
+}
+
 static void vhost_scsi_realize(DeviceState *dev, Error **errp)
 {
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
diff --git a/hw/virtio/vhost-scsi-pci.c b/hw/virtio/vhost-scsi-pci.c
index cb71a294fa..c7a614cb11 100644
--- a/hw/virtio/vhost-scsi-pci.c
+++ b/hw/virtio/vhost-scsi-pci.c
@@ -44,6 +44,15 @@ static Property vhost_scsi_pci_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vhost_scsi_pci_event(DeviceState *dev, int event, int queue,
+                                 Error **errp)
+{
+    VHostSCSIPCI *vscsi = VHOST_SCSI_PCI(dev);
+    DeviceState *vdev = DEVICE(&vscsi->vdev);
+
+    vhost_scsi_device_event(vdev, event, queue, errp);
+}
+
 static void vhost_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
     VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev);
@@ -70,6 +79,7 @@ static void vhost_scsi_pci_class_init(ObjectClass *klass, void *data)
     k->realize = vhost_scsi_pci_realize;
     set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
     device_class_set_props(dc, vhost_scsi_pci_properties);
+    dc->event = vhost_scsi_pci_event;
     pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
     pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI;
     pcidev_k->revision = 0x00;
diff --git a/include/hw/virtio/vhost-scsi.h b/include/hw/virtio/vhost-scsi.h
index 7dc2bdd69d..b47854a0c6 100644
--- a/include/hw/virtio/vhost-scsi.h
+++ b/include/hw/virtio/vhost-scsi.h
@@ -32,4 +32,7 @@ struct VHostSCSI {
     VHostSCSICommon parent_obj;
 };
 
+void vhost_scsi_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp);
+
 #endif
-- 
2.17.1




* [PATCH 6/6] virtio-net-pci: implement device event interface for kick/call
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
                   ` (4 preceding siblings ...)
  2021-03-26  5:44 ` [PATCH 5/6] vhost-scsi-pci: " Dongli Zhang
@ 2021-03-26  5:44 ` Dongli Zhang
  2021-03-26  7:24 ` [PATCH 0/6] Add debug interface to kick/call on purpose Jason Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26  5:44 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, jasowang, joe.jin, armbru,
	dgilbert, stefanha, pbonzini, mreitz

This implements the device event interface for virtio-net-pci.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/net/virtio-net.c            |  9 +++++++++
 hw/virtio/virtio-net-pci.c     | 10 ++++++++++
 include/hw/virtio/virtio-net.h |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 66b9ff4511..b5c3fa392c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3195,6 +3195,15 @@ static bool failover_hide_primary_device(DeviceListener *listener,
     return qatomic_read(&n->failover_primary_hidden);
 }
 
+void virtio_net_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp)
+{
+    VirtIONet *n = VIRTIO_NET(dev);
+    bool irqfd = n->vhost_started;
+
+    virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c
index aa0b3caecb..1fa5a6fe5d 100644
--- a/hw/virtio/virtio-net-pci.c
+++ b/hw/virtio/virtio-net-pci.c
@@ -46,6 +46,15 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_net_pci_event(DeviceState *dev, int event, int queue,
+                                 Error **errp)
+{
+    VirtIONetPCI *vnet = VIRTIO_NET_PCI(dev);
+    DeviceState *vdev = DEVICE(&vnet->vdev);
+
+    virtio_net_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_net_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
     DeviceState *qdev = DEVICE(vpci_dev);
@@ -77,6 +86,7 @@ static void virtio_net_pci_class_init(ObjectClass *klass, void *data)
     k->class_id = PCI_CLASS_NETWORK_ETHERNET;
     set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
     device_class_set_props(dc, virtio_net_properties);
+    dc->event = virtio_net_pci_event;
     vpciklass->realize = virtio_net_pci_realize;
 }
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 7e96d193aa..d88c9969ea 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -214,4 +214,7 @@ struct VirtIONet {
 void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
                                    const char *type);
 
+void virtio_net_device_event(DeviceState *dev, int event, int queue,
+                             Error **errp);
+
 #endif
-- 
2.17.1




* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
                   ` (5 preceding siblings ...)
  2021-03-26  5:44 ` [PATCH 6/6] virtio-net-pci: " Dongli Zhang
@ 2021-03-26  7:24 ` Jason Wang
  2021-03-26 21:16   ` Dongli Zhang
  6 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2021-03-26  7:24 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


On 2021/3/26 1:44 PM, Dongli Zhang wrote:
> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
> the loss of doorbell kick, e.g.,
>
> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>
> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>
> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
> to help narrow down if the issue is due to loss of irq/kick. So far the new
> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
> or legacy IRQ).
>
> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
> on purpose by admin at QEMU/host side for a specific device.
>
>
> This device can be used as a workaround if call/kick is lost due to
> virtualization software (e.g., kernel or QEMU) issue.
>
> We may also implement the interface for VFIO PCI, e.g., to write to
> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
> on purpose. This is considered future work once the virtio part is done.
>
>
> Below is from live crash analysis. Initially, the queue=2 has count=15 for
> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
> used available. We suspect this is because vhost-scsi was not notified by
> VM. In order to narrow down and analyze the issue, we use live crash to
> dump the current counter of eventfd for queue=2.
>
> crash> eventfd_ctx ffff8f67f6bbe700
> struct eventfd_ctx {
>    kref = {
>      refcount = {
>        refs = {
>          counter = 4
>        }
>      }
>    },
>    wqh = {
>      lock = {
>        {
>          rlock = {
>            raw_lock = {
>              val = {
>                counter = 0
>              }
>            }
>          }
>        }
>      },
>      head = {
>        next = 0xffff8f841dc08e18,
>        prev = 0xffff8f841dc08e18
>      }
>    },
>    count = 15, ---> eventfd is 15 !!!
>    flags = 526336
> }
>
> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
> with this interface.
>
> { "execute": "x-debug-device-event",
>    "arguments": { "dev": "/machine/peripheral/vscsi0",
>                   "event": "kick", "queue": 2 } }
>
> The counter is increased to 16. Suppose the hang issue is resolved, it
> indicates something bad is in software that the 'kick' is lost.


What do you mean by "software" here? And it looks to me you're testing
whether event_notifier_set() is called by virtio_queue_notify() here. If
so, I'm not sure how much value we could gain from a dedicated debug
interface like this, considering there are a lot of existing general-purpose
debugging methods like tracing or gdb. I'd say the path from
virtio_queue_notify() to event_notifier_set() is only a very small
fraction of the process of a virtqueue kick, which is unlikely to be buggy.
Considering the ioeventfd will usually be offloaded to KVM, there is more
of a chance that something is wrong in setting up the ioeventfd instead of
here. Irq is even more complicated.

I think we could not gain much from introducing a dedicated mechanism
for such a corner case.

Thanks


>
> crash> eventfd_ctx ffff8f67f6bbe700
> struct eventfd_ctx {
>    kref = {
>      refcount = {
>        refs = {
>          counter = 4
>        }
>      }
>    },
>    wqh = {
>      lock = {
>        {
>          rlock = {
>            raw_lock = {
>              val = {
>                counter = 0
>              }
>            }
>          }
>        }
>      },
>      head = {
>        next = 0xffff8f841dc08e18,
>        prev = 0xffff8f841dc08e18
>      }
>    },
>    count = 16, ---> eventfd incremented to 16 !!!
>    flags = 526336
> }
>
>
> Original RFC link:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>
> Changed since RFC:
>    - add support for more virtio/vhost pci devices
>    - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
>      mischievous command had been used
>    - fix grammar error (s/lost/loss/)
>    - change version to 6.1
>    - fix incorrect example in qapi/qdev.json
>    - manage event types with enum/array, instead of hard coding
>
>
> Dongli Zhang (6):
>     qdev: introduce qapi/hmp command for kick/call event
>     virtio: introduce helper function for kick/call device event
>     virtio-blk-pci: implement device event interface for kick/call
>     virtio-scsi-pci: implement device event interface for kick/call
>     vhost-scsi-pci: implement device event interface for kick/call
>     virtio-net-pci: implement device event interface for kick/call
>
>   hmp-commands.hx                 | 14 ++++++++
>   hw/block/virtio-blk.c           |  9 +++++
>   hw/net/virtio-net.c             |  9 +++++
>   hw/scsi/vhost-scsi.c            |  6 ++++
>   hw/scsi/virtio-scsi.c           |  9 +++++
>   hw/virtio/vhost-scsi-pci.c      | 10 ++++++
>   hw/virtio/virtio-blk-pci.c      | 10 ++++++
>   hw/virtio/virtio-net-pci.c      | 10 ++++++
>   hw/virtio/virtio-scsi-pci.c     | 10 ++++++
>   hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
>   include/hw/qdev-core.h          |  9 +++++
>   include/hw/virtio/vhost-scsi.h  |  3 ++
>   include/hw/virtio/virtio-blk.h  |  2 ++
>   include/hw/virtio/virtio-net.h  |  3 ++
>   include/hw/virtio/virtio-scsi.h |  3 ++
>   include/hw/virtio/virtio.h      |  3 ++
>   include/monitor/hmp.h           |  1 +
>   qapi/qdev.json                  | 30 +++++++++++++++++
>   softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
>   19 files changed, 261 insertions(+)
>
>
> I did tests with below cases.
>
> - virtio-blk-pci (ioeventfd on/off, iothread, live migration)
> - virtio-scsi-pci (ioeventfd on/off)
> - vhost-scsi-pci
> - virtio-net-pci (ioeventfd on/off, vhost)
>
> Thank you very much!
>
> Dongli Zhang
>
>




* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-03-26  7:24 ` [PATCH 0/6] Add debug interface to kick/call on purpose Jason Wang
@ 2021-03-26 21:16   ` Dongli Zhang
  2021-03-29  3:56     ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-03-26 21:16 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz

Hi Jason,

On 3/26/21 12:24 AM, Jason Wang wrote:
> 
> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>> the loss of doorbell kick, e.g.,
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>
>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>
>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>> or legacy IRQ).
>>
>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>> on purpose by admin at QEMU/host side for a specific device.
>>
>>
>> This device can be used as a workaround if call/kick is lost due to
>> virtualization software (e.g., kernel or QEMU) issue.
>>
>> We may also implement the interface for VFIO PCI, e.g., to write to
>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>> on purpose. This is considered future work once the virtio part is done.
>>
>>
>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>> used available. We suspect this is because vhost-scsi was not notified by
>> VM. In order to narrow down and analyze the issue, we use live crash to
>> dump the current counter of eventfd for queue=2.
>>
>> crash> eventfd_ctx ffff8f67f6bbe700
>> struct eventfd_ctx {
>>    kref = {
>>      refcount = {
>>        refs = {
>>          counter = 4
>>        }
>>      }
>>    },
>>    wqh = {
>>      lock = {
>>        {
>>          rlock = {
>>            raw_lock = {
>>              val = {
>>                counter = 0
>>              }
>>            }
>>          }
>>        }
>>      },
>>      head = {
>>        next = 0xffff8f841dc08e18,
>>        prev = 0xffff8f841dc08e18
>>      }
>>    },
>>    count = 15, ---> eventfd is 15 !!!
>>    flags = 526336
>> }
>>
>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>> with this interface.
>>
>> { "execute": "x-debug-device-event",
>>    "arguments": { "dev": "/machine/peripheral/vscsi0",
>>                   "event": "kick", "queue": 2 } }
>>
>> The counter is increased to 16. Suppose the hang issue is resolved, it
>> indicates something bad is in software that the 'kick' is lost.
> 
> 
> What do you mean by "software" here? And it looks to me you're testing whether
> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
> sure how much value we could gain from a dedicated debug interface like this,
> considering there are a lot of existing general-purpose debugging methods like
> tracing or gdb. I'd say the path from virtio_queue_notify() to
> event_notifier_set() is only a very small fraction of the process of a
> virtqueue kick, which is unlikely to be buggy. Considering the ioeventfd will
> usually be offloaded to KVM, there is more of a chance that something is wrong
> in setting up the ioeventfd instead of here. Irq is even more complicated.

Thank you very much!

I am not testing whether event_notifier_set() is called by virtio_queue_notify().

The 'software' here means the data processing and event notification
mechanism involved with the virtio/vhost PV driver frontend. E.g., the VM is
waiting for an extra IRQ but the vhost side did not trigger one, say because
vring_need_event() erroneously returned false due to a corrupted ring buffer
status.
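
For reference, vring_need_event() is the standard event-index check from the
virtio spec (as in Linux's include/uapi/linux/virtio_ring.h, shown here with
stdint types); if its inputs are corrupted, the notification is silently
suppressed:

/* Return non-zero if the other side should be notified. A corrupted
 * event_idx or old index can make this wrongly return 0, so the
 * kick/call is never sent. */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}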

This was initially proposed for vhost only, and I was going to export
ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized it
would be better to implement this in QEMU.

QEMU inits the eventfds (ioeventfd and irqfd) and offloads them to
KVM/vhost. The VM side sends requests to the ring buffer and kicks the
doorbell (via ioeventfd), while the backend vhost side sends responses back
and calls the IRQ (via irqfd).
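
For context, a simplified sketch of the kick-side offload (illustrative;
the real wiring lives in the virtio-pci ioeventfd code, and the
MemoryRegion and offset names here are placeholders):

/* Register the queue's host notifier with the memory API so KVM
 * completes the guest's doorbell write in-kernel and signals the
 * eventfd; vhost then polls that same eventfd as its kick source. */
EventNotifier *notifier = virtio_queue_get_host_notifier(vq);

memory_region_add_eventfd(notify_mr,    /* BAR region with the doorbell */
                          notify_addr,  /* queue notify offset */
                          2, false, 0, notifier);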

Unfortunately, sometimes there is an issue with virtio/vhost such that a
kick/call is missed/ignored, or even never triggered. The example mentioned
in the patchset cover letter is with virtio-net (I assume vhost=on), where a
kick to the ioeventfd was ignored due to a pci-bridge/hotplug issue.

The hotplug window is very small, but the IO hangs permanently. I did test
that kicking the doorbell again helps recover the IO, so I was able to
conclude this was due to a loss of kick.

The loss of an irq/doorbell is painful, especially in a production
environment where we are not able to attach to QEMU via gdb. While the
patchset is only for QEMU, the Xen PV driver used to experience loss-of-IRQ
issues as well, e.g., Linux kernel commit 4704fe4f03a5 ("xen/events: mask
events when changing their VCPU binding").

This can help "narrow down" whether an IO/networking hang is due to a
loss-of-IRQ/doorbell issue (or the VM MSI-X being erroneously masked),
especially in a production env. This can also be used as a workaround so
that the VM owner will not need to reboot the VM.

In addition, VFIO will benefit from it. We will be able to test whether
injecting an IRQ on purpose helps when driver (e.g., PCI, NVMe, Ethernet)
developers blame a hang on the loss of an IRQ with KVM/VFIO. (It seems there
is more chance to lose an IRQ during CPU hotplug or when changing IRQ
affinity.)

> 
> I think we could not gain much from introducing a dedicated mechanism for such a
> corner case.

As Dave replied to the prior RFC, QEMU already supports hmp_ioport_write to
trigger an ioport write on purpose.
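
(That is the existing HMP 'o' command. For example, assuming a legacy
virtio-pci device with its I/O BAR at the made-up address 0xc040, writing
the queue index to the VIRTIO_PCI_QUEUE_NOTIFY register at offset 0x10
would kick queue 2: "o /w 0xc050 2".)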

The Linux block layer also supports the below to kick the IO queue on purpose.

echo "kick" > /sys/kernel/debug/block/sda/state

Dongli Zhang

> 
> Thanks
> 
> 
>>
>> crash> eventfd_ctx ffff8f67f6bbe700
>> struct eventfd_ctx {
>>    kref = {
>>      refcount = {
>>        refs = {
>>          counter = 4
>>        }
>>      }
>>    },
>>    wqh = {
>>      lock = {
>>        {
>>          rlock = {
>>            raw_lock = {
>>              val = {
>>                counter = 0
>>              }
>>            }
>>          }
>>        }
>>      },
>>      head = {
>>        next = 0xffff8f841dc08e18,
>>        prev = 0xffff8f841dc08e18
>>      }
>>    },
>>    count = 16, ---> eventfd incremented to 16 !!!
>>    flags = 526336
>> }
>>
>>
>> Original RFC link:
>>
>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>
>> Changed since RFC:
>>    - add support for more virtio/vhost pci devices
>>    - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
>>      mischievous command had been used
>>    - fix grammar error (s/lost/loss/)
>>    - change version to 6.1
>>    - fix incorrect example in qapi/qdev.json
>>    - manage event types with enum/array, instead of hard coding
>>
>>
>> Dongli Zhang (6):
>>     qdev: introduce qapi/hmp command for kick/call event
>>     virtio: introduce helper function for kick/call device event
>>     virtio-blk-pci: implement device event interface for kick/call
>>     virtio-scsi-pci: implement device event interface for kick/call
>>     vhost-scsi-pci: implement device event interface for kick/call
>>     virtio-net-pci: implement device event interface for kick/call
>>
>>   hmp-commands.hx                 | 14 ++++++++
>>   hw/block/virtio-blk.c           |  9 +++++
>>   hw/net/virtio-net.c             |  9 +++++
>>   hw/scsi/vhost-scsi.c            |  6 ++++
>>   hw/scsi/virtio-scsi.c           |  9 +++++
>>   hw/virtio/vhost-scsi-pci.c      | 10 ++++++
>>   hw/virtio/virtio-blk-pci.c      | 10 ++++++
>>   hw/virtio/virtio-net-pci.c      | 10 ++++++
>>   hw/virtio/virtio-scsi-pci.c     | 10 ++++++
>>   hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
>>   include/hw/qdev-core.h          |  9 +++++
>>   include/hw/virtio/vhost-scsi.h  |  3 ++
>>   include/hw/virtio/virtio-blk.h  |  2 ++
>>   include/hw/virtio/virtio-net.h  |  3 ++
>>   include/hw/virtio/virtio-scsi.h |  3 ++
>>   include/hw/virtio/virtio.h      |  3 ++
>>   include/monitor/hmp.h           |  1 +
>>   qapi/qdev.json                  | 30 +++++++++++++++++
>>   softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
>>   19 files changed, 261 insertions(+)
>>
>>
>> I did tests with below cases.
>>
>> - virtio-blk-pci (ioeventfd on/off, iothread, live migration)
>> - virtio-scsi-pci (ioeventfd on/off)
>> - vhost-scsi-pci
>> - virtio-net-pci (ioeventfd on/off, vhost)
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>
> 



* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-03-26 21:16   ` Dongli Zhang
@ 2021-03-29  3:56     ` Jason Wang
  2021-03-30  7:29       ` Dongli Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2021-03-29  3:56 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


On 2021/3/27 5:16 AM, Dongli Zhang wrote:
> Hi Jason,
>
> On 3/26/21 12:24 AM, Jason Wang wrote:
>> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>> the loss of doorbell kick, e.g.,
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>
>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>
>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>> or legacy IRQ).
>>>
>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>> on purpose by admin at QEMU/host side for a specific device.
>>>
>>>
>>> This device can be used as a workaround if call/kick is lost due to
>>> virtualization software (e.g., kernel or QEMU) issue.
>>>
>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>> on purpose. This is considered future work once the virtio part is done.
>>>
>>>
>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>> used available. We suspect this is because vhost-scsi was not notified by
>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>> dump the current counter of eventfd for queue=2.
>>>
>>> crash> eventfd_ctx ffff8f67f6bbe700
>>> struct eventfd_ctx {
>>>     kref = {
>>>       refcount = {
>>>         refs = {
>>>           counter = 4
>>>         }
>>>       }
>>>     },
>>>     wqh = {
>>>       lock = {
>>>         {
>>>           rlock = {
>>>             raw_lock = {
>>>               val = {
>>>                 counter = 0
>>>               }
>>>             }
>>>           }
>>>         }
>>>       },
>>>       head = {
>>>         next = 0xffff8f841dc08e18,
>>>         prev = 0xffff8f841dc08e18
>>>       }
>>>     },
>>>     count = 15, ---> eventfd is 15 !!!
>>>     flags = 526336
>>> }
>>>
>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>> with this interface.
>>>
>>> { "execute": "x-debug-device-event",
>>>     "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>                    "event": "kick", "queue": 2 } }
>>>
>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>> indicates something bad is in software that the 'kick' is lost.
>> What do you mean by "software" here? And it looks to me you're testing whether
>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>> sure how much value we could gain from a dedicated debug interface like this,
>> considering there are a lot of existing general-purpose debugging methods like
>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>> event_notifier_set() is only a very small fraction of the process of a
>> virtqueue kick, which is unlikely to be buggy. Considering the ioeventfd will
>> usually be offloaded to KVM, there is more of a chance that something is wrong
>> in setting up the ioeventfd instead of here. Irq is even more complicated.
> Thank you very much!
>
> I am not testing whether event_notifier_set() is called by virtio_queue_notify().
>
> The 'software' indicates the data processing and event notification mechanism
> involved with virtio/vhost PV driver frontend. E.g., while VM is waiting for an
> extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event()
> erroneously returns false due to corrupted ring buffer status.


So there could be several factors that may block the notification:

1) eventfd bug (ioeventfd vs irqfd)
2) wrong virtqueue state (either driver or device)
3) missing barriers (either driver or device)
4) Qemu bug (irqchip and routing)
...

Considering we want to debug virtio issues, only 2) or 3) is what we really
care about.

So for kick you did (assume vhost is on):

virtio_device_event_kick()
     virtio_queue_notify()
         event_notifier_set()

It looks to me you're actually testing if the ioeventfd is correctly set by
Qemu.

For call you did:

virtio_device_event_call()
     event_notifier_set()

A test of whether the irqfd is correctly set by Qemu. So all of those are
not virtio-specific stuff, but you introduce a virtio-specific command to
debug non-virtio functions.

In the case of what you mentioned for vring_need_event(), what we really
want is to dump the virtqueue state from the guest. This might require
some work to extend the virtio spec (e.g., to dump device state like
indices, event, wrap counters).
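
Roughly the per-virtqueue state such a facility would need to expose (the
struct and field names are illustrative, not from the spec):

/* Split-ring state useful for debugging a missing notification;
 * a packed ring would additionally need the wrap counters. */
struct vq_debug_state {
    uint16_t avail_idx;      /* driver's avail index */
    uint16_t used_idx;       /* device's used index */
    uint16_t avail_event;    /* kick threshold published by the device */
    uint16_t used_event;     /* call threshold published by the driver */
    uint16_t last_avail_idx; /* device's private view of avail */
};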


> This was initially proposed for vhost only and I was going to export
> ioeventfd/irqfd from vhost to admin via sysfs. Finally, I realized I would
> better implement this at QEMU.
>
> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
> KVM/vhost. The VM side sends requests to ring buffer and kicks the doorbell (via
> ioeventfd), while the backend vhost side sends responses back and calls the IRQ
> (via ioeventfd).
>
> Unfortunately, sometimes there is issue with virtio/vhost so that kick/call was
> missed/ignored, or even never triggered. The example mentioned in the patchset
> cover letter is with virtio-net (I assume vhost=on), where a kick to ioventfd
> was ignored, due to pci-bridge/hotplug issue.


So this is not a good example, since it was a chipset bug. You need to
use another tool to debug chipset code, don't you?


> The hotplug is with a very small window but the IO hangs permanently. I did test
> that kicking the doorbell again will help recover the IO, so that I would be
> able to conclude this was due to lost of kick.
>
> The loss of irq/doorbell is painful especially in production environment where
> we are not able to attach to QEMU via gdb. While the patchset is only for QEMU,
> Xen PV driver used to experience loss of IRQ issue as well, e.g., linux kernel
> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU binding").


So looking at the git history, we can see there is little possibility that
the miss is due to virtio/vhost itself. So the commit you mention
here is not a good example either, since it's not a netfront/netbackend bug.

So for the case of event call, what you did is:

static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
                                      Error **errp)
{
#ifdef DEBUG_VIRTIO_EVENT
     printf("The 'call' event is triggered for path=%s, queue=%d, 
irqfd=%d.\n",
            object_get_canonical_path(OBJECT(vq->vdev)),
            vq->queue_index, eventfd);
#endif

     if (eventfd) {
         virtio_set_isr(vq->vdev, 0x1);
         event_notifier_set(&vq->guest_notifier);
     } else {
         virtio_irq(vq);
     }
}

This means that when eventfd is set, you bypass the MSI mask, which is very
dangerous if it is used in a production environment. And even if
you check masking, it won't help a lot if the MSI is masked wrongly.
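
A minimal sketch of the kind of mask check being suggested, assuming the
virtio-pci proxy and vector number are at hand (msix_enabled() and
msix_is_masked() are existing QEMU helpers; the surrounding plumbing is an
assumption):

/* Respect the guest's MSI-X mask instead of bypassing it. */
if (msix_enabled(&proxy->pci_dev) &&
    msix_is_masked(&proxy->pci_dev, vector)) {
    error_setg(errp, "MSI-X vector %u is masked by the guest", vector);
    return;
}
virtio_set_isr(vq->vdev, 0x1);
event_notifier_set(&vq->guest_notifier);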


> This can help "narrow down" if the IO/networking hang is due to loss of
> IRQ/doorbell issue (or VM MSI-x is erroneously masked), especially in production
> env. This can also be used as a workaround so that VM owner will not need to
> reboot VM.


So having such an extra workaround is pretty dangerous in a production
environment, where I think we need to be conservative, which means we need
to collect information instead of generating artificial events.

And it doesn't help if the workaround can be triggered through a
management API.


>
> In addition, the VFIO will benefit from it. We will be able to test if to inject
> IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet) developers
> blame the hang is caused by loss of IRQ with KVM/VFIO.(It seems there is more
> chance to loose IRQ during CPU hotplug or changing IRQ affinity).
>
>> I think we could not gain much from introducing a dedicated mechanism for such a
>> corner case.
> As replied by Dave for prior RFC, the QEMU already supports hmp_ioport_write to
> trigger an ioport write on purpose.


If that applies, I would rather have a hmp_mem_write; then we can test
both MSI and doorbell. But again, they are very dangerous to use in a
production environment.


>
> The linux block layer also supports the below to kick the IO queue on purpose.
>
> echo "kick" > /sys/kernel/debug/block/sda/state


This might be fine for a hardware device but not virtio. The device can
choose to poll the virtqueue instead of depending on the doorbell to
work. And for the networking subsystem, we don't have such stuff; instead,
ethtool supports dumping rings and vendor-specific stuff, which could be
used for dumping virtqueue state in this case.

Thanks


>
> Dongli Zhang
>
>> Thanks
>>
>>
>>> crash> eventfd_ctx ffff8f67f6bbe700
>>> struct eventfd_ctx {
>>>     kref = {
>>>       refcount = {
>>>         refs = {
>>>           counter = 4
>>>         }
>>>       }
>>>     },
>>>     wqh = {
>>>       lock = {
>>>         {
>>>           rlock = {
>>>             raw_lock = {
>>>               val = {
>>>                 counter = 0
>>>               }
>>>             }
>>>           }
>>>         }
>>>       },
>>>       head = {
>>>         next = 0xffff8f841dc08e18,
>>>         prev = 0xffff8f841dc08e18
>>>       }
>>>     },
>>>     count = 16, ---> eventfd incremented to 16 !!!
>>>     flags = 526336
>>> }
>>>
>>>
>>> Original RFC link:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>
>>> Changed since RFC:
>>>     - add support for more virtio/vhost pci devices
>>>     - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
>>>       mischievous command had been used
>>>     - fix grammar error (s/lost/loss/)
>>>     - change version to 6.1
>>>     - fix incorrect example in qapi/qdev.json
>>>     - manage event types with enum/array, instead of hard coding
>>>
>>>
>>> Dongli Zhang (6):
>>>      qdev: introduce qapi/hmp command for kick/call event
>>>      virtio: introduce helper function for kick/call device event
>>>      virtio-blk-pci: implement device event interface for kick/call
>>>      virtio-scsi-pci: implement device event interface for kick/call
>>>      vhost-scsi-pci: implement device event interface for kick/call
>>>      virtio-net-pci: implement device event interface for kick/call
>>>
>>>    hmp-commands.hx                 | 14 ++++++++
>>>    hw/block/virtio-blk.c           |  9 +++++
>>>    hw/net/virtio-net.c             |  9 +++++
>>>    hw/scsi/vhost-scsi.c            |  6 ++++
>>>    hw/scsi/virtio-scsi.c           |  9 +++++
>>>    hw/virtio/vhost-scsi-pci.c      | 10 ++++++
>>>    hw/virtio/virtio-blk-pci.c      | 10 ++++++
>>>    hw/virtio/virtio-net-pci.c      | 10 ++++++
>>>    hw/virtio/virtio-scsi-pci.c     | 10 ++++++
>>>    hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
>>>    include/hw/qdev-core.h          |  9 +++++
>>>    include/hw/virtio/vhost-scsi.h  |  3 ++
>>>    include/hw/virtio/virtio-blk.h  |  2 ++
>>>    include/hw/virtio/virtio-net.h  |  3 ++
>>>    include/hw/virtio/virtio-scsi.h |  3 ++
>>>    include/hw/virtio/virtio.h      |  3 ++
>>>    include/monitor/hmp.h           |  1 +
>>>    qapi/qdev.json                  | 30 +++++++++++++++++
>>>    softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
>>>    19 files changed, 261 insertions(+)
>>>
>>>
>>> I did tests with below cases.
>>>
>>> - virtio-blk-pci (ioeventfd on/off, iothread, live migration)
>>> - virtio-scsi-pci (ioeventfd on/off)
>>> - vhost-scsi-pci
>>> - virtio-net-pci (ioeventfd on/off, vhost)
>>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>>>
>>>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-03-29  3:56     ` Jason Wang
@ 2021-03-30  7:29       ` Dongli Zhang
  2021-04-02  3:47         ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-03-30  7:29 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz



On 3/28/21 8:56 PM, Jason Wang wrote:
> 
> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>> Hi Jason,
>>
>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>> the loss of doorbell kick, e.g.,
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>
>>>>
>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>
>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>> or legacy IRQ).
>>>>
>>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>
>>>>
>>>> This device can be used as a workaround if call/kick is lost due to
>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>
>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>> on purpose. This is considered future work once the virtio part is done.
>>>>
>>>>
>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>> dump the current counter of eventfd for queue=2.
>>>>
>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>> struct eventfd_ctx {
>>>>     kref = {
>>>>       refcount = {
>>>>         refs = {
>>>>           counter = 4
>>>>         }
>>>>       }
>>>>     },
>>>>     wqh = {
>>>>       lock = {
>>>>         {
>>>>           rlock = {
>>>>             raw_lock = {
>>>>               val = {
>>>>                 counter = 0
>>>>               }
>>>>             }
>>>>           }
>>>>         }
>>>>       },
>>>>       head = {
>>>>         next = 0xffff8f841dc08e18,
>>>>         prev = 0xffff8f841dc08e18
>>>>       }
>>>>     },
>>>>     count = 15, ---> eventfd is 15 !!!
>>>>     flags = 526336
>>>> }
>>>>
>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>> with this interface.
>>>>
>>>> { "execute": "x-debug-device-event",
>>>>     "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>                    "event": "kick", "queue": 2 } }
>>>>
>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>> indicates something bad is in software that the 'kick' is lost.
>>> What do you mean by "software" here? And it looks to me you're testing whether
>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>> sure how much value we could gain from a dedicated debug interface like this,
>>> considering there are a lot of existing general-purpose debugging methods like
>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>> event_notifier_set() is only a very small fraction of the process of a virtqueue
>>> kick, which is unlikely to be buggy. Considering usually the ioeventfd will be
>>> offloaded to KVM, there's more of a chance that something is wrong in setting up
>>> the ioeventfd instead of here. Irq is even more complicated.
>> Thank you very much!
>>
>> I am not testing whether event_notifier_set() is called by virtio_queue_notify().
>>
>> The 'software' indicates the data processing and event notification mechanism
>> involved with the virtio/vhost PV driver frontend. E.g., while the VM is waiting
>> for an extra IRQ, the vhost side did not trigger the IRQ; suppose
>> vring_need_event() erroneously returns false due to a corrupted ring buffer status.
> 
> 
> So there could be several factors that may block the notification:
> 
> 1) eventfd bug (ioeventfd vs irqfd)
> 2) wrong virtqueue state (either driver or device)
> 3) missing barriers (either driver or device)
> 4) Qemu bug (irqchip and routing)
> ...

This is not only about whether the notification is blocked.

It can also be used to help narrow down and understand if there is any
suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
drivers following the virtio spec; they are closely related to many other
kernel components.

Suppose IO is recovered after we inject an IRQ to vhost-scsi on purpose. We
will then be able to analyze what may happen along the IO completion path,
starting from when/where the IRQ is injected ... perhaps the root cause is not
with virtio but blk-mq/scsi (this is just an example).
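
For reference, the injection with this interface is a single QMP command,
mirroring the 'kick' example from the cover letter (the queue number here is
just an example):

{ "execute": "x-debug-device-event",
  "arguments": { "dev": "/machine/peripheral/vscsi0",
                 "event": "call", "queue": 2 } }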


In addition, this idea should help for vfio-pci. Suppose the developer of a
specific device driver suspects IO/networking hangs because of loss of IRQ; we
will be able to verify if that assumption is correct by injecting an IRQ on
purpose.
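
A rough sketch of what that future VFIO hook could look like (the function
name is hypothetical; VFIOMSIVector already carries an EventNotifier per MSI
vector):

/* Hypothetical: signal the eventfd backing MSI vector 'nr' so that KVM
 * injects the corresponding interrupt into the guest. */
static void vfio_device_event_call(VFIOPCIDevice *vdev, int nr)
{
    event_notifier_set(&vdev->msi_vectors[nr].interrupt);
}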

Therefore, this is not only about the virtio PV driver (frontend/backend), but
is also used to help analyze issues related to the entire
IO/networking/passthrough virtualization stack, especially in a production env
where the issue can only be reproduced randomly.

> 
> Consider we want to debug a virtio issue; only 2) or 3) is what we really care about.
> So for kick you did (assume vhost is on):
> 
> virtio_device_event_kick()
>     virtio_queue_notify()
>         event_notifier_set()
> 
> It looks to me you're actually testing if the ioeventfd is correctly set by Qemu.
> 
> For call you did:
> 
> virtio_device_event_call()
>     event_notifier_set()
> 
> A test of whether the irqfd is correctly set by Qemu. So all of those are not
> virtio-specific, yet you introduce a virtio-specific command to debug non-virtio
> functions.
> 
> In the case of what you mentioned for vring_need_event(), what we really want is
> to dump the virtqueue state from the guest. This might require some work on
> extending the virtio spec (e.g. to dump device status like indices, event, wrap
> counters).

Suppose the issue is only randomly reproducible in a production env; we should
always take 4) into consideration because we will not be able to know where
the root cause is at the very beginning of the bug analysis.

> 
> 
>> This was initially proposed for vhost only and I was going to export
>> ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized it would
>> be better to implement this in QEMU.
>>
>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>> KVM/vhost. The VM side sends requests to the ring buffer and kicks the doorbell
>> (via ioeventfd), while the backend vhost side sends responses back and calls the
>> IRQ (via irqfd).
>>
>> Unfortunately, sometimes there is an issue with virtio/vhost so that a kick/call
>> was missed/ignored, or even never triggered. The example mentioned in the patchset
>> cover letter is with virtio-net (I assume vhost=on), where a kick to the ioeventfd
>> was ignored, due to a pci-bridge/hotplug issue.
> 
> 
> So this is not a good example since it was a chipset bug. You need to use other
> tools to debug chipset code, don't you?

As this issue is reproducible only randomly, we will not be able to realize it
is a chipset bug at the very beginning.

While the link is about vhost-net, it is applicable to vhost-scsi as well.
Suppose DEBUG_UNASSIGNED is not enabled; the developer will need to investigate
all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
and pci-hotplug, in order to identify the root cause.

The "call/kick" interface is used to narrow down and verify the analysis,
especially when many kernel components are involved.

> 
> 
>> The hotplug is within a very small window but the IO hangs permanently. I did
>> test that kicking the doorbell again will help recover the IO, so that I would
>> be able to conclude this was due to loss of kick.
>>
>> The loss of irq/doorbell is painful especially in a production environment where
>> we are not able to attach to QEMU via gdb. While the patchset is only for QEMU,
>> Xen PV drivers used to experience loss-of-IRQ issues as well, e.g., linux kernel
>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU binding").
> 
> 
> So looking at the git history we can see there is little possibility that the
> missing event is due to virtio/vhost itself. So the commit you mention here is
> not a good example either, since it's not a netfront/netbackend bug.

I mentioned the xen issue just to explain that loss-of-event/irq issues may
happen and are very painful. Another xen example (equivalent to KVM VFIO) is:

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7

That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.

> 
> So for the case of event call, what you did is:
> 
> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>                                      Error **errp)
> {
> #ifdef DEBUG_VIRTIO_EVENT
>     printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
>            object_get_canonical_path(OBJECT(vq->vdev)),
>            vq->queue_index, eventfd);
> #endif
> 
>     if (eventfd) {
>         virtio_set_isr(vq->vdev, 0x1);
>         event_notifier_set(&vq->guest_notifier);
>     } else {
>         virtio_irq(vq);
>     }
> }
> 
> This means, when eventfd is set, you bypass the MSI mask, which is very
> dangerous if it is used in a production environment. And if you
> check masking, it won't help a lot if the MSI is masked wrongly.

You are right.

Only the vhost-net is dangerous because it masks a vector by registering an
alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi will
un-register the guest notifier.

> 
> 
>> This can help "narrow down" whether the IO/networking hang is due to a
>> loss-of-IRQ/doorbell issue (or the VM's MSI-X being erroneously masked),
>> especially in a production env. This can also be used as a workaround so that
>> the VM owner will not need to reboot the VM.
> 
> 
> So having such an extra workaround is pretty dangerous in a production
> environment where I think we need to be conservative, which means we need to
> collect information instead of generating artificial events.
>
> And it doesn't help if the workaround can be triggered through a management API.

I agree with this.

This depends on the administrator. This workaround should only be used in very
limited and special cases.

> 
> 
>>
>> In addition, VFIO will benefit from it. We will be able to test whether injecting
>> an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet) developers
>> blame the hang on loss of IRQ with KVM/VFIO. (It seems there is more
>> chance to lose an IRQ during CPU hotplug or changing IRQ affinity.)
>>
>>> I think we could not gain much from introducing a dedicated mechanism for such a
>>> corner case.
>> As replied by Dave for the prior RFC, QEMU already supports hmp_ioport_write to
>> trigger an ioport write on purpose.
> 
> 
> If that applies, I would rather have an hmp_mem_write; then we can test both MSI
> and doorbell. But again, they are very dangerous to be used in a production
> environment.

This is just not convenient for a production env administrator. We will need to
first obtain the virtio pci info (e.g., via "lspci -vv"), and then prepare
the command after calculating the address of the doorbell.

Something bad may happen if the developer does not calculate the address correctly.

It should be much easier for the developer to just ask the administrator to "call"
queue X for a specific virtio device.

> 
> 
>>
>> The linux block layer also supports the below to kick the IO queue on purpose.
>>
>> echo "kick" > /sys/kernel/debug/block/sda/state
> 
> 
> This might be fine for a hardware device but not for virtio. The device can
> choose to poll the virtqueue instead of depending on the doorbell to work. And
> for the networking subsystem, we don't have such stuff; instead, ethtool supports
> dumping the ring and vendor-specific state, which could be used for dumping the
> virtqueue state in this case.

This is just another example to help explain the philosophy behind the
"kick/call" idea: sometimes triggering the event on purpose will help us narrow
down and verify our analysis of a kernel bug, especially a bug that is only
randomly reproducible in a production environment.


I understand it is possibly not proper to introduce such an interface to QEMU.
That's why I initially sent out the RFC.

https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html

In my opinion, this interface is pretty useful when the diagnosis involves many
kernel components, or when developers from different components are working on
the same bug, no matter whether the root cause is in virtio or not.


Thank you very much!

Dongli Zhang


> 
> Thanks
> 
> 
>>
>> Dongli Zhang
>>
>>> Thanks
>>>
>>>
>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>> struct eventfd_ctx {
>>>>     kref = {
>>>>       refcount = {
>>>>         refs = {
>>>>           counter = 4
>>>>         }
>>>>       }
>>>>     },
>>>>     wqh = {
>>>>       lock = {
>>>>         {
>>>>           rlock = {
>>>>             raw_lock = {
>>>>               val = {
>>>>                 counter = 0
>>>>               }
>>>>             }
>>>>           }
>>>>         }
>>>>       },
>>>>       head = {
>>>>         next = 0xffff8f841dc08e18,
>>>>         prev = 0xffff8f841dc08e18
>>>>       }
>>>>     },
>>>>     count = 16, ---> eventfd incremented to 16 !!!
>>>>     flags = 526336
>>>> }
>>>>
>>>>
>>>> Original RFC link:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>>
>>>>
>>>> Changed since RFC:
>>>>     - add support for more virtio/vhost pci devices
>>>>     - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
>>>>       mischievous command had been used
>>>>     - fix grammar error (s/lost/loss/)
>>>>     - change version to 6.1
>>>>     - fix incorrect example in qapi/qdev.json
>>>>     - manage event types with enum/array, instead of hard coding
>>>>
>>>>
>>>> Dongli Zhang (6):
>>>>      qdev: introduce qapi/hmp command for kick/call event
>>>>      virtio: introduce helper function for kick/call device event
>>>>      virtio-blk-pci: implement device event interface for kick/call
>>>>      virtio-scsi-pci: implement device event interface for kick/call
>>>>      vhost-scsi-pci: implement device event interface for kick/call
>>>>      virtio-net-pci: implement device event interface for kick/call
>>>>
>>>>    hmp-commands.hx                 | 14 ++++++++
>>>>    hw/block/virtio-blk.c           |  9 +++++
>>>>    hw/net/virtio-net.c             |  9 +++++
>>>>    hw/scsi/vhost-scsi.c            |  6 ++++
>>>>    hw/scsi/virtio-scsi.c           |  9 +++++
>>>>    hw/virtio/vhost-scsi-pci.c      | 10 ++++++
>>>>    hw/virtio/virtio-blk-pci.c      | 10 ++++++
>>>>    hw/virtio/virtio-net-pci.c      | 10 ++++++
>>>>    hw/virtio/virtio-scsi-pci.c     | 10 ++++++
>>>>    hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
>>>>    include/hw/qdev-core.h          |  9 +++++
>>>>    include/hw/virtio/vhost-scsi.h  |  3 ++
>>>>    include/hw/virtio/virtio-blk.h  |  2 ++
>>>>    include/hw/virtio/virtio-net.h  |  3 ++
>>>>    include/hw/virtio/virtio-scsi.h |  3 ++
>>>>    include/hw/virtio/virtio.h      |  3 ++
>>>>    include/monitor/hmp.h           |  1 +
>>>>    qapi/qdev.json                  | 30 +++++++++++++++++
>>>>    softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
>>>>    19 files changed, 261 insertions(+)
>>>>
>>>>
>>>> I did tests with below cases.
>>>>
>>>> - virtio-blk-pci (ioeventfd on/off, iothread, live migration)
>>>> - virtio-scsi-pci (ioeventfd on/off)
>>>> - vhost-scsi-pci
>>>> - virtio-net-pci (ioeventfd on/off, vhost)
>>>>
>>>> Thank you very much!
>>>>
>>>> Dongli Zhang
>>>>
>>>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-03-30  7:29       ` Dongli Zhang
@ 2021-04-02  3:47         ` Jason Wang
  2021-04-05 20:00           ` Dongli Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2021-04-02  3:47 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


在 2021/3/30 下午3:29, Dongli Zhang 写道:
>
> On 3/28/21 8:56 PM, Jason Wang wrote:
>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>> Hi Jason,
>>>
>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>> the loss of doorbell kick, e.g.,
>>>>>
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>
>>>>>
>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>
>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>> or legacy IRQ).
>>>>>
>>>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>
>>>>>
>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>
>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>
>>>>>
>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>> dump the current counter of eventfd for queue=2.
>>>>>
>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>> struct eventfd_ctx {
>>>>>      kref = {
>>>>>        refcount = {
>>>>>          refs = {
>>>>>            counter = 4
>>>>>          }
>>>>>        }
>>>>>      },
>>>>>      wqh = {
>>>>>        lock = {
>>>>>          {
>>>>>            rlock = {
>>>>>              raw_lock = {
>>>>>                val = {
>>>>>                  counter = 0
>>>>>                }
>>>>>              }
>>>>>            }
>>>>>          }
>>>>>        },
>>>>>        head = {
>>>>>          next = 0xffff8f841dc08e18,
>>>>>          prev = 0xffff8f841dc08e18
>>>>>        }
>>>>>      },
>>>>>      count = 15, ---> eventfd is 15 !!!
>>>>>      flags = 526336
>>>>> }
>>>>>
>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>> with this interface.
>>>>>
>>>>> { "execute": "x-debug-device-event",
>>>>>      "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>                     "event": "kick", "queue": 2 } }
>>>>>
>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>> indicates something bad is in software that the 'kick' is lost.
>>>> What do you mean by "software" here? And it looks to me you're testing whether
>>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>>> sure how much value we could gain from a dedicated debug interface like this,
>>>> considering there are a lot of existing general-purpose debugging methods like
>>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>>> event_notifier_set() is only a very small fraction of the process of a virtqueue
>>>> kick, which is unlikely to be buggy. Considering usually the ioeventfd will be
>>>> offloaded to KVM, there's more of a chance that something is wrong in setting up
>>>> the ioeventfd instead of here. Irq is even more complicated.
>>> Thank you very much!
>>>
>>> I am not testing whether event_notifier_set() is called by virtio_queue_notify().
>>>
>>> The 'software' indicates the data processing and event notification mechanism
>>> involved with the virtio/vhost PV driver frontend. E.g., while the VM is waiting
>>> for an extra IRQ, the vhost side did not trigger the IRQ; suppose
>>> vring_need_event() erroneously returns false due to a corrupted ring buffer status.
>>
>> So there could be several factors that may block the notification:
>>
>> 1) eventfd bug (ioeventfd vs irqfd)
>> 2) wrong virtqueue state (either driver or device)
>> 3) missing barriers (either driver or device)
>> 4) Qemu bug (irqchip and routing)
>> ...
> This is not only about whether the notification is blocked.
>
> It can also be used to help narrow down and understand if there is any
> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
> drivers following the virtio spec; they are closely related to many other
> kernel components.
>
> Suppose IO is recovered after we inject an IRQ to vhost-scsi on purpose. We
> will then be able to analyze what may happen along the IO completion path,
> starting from when/where the IRQ is injected ... perhaps the root cause is not
> with virtio but blk-mq/scsi (this is just an example).
>
>
> In addition, this idea should help for vfio-pci. Suppose the developer of a
> specific device driver suspects IO/networking hangs because of loss of IRQ; we
> will be able to verify if that assumption is correct by injecting an IRQ on purpose.
>
> Therefore, this is not only about the virtio PV driver (frontend/backend), but
> is also used to help analyze issues related to the entire IO/networking/passthrough
> virtualization stack, especially in a production env where the issue can only be
> reproduced randomly.


So it looks to me you'd better have things like this in the
EventNotifier layer and introduce qmp commands to write/read that,
instead of starting with virtio-specific commands. Or it might be even
more helpful to start with some counters for event notifiers which could
be accessed via the monitor; that is much safer to start with for
debugging in a production environment. Having artificial events is
always dangerous.
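
A minimal sketch of that counter idea (simplified; the set_count field and
the trimmed error handling are hypothetical, not the current QEMU code):

#include <unistd.h>
#include <errno.h>
#include <stdint.h>

typedef struct EventNotifier {
    int rfd;
    int wfd;
    uint64_t set_count;   /* hypothetical: how many times we signalled */
} EventNotifier;

int event_notifier_set(EventNotifier *e)
{
    static const uint64_t value = 1;
    ssize_t ret;

    e->set_count++;       /* the monitor could report this counter instead
                           * of injecting an artificial kick/call */
    do {
        ret = write(e->wfd, &value, sizeof(value));
    } while (ret < 0 && errno == EINTR);

    return ret == sizeof(value) ? 0 : -1;
}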


>
>> Consider we want to debug a virtio issue; only 2) or 3) is what we really care about.
>> So for kick you did (assume vhost is on):
>>
>> virtio_device_event_kick()
>>      virtio_queue_notify()
>>          event_notifier_set()
>>
>> It looks to me you're actually testing if the ioeventfd is correctly set by Qemu.
>>
>> For call you did:
>>
>> virtio_device_event_call()
>>      event_notifier_set()
>>
>> A test of whether the irqfd is correctly set by Qemu. So all of those are not
>> virtio-specific, yet you introduce a virtio-specific command to debug non-virtio
>> functions.
>>
>> In the case of what you mentioned for vring_need_event(), what we really want is
>> to dump the virtqueue state from the guest. This might require some work on
>> extending the virtio spec (e.g. to dump device status like indices, event, wrap
>> counters).
> Suppose the issue is only randomly reproducible in a production env; we should
> always take 4) into consideration because we will not be able to know where
> the root cause is at the very beginning of the bug analysis.


So if it turns out to be an issue of the irqchip, how will you do the
debugging then? I guess what's really helpful is a qmp command to dump
the irqchip status/information.


>
>>
>>> This was initially proposed for vhost only and I was going to export
>>> ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized it would
>>> be better to implement this in QEMU.
>>>
>>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>>> KVM/vhost. The VM side sends requests to the ring buffer and kicks the doorbell
>>> (via ioeventfd), while the backend vhost side sends responses back and calls the
>>> IRQ (via irqfd).
>>>
>>> Unfortunately, sometimes there is an issue with virtio/vhost so that a kick/call
>>> was missed/ignored, or even never triggered. The example mentioned in the patchset
>>> cover letter is with virtio-net (I assume vhost=on), where a kick to the ioeventfd
>>> was ignored, due to a pci-bridge/hotplug issue.
>>
>> So this is not a good example since it was a chipset bug. You need to use other
>> tools to debug chipset code, don't you?
> As this issue is reproducible only randomly, we will not be able to realize it
> is a chipset bug at the very beginning.
>
> While the link is about vhost-net, it is applicable to vhost-scsi as well.
> Suppose DEBUG_UNASSIGNED is not enabled; the developer will need to investigate
> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
> and pci-hotplug, in order to identify the root cause.
>
> The "call/kick" interface is used to narrow down and verify the analysis,
> especially when many kernel components are involved.
>
>>
>>> The hotplug is within a very small window but the IO hangs permanently. I did
>>> test that kicking the doorbell again will help recover the IO, so that I would
>>> be able to conclude this was due to loss of kick.
>>>
>>> The loss of irq/doorbell is painful especially in a production environment where
>>> we are not able to attach to QEMU via gdb. While the patchset is only for QEMU,
>>> Xen PV drivers used to experience loss-of-IRQ issues as well, e.g., linux kernel
>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU binding").
>>
>> So looking at the git history we can see there is little possibility that the
>> missing event is due to virtio/vhost itself. So the commit you mention here is
>> not a good example either, since it's not a netfront/netbackend bug.
> I mentioned the xen issue just to explain that loss-of-event/irq issues may
> happen and are very painful. Another xen example (equivalent to KVM VFIO) is:
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7


Sorry, I can't figure out how this is related to VFIO or virtio. It
should be reproducible for devices without using eventfd?


>
> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>
>> So for the case of event call, what you did is:
>>
>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>                                       Error **errp)
>> {
>> #ifdef DEBUG_VIRTIO_EVENT
>>      printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
>>             object_get_canonical_path(OBJECT(vq->vdev)),
>>             vq->queue_index, eventfd);
>> #endif
>>
>>      if (eventfd) {
>>          virtio_set_isr(vq->vdev, 0x1);
>>          event_notifier_set(&vq->guest_notifier);
>>      } else {
>>          virtio_irq(vq);
>>      }
>> }
>>
>> This means, when eventfd is set, you bypass the MSI mask, which is very
>> dangerous if it is used in a production environment. And if you
>> check masking, it won't help a lot if the MSI is masked wrongly.
> You are right.
>
> Only the vhost-net is dangerous because it masks a vector by registering an
> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi will
> un-register the guest notifier.
>
>>
>>> This can help "narrow down" whether the IO/networking hang is due to a
>>> loss-of-IRQ/doorbell issue (or the VM's MSI-X being erroneously masked),
>>> especially in a production env. This can also be used as a workaround so that
>>> the VM owner will not need to reboot the VM.
>>
>> So having such an extra workaround is pretty dangerous in a production
>> environment where I think we need to be conservative, which means we need to
>> collect information instead of generating artificial events.
>>
>> And it doesn't help if the workaround can be triggered through a management API.
> I agree with this.
>
> This depends on the administrator. This workaround should only be used in very
> limited and special cases.
>
>>
>>> In addition, VFIO will benefit from it. We will be able to test whether injecting
>>> an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet) developers
>>> blame the hang on loss of IRQ with KVM/VFIO. (It seems there is more
>>> chance to lose an IRQ during CPU hotplug or changing IRQ affinity.)
>>>
>>>> I think we could not gain much from introducing a dedicated mechanism for
>>>> such a corner case.
>>> As replied by Dave for the prior RFC, QEMU already supports hmp_ioport_write to
>>> trigger an ioport write on purpose.
>>
>> If that applies, I would rather have an hmp_mem_write; then we can test both MSI
>> and doorbell. But again, they are very dangerous to be used in a production
>> environment.
> This is just not convenient for a production env administrator. We will need to
> first obtain the virtio pci info (e.g., via "lspci -vv"), and then prepare
> the command after calculating the address of the doorbell.
>
> Something bad may happen if the developer does not calculate the address correctly.


It won't be worse than hmp_ioport_write I think?


>
> It should be much easier for the developer to just ask the administrator to "call"
> queue X for a specific virtio device.


We can have commands like "info virtio" which can show all the
MSI/doorbell information for the user. Or limit its use to virtio
and vfio devices only to avoid unexpected results.
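
Purely as an illustration of what such a command might print (every field
and value below is hypothetical):

(qemu) info virtio
/machine/peripheral/vscsi0 [vhost-scsi-pci]
  queue 2: ioeventfd=on doorbell=0xfebf1000 msix-vector=3 masked=no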


>
>>
>>> The linux block layer also supports the below to kick the IO queue on purpose.
>>>
>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>
>> This might be fine for a hardware device but not for virtio. The device can
>> choose to poll the virtqueue instead of depending on the doorbell to work. And
>> for the networking subsystem, we don't have such stuff; instead, ethtool supports
>> dumping the ring and vendor-specific state, which could be used for dumping the
>> virtqueue state in this case.
> This is just another example to help explain the philosophy behind the
> "kick/call" idea: sometimes triggering the event on purpose will help us narrow
> down and verify our analysis of a kernel bug, especially a bug that is only
> randomly reproducible in a production environment.
>
>
> I understand it is possibly not proper to introduce such an interface to QEMU.
> That's why I initially sent out the RFC.
>
> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>
> In my opinion, this interface is pretty useful when the diagnosis involves many
> kernel components, or when developers from different components are working on
> the same bug, no matter whether the root cause is in virtio or not.


So for virtio, it's not hard to generate events without those interfaces. E.g.,
for networking you can generate some traffic and trace wherever you
suspect the event (kick/call) could be blocked.

I still prefer hmp_mem_write()/read(), which looks more generic; at the
same time, we can add more debug information like:

1) statistics such as eventfd counters
2) device information: register layout, doorbell/MSI-X information
3) irqchip information

Thanks


>
>
> Thank you very much!
>
> Dongli Zhang
>
>
>> Thanks
>>
>>
>>> Dongli Zhang
>>>
>>>> Thanks
>>>>
>>>>
>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>> struct eventfd_ctx {
>>>>>      kref = {
>>>>>        refcount = {
>>>>>          refs = {
>>>>>            counter = 4
>>>>>          }
>>>>>        }
>>>>>      },
>>>>>      wqh = {
>>>>>        lock = {
>>>>>          {
>>>>>            rlock = {
>>>>>              raw_lock = {
>>>>>                val = {
>>>>>                  counter = 0
>>>>>                }
>>>>>              }
>>>>>            }
>>>>>          }
>>>>>        },
>>>>>        head = {
>>>>>          next = 0xffff8f841dc08e18,
>>>>>          prev = 0xffff8f841dc08e18
>>>>>        }
>>>>>      },
>>>>>      count = 16, ---> eventfd incremented to 16 !!!
>>>>>      flags = 526336
>>>>> }
>>>>>
>>>>>
>>>>> Original RFC link:
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>>>
>>>>>
>>>>> Changed since RFC:
>>>>>      - add support for more virtio/vhost pci devices
>>>>>      - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
>>>>>        mischievous command had been used
>>>>>      - fix grammar error (s/lost/loss/)
>>>>>      - change version to 6.1
>>>>>      - fix incorrect example in qapi/qdev.json
>>>>>      - manage event types with enum/array, instead of hard coding
>>>>>
>>>>>
>>>>> Dongli Zhang (6):
>>>>>       qdev: introduce qapi/hmp command for kick/call event
>>>>>       virtio: introduce helper function for kick/call device event
>>>>>       virtio-blk-pci: implement device event interface for kick/call
>>>>>       virtio-scsi-pci: implement device event interface for kick/call
>>>>>       vhost-scsi-pci: implement device event interface for kick/call
>>>>>       virtio-net-pci: implement device event interface for kick/call
>>>>>
>>>>>     hmp-commands.hx                 | 14 ++++++++
>>>>>     hw/block/virtio-blk.c           |  9 +++++
>>>>>     hw/net/virtio-net.c             |  9 +++++
>>>>>     hw/scsi/vhost-scsi.c            |  6 ++++
>>>>>     hw/scsi/virtio-scsi.c           |  9 +++++
>>>>>     hw/virtio/vhost-scsi-pci.c      | 10 ++++++
>>>>>     hw/virtio/virtio-blk-pci.c      | 10 ++++++
>>>>>     hw/virtio/virtio-net-pci.c      | 10 ++++++
>>>>>     hw/virtio/virtio-scsi-pci.c     | 10 ++++++
>>>>>     hw/virtio/virtio.c              | 64 ++++++++++++++++++++++++++++++++++++
>>>>>     include/hw/qdev-core.h          |  9 +++++
>>>>>     include/hw/virtio/vhost-scsi.h  |  3 ++
>>>>>     include/hw/virtio/virtio-blk.h  |  2 ++
>>>>>     include/hw/virtio/virtio-net.h  |  3 ++
>>>>>     include/hw/virtio/virtio-scsi.h |  3 ++
>>>>>     include/hw/virtio/virtio.h      |  3 ++
>>>>>     include/monitor/hmp.h           |  1 +
>>>>>     qapi/qdev.json                  | 30 +++++++++++++++++
>>>>>     softmmu/qdev-monitor.c          | 56 +++++++++++++++++++++++++++++++
>>>>>     19 files changed, 261 insertions(+)
>>>>>
>>>>>
>>>>> I did tests with below cases.
>>>>>
>>>>> - virtio-blk-pci (ioeventfd on/off, iothread, live migration)
>>>>> - virtio-scsi-pci (ioeventfd on/off)
>>>>> - vhost-scsi-pci
>>>>> - virtio-net-pci (ioeventfd on/off, vhost)
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> Dongli Zhang
>>>>>
>>>>>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-02  3:47         ` Jason Wang
@ 2021-04-05 20:00           ` Dongli Zhang
  2021-04-06  1:55             ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-04-05 20:00 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz



On 4/1/21 8:47 PM, Jason Wang wrote:
> 
> 在 2021/3/30 下午3:29, Dongli Zhang 写道:
>>
>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>>> Hi Jason,
>>>>
>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>> the loss of doorbell kick, e.g.,
>>>>>>
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>
>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>>> or legacy IRQ).
>>>>>>
>>>>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>
>>>>>>
>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>
>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>
>>>>>>
>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>
>>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>>> struct eventfd_ctx {
>>>>>>      kref = {
>>>>>>        refcount = {
>>>>>>          refs = {
>>>>>>            counter = 4
>>>>>>          }
>>>>>>        }
>>>>>>      },
>>>>>>      wqh = {
>>>>>>        lock = {
>>>>>>          {
>>>>>>            rlock = {
>>>>>>              raw_lock = {
>>>>>>                val = {
>>>>>>                  counter = 0
>>>>>>                }
>>>>>>              }
>>>>>>            }
>>>>>>          }
>>>>>>        },
>>>>>>        head = {
>>>>>>          next = 0xffff8f841dc08e18,
>>>>>>          prev = 0xffff8f841dc08e18
>>>>>>        }
>>>>>>      },
>>>>>>      count = 15, ---> eventfd is 15 !!!
>>>>>>      flags = 526336
>>>>>> }
>>>>>>
>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>> with this interface.
>>>>>>
>>>>>> { "execute": "x-debug-device-event",
>>>>>>      "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>                     "event": "kick", "queue": 2 } }
>>>>>>
>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>> indicates something bad is in software that the 'kick' is lost.
>>>>> What do you mean by "software" here? And it looks to me you're testing whether
>>>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>>>> sure how much value we could gain from a dedicated debug interface like this,
>>>>> considering there are a lot of existing general-purpose debugging methods like
>>>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>>>> event_notifier_set() is only a very small fraction of the process of a virtqueue
>>>>> kick, which is unlikely to be buggy. Considering usually the ioeventfd will be
>>>>> offloaded to KVM, there's more of a chance that something is wrong in setting up
>>>>> the ioeventfd instead of here. Irq is even more complicated.
>>>> Thank you very much!
>>>>
>>>> I am not testing whether event_notifier_set() is called by
>>>> virtio_queue_notify().
>>>>
>>>> The 'software' indicates the data processing and event notification mechanism
>>>> involved with the virtio/vhost PV driver frontend. E.g., while the VM is waiting
>>>> for an extra IRQ, the vhost side did not trigger the IRQ; suppose
>>>> vring_need_event() erroneously returns false due to a corrupted ring buffer status.
>>>
>>> So there could be several factors that may block the notification:
>>>
>>> 1) eventfd bug (ioeventfd vs irqfd)
>>> 2) wrong virtqueue state (either driver or device)
>>> 3) missing barriers (either driver or device)
>>> 4) Qemu bug (irqchip and routing)
>>> ...
>> This is not only about whether the notification is blocked.
>>
>> It can also be used to help narrow down and understand if there is any
>> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
>> drivers following the virtio spec; they are closely related to many other
>> kernel components.
>>
>> Suppose IO is recovered after we inject an IRQ to vhost-scsi on purpose. We
>> will then be able to analyze what may happen along the IO completion path,
>> starting from when/where the IRQ is injected ... perhaps the root cause is not
>> with virtio but blk-mq/scsi (this is just an example).
>>
>>
>> In addition, this idea should help for vfio-pci. Suppose the developer of a
>> specific device driver suspects IO/networking hangs because of loss of IRQ; we
>> will be able to verify if that assumption is correct by injecting an IRQ on
>> purpose.
>>
>> Therefore, this is not only about the virtio PV driver (frontend/backend), but
>> is also used to help analyze issues related to the entire IO/networking/passthrough
>> virtualization stack, especially in a production env where the issue can only be
>> reproduced randomly.
> 
> 
> So it looks to me you'd better have things like this in the EventNotifier
> layer and introduce qmp commands to write/read that, instead of starting with
> virtio-specific commands. Or it might be even more helpful to start with some
> counters for event notifiers which could be accessed via the monitor; that is
> much safer to start with for debugging in a production environment. Having
> artificial events is always dangerous.


The EventNotifier is just an fd used by different QEMU components. There is no
way to track each EventNotifier used by a QEMU process, so I do not think we
are able to track at the EventNotifier layer, unless we add extra code to track
the init/uninit of each eventfd, or modify the kernel.

That's why I introduced "DeviceEvent event" to "struct DeviceClass", so that each
device type will be able to customize its own way to track its eventfd list.
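
Roughly like the below (a sketch only; the exact hook signature in the patchset
may differ):

typedef enum DeviceEventType {
    DEVICE_EVENT_KICK,
    DEVICE_EVENT_CALL,
} DeviceEventType;

struct DeviceClass {
    /* ... existing fields ... */
    /* per-device-type handler behind x-debug-device-event */
    void (*event)(DeviceState *dev, DeviceEventType event, int queue,
                  Error **errp);
};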


Would you prefer "write to a specific eventfd for a specific QEMU device",
instead of "kick/call for a specific device"?


> 
> 
>>
>>> Consider we want to debug a virtio issue; only 2) or 3) is what we really care about.
>>> So for kick you did (assume vhost is on):
>>>
>>> virtio_device_event_kick()
>>>      virtio_queue_notify()
>>>          event_notifier_set()
>>>
>>> It looks to me you're actually testing if the ioeventfd is correctly set by Qemu.
>>>
>>> For call you did:
>>>
>>> virtio_device_event_call()
>>>      event_notifier_set()
>>>
>>> A test of whether the irqfd is correctly set by Qemu. So all of those are not
>>> virtio-specific, yet you introduce a virtio-specific command to debug non-virtio
>>> functions.
>>>
>>> In the case of what you mentioned for vring_need_event(), what we really want is
>>> to dump the virtqueue state from the guest. This might require some work on
>>> extending the virtio spec (e.g. to dump device status like indices, event, wrap
>>> counters).
> Suppose the issue is only randomly reproducible in a production env; we should
> always take 4) into consideration because we will not be able to know where
> the root cause is at the very beginning of the bug analysis.
> 
> 
> So if it turns out to be an issue of the irqchip, how will you do the debugging
> then? I guess what's really helpful is a qmp command to dump the irqchip
> status/information.

Thank you very much for the suggestion. That is a different problem and we may
consider it as future work.

This patchset is about introducing changes/events to help narrow down where the
root cause may be, in order to facilitate diagnosis (especially for prod env
issues that are not easy to reproduce).

> 
> 
>>
>>>
>>>> This was initially proposed for vhost only and I was going to export
>>>> ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized it would
>>>> be better to implement this in QEMU.
>>>>
>>>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>>>> KVM/vhost. The VM side sends requests to the ring buffer and kicks the doorbell
>>>> (via ioeventfd), while the backend vhost side sends responses back and calls the
>>>> IRQ (via irqfd).
>>>>
>>>> Unfortunately, sometimes there is an issue with virtio/vhost so that a kick/call
>>>> was missed/ignored, or even never triggered. The example mentioned in the patchset
>>>> cover letter is with virtio-net (I assume vhost=on), where a kick to the ioeventfd
>>>> was ignored, due to a pci-bridge/hotplug issue.
>>>
>>> So this is not a good example since it was a chipset bug. You need to use other
>>> tools to debug chipset code, don't you?
>> As this issue is reproducible only randomly, we will not be able to realize it
>> is a chipset bug at the very beginning.
>>
>> While the link is about vhost-net, it is applicable to vhost-scsi as well.
>> Suppose DEBUG_UNASSIGNED is not enabled; the developer will need to investigate
>> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
>> and pci-hotplug, in order to identify the root cause.
>>
>> The "call/kick" interface is used to narrow down and verify the analysis,
>> especially when many kernel components are involved.
>>
>>>
>>>> The hotplug is within a very small window but the IO hangs permanently. I did
>>>> test that kicking the doorbell again will help recover the IO, so that I would
>>>> be able to conclude this was due to loss of kick.
>>>>
>>>> The loss of irq/doorbell is painful especially in a production environment where
>>>> we are not able to attach to QEMU via gdb. While the patchset is only for QEMU,
>>>> Xen PV drivers used to experience loss-of-IRQ issues as well, e.g., linux kernel
>>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU binding").
>>>
>>> So looking at the git history we can see there is little possibility that the
>>> missing event is due to virtio/vhost itself. So the commit you mention here is
>>> not a good example either, since it's not a netfront/netbackend bug.
>> I mentioned the xen issue just to explain that loss-of-event/irq issues may
>> happen and are very painful. Another xen example (equivalent to KVM VFIO) is:
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7
> 
> 
> 
> Sorry, I can't figure out how this is related to VFIO or virtio. It should be
> reproducible for devices without using eventfd?
> 

Yes, although it does not involve eventfd, other drivers/virtualization may
encounter the loss of irq/kick as well. There is no relation between xen and
vfio/virtio.

That's why a diagnostic interface is appreciated.

In my opinion, the 'diagnostic' is not only to collect data, but also to
introduce an event/change (e.g., inject an IRQ) and then monitor/observe what
will happen to the stalled VM.

> 
>>
>> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
>> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>>
>>> So for the case of event call, what you did is:
>>>
>>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>>                                       Error **errp)
>>> {
>>> #ifdef DEBUG_VIRTIO_EVENT
>>>      printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
>>>             object_get_canonical_path(OBJECT(vq->vdev)),
>>>             vq->queue_index, eventfd);
>>> #endif
>>>
>>>      if (eventfd) {
>>>          virtio_set_isr(vq->vdev, 0x1);
>>>          event_notifier_set(&vq->guest_notifier);
>>>      } else {
>>>          virtio_irq(vq);
>>>      }
>>> }
>>>
>>> This means, when eventfd is set, you bypass the MSI mask, which is very
>>> dangerous if it is used in a production environment. And if you
>>> check masking, it won't help a lot if the MSI is masked wrongly.
>> You are right.
>>
>> Only the vhost-net is dangerous because it masks a vector by registering an
>> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi will
>> un-register the guest notifier.
>>
>>>
>>>> This can help "narrow down" whether the IO/networking hang is due to a
>>>> loss-of-IRQ/doorbell issue (or the VM's MSI-X being erroneously masked),
>>>> especially in a production env. This can also be used as a workaround so
>>>> that the VM owner will not need to reboot the VM.
>>>
>>> So having such an extra workaround is pretty dangerous in a production
>>> environment where I think we need to be conservative, which means we need to
>>> collect information instead of generating artificial events.
>>>
>>> And it doesn't help if the workaround can be triggered through a management API.
>> I agree with this.
>>
>> This depends on the administrator. This workaround should only be used in very
>> limited and special cases.
>>
>>>
>>> In addition, VFIO will benefit from it. We will be able to test whether
>>> injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet)
>>> developers blame the hang on loss of IRQ with KVM/VFIO. (It seems there is
>>> more chance to lose an IRQ during CPU hotplug or changing IRQ affinity.)
>>>>
>>>>> I think we could not gain much from introducing a dedicated mechanism for
>>>>> such a corner case.
>>>> As replied by Dave for the prior RFC, QEMU already supports hmp_ioport_write to
>>>> trigger an ioport write on purpose.
>>>
>> If that applies, I would rather have an hmp_mem_write; then we can test both MSI
>> and doorbell. But again, they are very dangerous to be used in a production
>> environment.
> This is just not convenient for a production env administrator. We will need to
> first obtain the virtio pci info (e.g., via "lspci -vv"), and then prepare
> the command after calculating the address of the doorbell.
>
> Something bad may happen if the developer does not calculate the address
> correctly.
> 
> 
> It won't be worse than hmp_ioport_write I think?

I always believe it is worth adding hmp_mem_write().

While it won't be worse than hmp_ioport_write(), in my opinion, it is not as
easy/convenient as writing to an eventfd.

> 
> 
>>
>> It should be much easier for the developer to just ask the administrator to "call"
>> queue X for a specific virtio device.
> 
> 
> We can have commands like "info virtio" which can show all the MSI/doorbell
> information for the user. Or limit its use to virtio and vfio devices only to
> avoid unexpected results.

So far the method used by this patchset is to introduce "DeviceEvent event" to
"struct DeviceClass".

Only virtio-pci-xxx and vfio (future work) will implement this interface.


> 
> 
>>
>>>
>>>> The linux block layer also supports the below to kick the IO queue on purpose.
>>>>
>>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>>
>>> This might be fine for a hardware device but not for virtio. The device can
>>> choose to poll the virtqueue instead of depending on the doorbell to work. And
>>> for the networking subsystem, we don't have such stuff; instead, ethtool supports
>>> dumping the ring and vendor-specific state, which could be used for dumping the
>>> virtqueue state in this case.
>> This is just another example to help explain the philosophy behind the
>> "kick/call" idea: sometimes triggering the event on purpose will help us narrow
>> down and verify our analysis of a kernel bug, especially a bug that is only
>> randomly reproducible in a production environment.
>>
>>
>> I understand it is possibly not proper to introduce such an interface to QEMU.
>> That's why I initially sent out the RFC.
>>
>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>
>> In my opinion, this interface is pretty useful when the diagnosis involves many
>> kernel components, or when developers from different components are working on
>> the same bug, no matter whether the root cause is in virtio or not.
> 
> 
> So for virtio, it's not hard to generate events without those interfaces. E.g.,
> for networking you can generate some traffic and trace wherever you suspect
> the event (kick/call) could be blocked.

Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and
its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be
full as well. I do not have a way to generate traffic for the RX path in such a
situation.
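
(For what it's worth, in that situation the stuck ring can at least be observed
from live crash, similar to the eventfd dump earlier; the address below is a
placeholder for the tun_file's tx_ring:)

crash> struct ptr_ring <address of tun_file.tx_ring>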

> 
> I still prefer hmp_mem_write()/read(), which looks more generic; at the same
> time, we can add more debug information like:
>
> 1) statistics such as eventfd counters
> 2) device information: register layout, doorbell/MSI-X information
> 3) irqchip information

Would you mind helping with the questions below?

1. Regardless of whether it is kick/call or hmp_mem_write(), is it safe to add
such interfaces? I think it is safe because:

(1) This affects only the specific VM (QEMU), not all/other VMs.

(2) It is dangerous only when the sysadmin triggers the events on purpose. If this
interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
inject an uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash the
kernel) would be dangerous as well.

(3) While this is implemented for only vhost-scsi-pci and vhost-vhost-pci, I do
not see an issue for the host kernel. It would be a security bug if reading/writing
an eventfd from userspace could crash kernel space.

(4) We primarily use this interface when the VM is running into an issue (unless
we use it as a workaround).


2. Is it fine to add such an interface to QEMU upstream, or is such an interface
not good for upstream, so that it should be added only when QEMU is customized
for a specific product's usage?


We may then discuss the "how", e.g., hmp_mem_write() vs. kick/call, if it is fine
to add such interfaces.

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-05 20:00           ` Dongli Zhang
@ 2021-04-06  1:55             ` Jason Wang
  2021-04-06  8:43               ` Dongli Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2021-04-06  1:55 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


在 2021/4/6 上午4:00, Dongli Zhang 写道:
>
> On 4/1/21 8:47 PM, Jason Wang wrote:
>> 在 2021/3/30 下午3:29, Dongli Zhang 写道:
>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>>>> Hi Jason,
>>>>>
>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>
>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>
>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>>>> or legacy IRQ).
>>>>>>>
>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>
>>>>>>>
>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>
>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>>
>>>>>>>
>>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>
>>>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>>>> struct eventfd_ctx {
>>>>>>>       kref = {
>>>>>>>         refcount = {
>>>>>>>           refs = {
>>>>>>>             counter = 4
>>>>>>>           }
>>>>>>>         }
>>>>>>>       },
>>>>>>>       wqh = {
>>>>>>>         lock = {
>>>>>>>           {
>>>>>>>             rlock = {
>>>>>>>               raw_lock = {
>>>>>>>                 val = {
>>>>>>>                   counter = 0
>>>>>>>                 }
>>>>>>>               }
>>>>>>>             }
>>>>>>>           }
>>>>>>>         },
>>>>>>>         head = {
>>>>>>>           next = 0xffff8f841dc08e18,
>>>>>>>           prev = 0xffff8f841dc08e18
>>>>>>>         }
>>>>>>>       },
>>>>>>>       count = 15, ---> eventfd is 15 !!!
>>>>>>>       flags = 526336
>>>>>>> }
>>>>>>>
>>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>>> with this interface.
>>>>>>>
>>>>>>> { "execute": "x-debug-device-event",
>>>>>>>       "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>>                      "event": "kick", "queue": 2 } }
>>>>>>>
>>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>>> indicates something bad is in software that the 'kick' is lost.
>>>>>> What do you mean by "software" here? And it looks to me you're testing whether
>>>>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>>>>> sure how much value we could gain from a dedicated debug interface like this,
>>>>>> considering there are a lot of existing general-purpose debugging methods like
>>>>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>>>>> event_notifier_set() is only a very small fraction of the process of virtqueue
>>>>>> kick, which is unlikely to be buggy. Considering the ioeventfd is usually
>>>>>> offloaded to KVM, there is more of a chance that something is wrong in setting
>>>>>> up the ioeventfd instead of here. Irq is even more complicated.
>>>>> Thank you very much!
>>>>>
>>>>> I am not testing whether event_notifier_set() is called by
>>>>> virtio_queue_notify().
>>>>>
>>>>> The 'software' indicates the data processing and event notification mechanism
>>>>> involved with virtio/vhost PV driver frontend. E.g., while VM is waiting for an
>>>>> extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event()
>>>>> erroneously returns false due to corrupted ring buffer status.
>>>> So there could be several factors that may block the notification:
>>>>
>>>> 1) eventfd bug (ioeventfd vs irqfd)
>>>> 2) wrong virtqueue state (either driver or device)
>>>> 3) missing barriers (either driver or device)
>>>> 4) Qemu bug (irqchip and routing)
>>>> ...
>>> This is not only about whether notification is blocked.
>>>
>>> It can also be used to help narrow down and understand if there is any
>>> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
>>> drivers following virtio spec. It is closely related to many of other kernel
>>> components.
>>>
>>> Suppose IO was recovered after we inject an IRQ to vhost-scsi on purpose, we
>>> will be able to analyze what may happen along the IO completion path starting
>>> from when/where the IRQ is injected ... perhaps the root cause is not with
>>> virtio but blk-mq/scsi (this is just an example).
>>>
>>>
>>> In addition, this idea should help for vfio-pci. Suppose the developer for a
>>> specific device driver suspects IO/networking hangs because of loss of IRQ, we
>>> will be able to verify if that assumption is correct by injecting an IRQ on
>>> purpose.
>>>
>>> Therefore, this is not only about virtio PV driver (frontend/backend), but also
>>> used to help analyze the issue related to entire IO/networking/passthrough
>>> virtualization stacks, especially in production env where the issue can only be
>>> reproduced randomly.
>>
>> So it looks to me you'd better have things like this in the EventNotifier
>> layer and introduce qmp commands to write/read that, instead of starting with
>> virtio-specific commands. Or it might be even more helpful to start from some
>> counters for event notifiers which could be accessed via the monitor to help
>> with debugging, which is much safer in a production environment. Having
>> artificial events is always dangerous.
>
> The EventNotifier is just an fd used by different QEMU components. There is no
> way to track each EventNotifier used by a QEMU process, so I do not think we
> are able to track at the EventNotifier layer, unless we add extra code to track
> the init/uninit of each eventfd, or modify the kernel.
>
> That's why I introduced "DeviceEvent event" to "struct DeviceClass" so that each
> device type will be able to customize its own way to track its eventfd list.
>
>
> Would you prefer "write to a specific eventfd for a specific QEMU device",
> instead of "kick/call for a specific device"?


It might be better. But note that eventfd is Linux-specific; that's why we need
to do it at a higher level (the EventNotifier level, to make it work for e.g.
Windows).

And it might be even better to start with reading the counters.
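 
For reference, the portable abstraction here is QEMU's EventNotifier, backed by
an eventfd or a pipe on POSIX hosts and by an event HANDLE on win32, so a
trigger or counter at this layer would stay portable. A minimal usage sketch of
the existing API, just to illustrate the layer being discussed:

#include "qemu/osdep.h"
#include "qemu/event_notifier.h"

static void event_notifier_example(void)
{
    EventNotifier n;

    event_notifier_init(&n, 0);    /* eventfd on Linux, HANDLE on win32 */
    event_notifier_set(&n);        /* both "kick" and "call" reduce to this */
    if (event_notifier_test_and_clear(&n)) {
        /* a pending event was consumed here */
    }
    event_notifier_cleanup(&n);
}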


>
>
>>
>>>> Considering we want to debug a virtio issue, only 2) or 3) is what we really care about.
>>>> So for kick you did (assume vhost is on):
>>>>
>>>> virtio_device_event_kick()
>>>>       virtio_queue_notify()
>>>>           event_notifier_set()
>>>>
>>>> It looks to me you're actually testing if ioeventfd is correctly set by Qemu.
>>>>
>>>> For call you did:
>>>>
>>>> virtio_device_event_call()
>>>>       event_notifier_set()
>>>>
>>>> A test of whether irqfd is correctly set by Qemu. So all of those are not
>>>> virtio-specific stuffs, but you introduce a virtio-specific command to debug
>>>> non-virtio functions.
>>>>
>>>> In the case of what you mentioned for vring_need_event(), what we really want
>>>> is to dump the virtqueue state from the guest. This might require some work on
>>>> extending the virtio spec (e.g. to dump device status like indices, event, wrap
>>>> counters).
>>> Suppose the issue is only randomly reproducible in production env, we should
>>> always take 4) into consideration because we will not be able to know where is
>>> the root cause at the very beginning of bug analysis.
>>
>> So if it turns out to be an issue of the irqchip, how will you do the debugging
>> then? I guess what's really helpful is a qmp command to dump irqchip
>> status/information.
> Thank you very much for suggestion. That will be a different problem and we may
> consider as future work.
>
> This patchset is about introducing changes/events to help narrow down where the
> root cause may be, in order to facilitate diagnosis (especially for prod env
> issues that are not easy to reproduce).
>
>>
>>>>> This was initially proposed for vhost only and I was going to export
>>>>> ioeventfd/irqfd from vhost to admin via sysfs. Finally, I realized I would
>>>>> better implement this at QEMU.
>>>>>
>>>>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>>>>> KVM/vhost. The VM side sends requests to ring buffer and kicks the doorbell
>>>>> (via
>>>>> ioeventfd), while the backend vhost side sends responses back and calls the IRQ
>>>>> (via ioeventfd).
>>>>>
>>>>> Unfortunately, sometimes there is issue with virtio/vhost so that kick/call was
>>>>> missed/ignored, or even never triggered. The example mentioned in the patchset
>>>>> cover letter is with virtio-net (I assume vhost=on), where a kick to ioeventfd
>>>>> was ignored, due to pci-bridge/hotplug issue.
>>>> So this is not a good example since it was a chipset bug. You need to use
>>>> another tool to debug chipset code, don't you?
>>> As this issue is reproducible only randomly, we will not be able to realize it
>>> is a chipset bug at the very beginning.
>>>
>>> While the link is about vhost-net, it is applicable to vhost-scsi as well.
>>> Suppose DEBUG_UNASSIGNED is not enabled, the developer will need to investigate
>>> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
>>> and pci-hotplug, in order to identify the root cause.
>>>
>>> The "call/kick" interface is used to narrow down and verify the analysis,
>>> especially when many kernel components are involved.
>>>
>>>>> The hotplug window is very small but the IO hangs permanently. I did test
>>>>> that kicking the doorbell again helps recover the IO, so that I was able to
>>>>> conclude this was due to loss of the kick.
>>>>>
>>>>> The loss of irq/doorbell is painful especially in production environment where
>>>>> we are not able to attach to QEMU via gdb. While the patchset is only for QEMU,
>>>>> Xen PV driver used to experience loss of IRQ issue as well, e.g., linux kernel
>>>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU
>>>>> binding").
>>>> So looking at the git history we can see it is unlikely that the missed event
>>>> is due to virtio/vhost itself. So the commit you mention here is not a good
>>>> example either, since it's not a netfront/netbackend bug.
>>> I mentioned the xen issue just to explain that the loss of event/irq issue may
>>> happen and is very painful. Another xen example is (equivalent to KVM VFIO):
>>>
>>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7
>>
>>
>> Sorry, I can't figure out how this is related to VFIO or virtio. It should be
>> reproducible for devices without using eventfd?
>>
> Yes, although not involving eventfd, other drivers/virtualization may encounter
> the loss of irq/kick as well. There is no relation between xen and vfio/virtio.
>
> That's why a diagnostic interface is appreciated.
>
> In my opinion, the 'diagnostic' is not only to collect data,


Usually, collecting the data is the first step :)


>   but also to
> introduce event/change (e.g., inject IRQ) and then monitor/observe what will
> happen to the stalled VM.


It might be helpful yes, but it's also very dangerous.


>
>>> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
>>> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>>>
>>>> So for the case of event call, what you did is:
>>>>
>>>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>>>                                        Error **errp)
>>>> {
>>>> #ifdef DEBUG_VIRTIO_EVENT
>>>>       printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
>>>>              object_get_canonical_path(OBJECT(vq->vdev)),
>>>>              vq->queue_index, eventfd);
>>>> #endif
>>>>
>>>>       if (eventfd) {
>>>>           virtio_set_isr(vq->vdev, 0x1);
>>>>           event_notifier_set(&vq->guest_notifier);
>>>>       } else {
>>>>           virtio_irq(vq);
>>>>       }
>>>> }
>>>>
>>>> This means, when eventfd is set, you bypass the MSI mask, which is very
>>>> dangerous to use in a production environment. And if you check masking, it
>>>> won't help a lot if the MSI is masked wrongly.
>>> You are right.
>>>
>>> Only the vhost-net is dangerous because it masks a vector by registering an
>>> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi will
>>> un-register the guest notifier.
>>>
>>>>> This can help "narrow down" if the IO/networking hang is due to loss of
>>>>> IRQ/doorbell issue (or VM MSI-x is erroneously masked), especially in
>>>>> production
>>>>> env. This can also be used as a workaround so that VM owner will not need to
>>>>> reboot VM.
>>>> So having such an extra workaround is pretty dangerous in a production environment
>>>> where I think we need to be conservative which means we need to collect
>>>> information instead of generating artificial event.
>>>>
>>>> And it doesn't help if the workaround can be triggered through the management API.
>>> I agree with this.
>>>
>>> This depends on the administrator. This workaround should only be used in very
>>> limited and special cases.
>>>
>>>>> In addition, the VFIO will benefit from it. We will be able to test whether
>>>>> injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet)
>>>>> developers blame the hang on loss of IRQ with KVM/VFIO. (It seems there is more
>>>>> chance to lose an IRQ during CPU hotplug or when changing IRQ affinity).
>>>>>
>>>>>> I think we could not gain much from introducing a dedicated mechanism for
>>>>>> such a corner case.
>>>>> As replied by Dave for prior RFC, the QEMU already supports hmp_ioport_write to
>>>>> trigger an ioport write on purpose.
>>>> If that applies, I would rather have a hmp_mem_write; then we can test both MSI
>>>> and doorbell. But again, they are very dangerous to be used in a production
>>>> environment.
>>> This is just not convenient for a production env administrator. We will need to
>>> first obtain the virtio pci info (e.g., via "lspci -vv"), and then prepare the
>>> command after calculating the address of the doorbell.
>>>
>>> Something bad may happen if the developer did not calculate the address
>>> correctly.
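>>>
>>> To illustrate the arithmetic (a hypothetical helper; the real values must be
>>> read from the device's virtio-pci notify capability, and the example inputs
>>> are made up):
>>>
>>> #include <stdint.h>
>>>
>>> /* modern virtio-pci (virtio 1.0): the queue doorbell lives at
>>>  * notify_base + queue_notify_off * notify_off_multiplier, where
>>>  * notify_base = BAR base + notify capability offset */
>>> static uint64_t doorbell_addr(uint64_t notify_base,
>>>                               uint32_t notify_off_multiplier,
>>>                               uint16_t queue_notify_off)
>>> {
>>>     return notify_base +
>>>            (uint64_t)queue_notify_off * notify_off_multiplier;
>>> }
>>>
>>> /* e.g. doorbell_addr(0xfe000000, 4, 2) == 0xfe000008 for queue 2 */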
>>
>> It won't be worse than hmp_ioport_write I think?
> I always believe it is worth adding hmp_mem_write().
>
> While it won't be worse than hmp_ioport_write(), in my opinion, it is not as
> easy/convenient as writing to the eventfd.
>
>>
>>> It should be much easier for the developer to just ask the administrator to
>>> "call" queue X for a specific virtio device.
>>
>> We can have commands like "info virtio" which can show all the MSI/doorbell
>> information for the user. Or limit its use to virtio and vfio devices only to
>> avoid unexpected results.
> So far the method by this patchset is to introduce "DeviceEvent event" to
> "struct DeviceClass".
>
> Only virtio-pci-xxx and vfio (future work) will implement this interface.
>
>
>>
>>>>> The linux block layer also supports the below to kick the IO queue on purpose.
>>>>>
>>>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>>> This might be fine for a hardware device but not virtio. The device can choose
>>>> to poll the virtqueue instead of depending on the doorbell to work. And for the
>>>> networking subsystem, we don't have such stuff; instead, ethtool supports
>>>> dumping rings and vendor-specific stuff which could be used for dumping
>>>> virtqueue state in this case.
>>> This is just another example to help explain the philosophy behind the
>>> "kick/call" idea: sometimes triggering the event on purpose will help us narrow
>>> down and verify our analysis of a kernel bug, especially a bug that is only
>>> randomly reproducible in a production environment.
>>>
>>>
>>> I understand it is possibly not proper to introduce such an interface to QEMU.
>>> That's why I sent out the RFC first.
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>
>>> In my opinion, this interface is pretty useful when the diagnosis involves many
>>> kernel components, or when developers from different components are working on
>>> the same bug, no matter whether the root cause is in virtio or not.
>>
>> So for virtio, it's not hard to trigger events without such an interface. E.g.,
>> for networking you can generate some traffic and trace wherever you suspect
>> the event (kick/call) could be blocked.
> Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and
> its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be
> full as well. I do not have a way to generate traffic for the RX path in such a
> situation.


Right, but as discussed, we need an interface to dump virtqueue state; that
would be a very easy place to start.
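 
A minimal sketch of such a dump, assuming the monitor handler has already
resolved the device path to a VirtIODevice (the wrapper function is
hypothetical; the accessors below already exist in hw/virtio):

#include "qemu/osdep.h"
#include "qemu/qemu-print.h"
#include "hw/virtio/virtio.h"

static void debug_dump_vq_state(VirtIODevice *vdev, int n)
{
    /* the indices that matter for "was a kick/call lost?" analysis */
    qemu_printf("%s queue %d: size=%d last_avail_idx=%u\n",
                vdev->name, n,
                virtio_queue_get_num(vdev, n),
                virtio_queue_get_last_avail_idx(vdev, n));
}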


>
>> I still prefer hmp_mem_write()/read() which looks more generic; at the same
>> time, we can add more debug information like:
>>
>> 1) statistics like eventfd counters
>> 2) device information, register layout, doorbell/MSI-X information
>> 3) irqchip information
> Would you mind helping with the questions below?
>
> 1. Regardless of whether it is kick/call or hmp_mem_write(), is it safe to add
> such interfaces? I think it is safe because:
>
> (1) This affects only specific VM (QEMU), not all/others.
>
> (2) It is dangerous only when sysadmin triggers the events on purpose. If this
> interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
> inject uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash kernel)
> will be dangerous as well.
>
> (3) While this is implemented only for vhost-scsi-pci and vhost-vhost-pci, I do
> not see an issue for the host kernel. It would be a security bug if
> reading/writing an eventfd from userspace could crash the kernel.
>
> (4) We primarily use this interface when the VM is running into an issue
> (unless we use it as a workaround).


Besides the above, I think it's only "safe" if we clearly define the
semantics of this command. E.g.:

1) Does it work at EventNotifier (eventfd) level or virtio/vfio level?
2) Can it bypass the masking?


>
>
> 2. Is it fine to add such an interface to upstream QEMU, or is such an
> interface unsuitable for upstream, so that it should be added only when QEMU is
> customized for a specific product's usage?


We can listen to the comments from other experts on the list. But I
think having a generic trigger at the EventNotifier level should be ok.

Thanks


>
>
> If it is fine to add such interfaces, we may then discuss the how, e.g.,
> hmp_mem_write() vs. kick/call.
>
> Thank you very much!
>
> Dongli Zhang
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-06  1:55             ` Jason Wang
@ 2021-04-06  8:43               ` Dongli Zhang
  2021-04-06 23:27                 ` Dongli Zhang
  2021-04-07  2:18                 ` Jason Wang
  0 siblings, 2 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-04-06  8:43 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz



On 4/5/21 6:55 PM, Jason Wang wrote:
> 
> On 2021/4/6 4:00 AM, Dongli Zhang wrote:
>>
>> On 4/1/21 8:47 PM, Jason Wang wrote:
>>> On 2021/3/30 3:29 PM, Dongli Zhang wrote:
>>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>>> On 2021/3/27 5:16 AM, Dongli Zhang wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>>> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>>
>>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>>
>>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>>>>> or legacy IRQ).
>>>>>>>>
>>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device
>>>>>>>> (e.g.,
>>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>>
>>>>>>>>
>>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>>
>>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>>>
>>>>>>>>
>>>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>>
>>>>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>>>>> struct eventfd_ctx {
>>>>>>>>       kref = {
>>>>>>>>         refcount = {
>>>>>>>>           refs = {
>>>>>>>>             counter = 4
>>>>>>>>           }
>>>>>>>>         }
>>>>>>>>       },
>>>>>>>>       wqh = {
>>>>>>>>         lock = {
>>>>>>>>           {
>>>>>>>>             rlock = {
>>>>>>>>               raw_lock = {
>>>>>>>>                 val = {
>>>>>>>>                   counter = 0
>>>>>>>>                 }
>>>>>>>>               }
>>>>>>>>             }
>>>>>>>>           }
>>>>>>>>         },
>>>>>>>>         head = {
>>>>>>>>           next = 0xffff8f841dc08e18,
>>>>>>>>           prev = 0xffff8f841dc08e18
>>>>>>>>         }
>>>>>>>>       },
>>>>>>>>       count = 15, ---> eventfd is 15 !!!
>>>>>>>>       flags = 526336
>>>>>>>> }
>>>>>>>>
>>>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>>>> with this interface.
>>>>>>>>
>>>>>>>> { "execute": "x-debug-device-event",
>>>>>>>>       "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>>>                      "event": "kick", "queue": 2 } }
>>>>>>>>
>>>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>>>> indicates something bad is in software that the 'kick' is lost.
>>>>>>> What do you mean by "software" here? And it looks to me you're testing whether
>>>>>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>>>>>> sure how much value we could gain from a dedicated debug interface like this,
>>>>>>> considering there are a lot of existing general-purpose debugging methods like
>>>>>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>>>>>> event_notifier_set() is only a very small fraction of the process of virtqueue
>>>>>>> kick, which is unlikely to be buggy. Considering the ioeventfd is usually
>>>>>>> offloaded to KVM, there is more of a chance that something is wrong in setting
>>>>>>> up the ioeventfd instead of here. Irq is even more complicated.
>>>>>> Thank you very much!
>>>>>>
>>>>>> I am not testing whether event_notifier_set() is called by
>>>>>> virtio_queue_notify().
>>>>>>
>>>>>> The 'software' indicates the data processing and event notification mechanism
>>>>>> involved with virtio/vhost PV driver frontend. E.g., while VM is waiting
>>>>>> for an
>>>>>> extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event()
>>>>>> erroneously returns false due to corrupted ring buffer status.
>>>>> So there could be several factors that may block the notification:
>>>>>
>>>>> 1) eventfd bug (ioeventfd vs irqfd)
>>>>> 2) wrong virtqueue state (either driver or device)
>>>>> 3) missing barriers (either driver or device)
>>>>> 4) Qemu bug (irqchip and routing)
>>>>> ...
>>>> This is not only about whether notification is blocked.
>>>>
>>>> It can also be used to help narrow down and understand if there is any
>>>> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
>>>> drivers following virtio spec. It is closely related to many of other kernel
>>>> components.
>>>>
>>>> Suppose IO was recovered after we inject an IRQ to vhost-scsi on purpose, we
>>>> will be able to analyze what may happen along the IO completion path starting
>>>> from when/where the IRQ is injected ... perhaps the root cause is not with
>>>> virtio but blk-mq/scsi (this is just an example).
>>>>
>>>>
>>>> In addition, this idea should help for vfio-pci. Suppose the developer for a
>>>> specific device driver suspects IO/networking hangs because of loss of IRQ, we
>>>> will be able to verify if that assumption is correct by injecting an IRQ on
>>>> purpose.
>>>>
>>>> Therefore, this is not only about virtio PV driver (frontend/backend), but also
>>>> used to help analyze the issue related to entire IO/networking/passthrough
>>>> virtualization stacks, especially in production env where the issue can only be
>>>> reproduced randomly.
>>>
>>> So it looks to me you'd better have things like this in the EventNotifier
>>> layer and introduce qmp commands to write/read that, instead of starting with
>>> virtio-specific commands. Or it might be even more helpful to start from some
>>> counters for event notifiers which could be accessed via the monitor to help
>>> with debugging, which is much safer in a production environment. Having
>>> artificial events is always dangerous.
>>
>> The EventNotifier is just an fd used by different QEMU components. There is no
>> way to track each EventNotifier used by a QEMU process, so I do not think we
>> are able to track at the EventNotifier layer, unless we add extra code to track
>> the init/uninit of each eventfd, or modify the kernel.
>>
>> That's why I introduced "DeviceEvent event" to "struct DeviceClass" so that each
>> device type will be able to customize its own way to track its eventfd list.
>>
>>
>> Would you prefer "write to a specific eventfd for a specific QEMU device",
>> instead of "kick/call for a specific device"?
> 
> 
> It might be better. But note that eventfd is Linux-specific; that's why we need
> to do it at a higher level (the EventNotifier level, to make it work for e.g.
> Windows).
> 
> And it might be even better to start with reading the counters.

Is it possible to read from an eventfd without modifying the kernel?

QEMU has only event_notifier_test_and_clear(). According to the kernel code,
reading from an eventfd will decrease ctx->count, as at line 190.

185 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
186 {
187         lockdep_assert_held(&ctx->wqh.lock);
188
189         *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
190         ctx->count -= *cnt;
191 }

Can I assume it is not appropriate to read from eventfd?
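 
A quick userspace demonstration of why reading is destructive, and therefore
cannot sample the counter non-intrusively (a standalone sketch, not QEMU code):

#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    eventfd_t val;
    int fd = eventfd(0, EFD_NONBLOCK);   /* non-semaphore mode */

    eventfd_write(fd, 15);               /* simulate 15 pending kicks */
    eventfd_read(fd, &val);              /* returns 15 and zeroes the count */
    printf("first read: %llu\n", (unsigned long long)val);
    if (eventfd_read(fd, &val) < 0) {
        perror("second read");           /* EAGAIN: the count was consumed */
    }
    close(fd);
    return 0;
}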

> 
> 
>>
>>
>>>
>>>>> Considering we want to debug a virtio issue, only 2) or 3) is what we really care about.
>>>>> So for kick you did (assume vhost is on):
>>>>>
>>>>> virtio_device_event_kick()
>>>>>       virtio_queue_notify()
>>>>>           event_notifier_set()
>>>>>
>>>>> It looks to me you're actually testing if ioeventfd is correctly set by Qemu.
>>>>>
>>>>> For call you did:
>>>>>
>>>>> virtio_device_event_call()
>>>>>       event_notifier_set()
>>>>>
>>>>> A test of whether irqfd is correctly set by Qemu. So all of those are not
>>>>> virtio-specific stuffs, but you introduce a virtio-specific command to debug
>>>>> non-virtio functions.
>>>>>
>>>>> In the case of what you mentioned for vring_need_event(), what we really want
>>>>> is to dump the virtqueue state from the guest. This might require some work on
>>>>> extending the virtio spec (e.g. to dump device status like indices, event, wrap
>>>>> counters).
>>>> Suppose the issue is only randomly reproducible in production env, we should
>>>> always take 4) into consideration because we will not be able to know where is
>>>> the root cause at the very beginning of bug analysis.
>>>
>>> So if it turns out to be an issue of the irqchip, how will you do the debugging
>>> then? I guess what's really helpful is a qmp command to dump irqchip
>>> status/information.
>> Thank you very much for suggestion. That will be a different problem and we may
>> consider as future work.
>>
>> This patchset is about introducing changes/events to help narrow down where the
>> root cause may be, in order to facilitate diagnosis (especially for prod env
>> issues that are not easy to reproduce).
>>
>>>
>>>>>> This was initially proposed for vhost only and I was going to export
>>>>>> ioeventfd/irqfd from vhost to admin via sysfs. Finally, I realized I would
>>>>>> better implement this at QEMU.
>>>>>>
>>>>>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>>>>>> KVM/vhost. The VM side sends requests to ring buffer and kicks the doorbell
>>>>>> (via
>>>>>> ioeventfd), while the backend vhost side sends responses back and calls
>>>>>> the IRQ
>>>>>> (via ioeventfd).
>>>>>>
>>>>>> Unfortunately, sometimes there is issue with virtio/vhost so that
>>>>>> kick/call was
>>>>>> missed/ignored, or even never triggered. The example mentioned in the
>>>>>> patchset
>>>>>> cover letter is with virtio-net (I assume vhost=on), where a kick to ioeventfd
>>>>>> was ignored, due to pci-bridge/hotplug issue.
>>>>> So this is not a good example since it was a chipset bug. You need to use
>>>>> another tool to debug chipset code, don't you?
>>>> As this issue is reproducible only randomly, we will not be able to realize it
>>>> is a chipset bug at the very beginning.
>>>>
>>>> While the link is about vhost-net, it is applicable to vhost-scsi as well.
>>>> Suppose DEBUG_UNASSIGNED is not enabled, the developer will need to investigate
>>>> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
>>>> and pci-hotplug, in order to identify the root cause.
>>>>
>>>> The "call/kick" interface is used to narrow down and verify the analysis,
>>>> especially when many kernel components are involved.
>>>>
>>>>>> The hotplug window is very small but the IO hangs permanently. I did test
>>>>>> that kicking the doorbell again helps recover the IO, so that I was able to
>>>>>> conclude this was due to loss of the kick.
>>>>>>
>>>>>> The loss of irq/doorbell is painful especially in production environment
>>>>>> where
>>>>>> we are not able to attach to QEMU via gdb. While the patchset is only for
>>>>>> QEMU,
>>>>>> Xen PV driver used to experience loss of IRQ issue as well, e.g., linux
>>>>>> kernel
>>>>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU
>>>>>> binding").
>>>>> So looking at the git history we can see it is unlikely that the missed event
>>>>> is due to virtio/vhost itself. So the commit you mention here is not a good
>>>>> example either, since it's not a netfront/netbackend bug.
>>>> I mentioned the xen issue just to explain that the loss of event/irq issue may
>>>> happen and is very painful. Another xen example is (equivalent to KVM VFIO):
>>>>
>>>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7
>>>>
>>>
>>>
>>> Sorry, I can't figure out how this is related to VFIO or virtio. It should be
>>> reproducible for devices without using eventfd?
>>>
>> Yes, although not involving eventfd, other drivers/virtualization may encounter
>> the loss of irq/kick as well. There is no relation between xen and vfio/virtio.
>>
>> That's why a diagnostic interface is appreciated.
>>
>> In my opinion, the 'diagnostic' is not only to collect data,
> 
> 
> Usually, collecting the data is the first step :)
> 
> 
>>   but also to
>> introduce event/change (e.g., inject IRQ) and then monitor/observe what will
>> happen to the stalled VM.
> 
> 
> It might be helpful yes, but it's also very dangerous.
> 
> 
>>
>>>> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
>>>> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>>>>
>>>>> So for the case of event call, what you did is:
>>>>>
>>>>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>>>>                                        Error **errp)
>>>>> {
>>>>> #ifdef DEBUG_VIRTIO_EVENT
>>>>>       printf("The 'call' event is triggered for path=%s, queue=%d,
>>>>> irqfd=%d.\n",
>>>>>              object_get_canonical_path(OBJECT(vq->vdev)),
>>>>>              vq->queue_index, eventfd);
>>>>> #endif
>>>>>
>>>>>       if (eventfd) {
>>>>>           virtio_set_isr(vq->vdev, 0x1);
>>>>>           event_notifier_set(&vq->guest_notifier);
>>>>>       } else {
>>>>>           virtio_irq(vq);
>>>>>       }
>>>>> }
>>>>>
>>>>> This means, when eventfd is set, you bypass the MSI mask, which is very
>>>>> dangerous to use in a production environment. And if you check masking, it
>>>>> won't help a lot if the MSI is masked wrongly.
>>>> You are right.
>>>>
>>>> Only the vhost-net is dangerous because it masks a vector by registering an
>>>> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi
>>>> will
>>>> un-register the guest notifier.
>>>>
>>>>>> This can help "narrow down" if the IO/networking hang is due to loss of
>>>>>> IRQ/doorbell issue (or VM MSI-x is erroneously masked), especially in
>>>>>> production
>>>>>> env. This can also be used as a workaround so that VM owner will not need to
>>>>>> reboot VM.
>>>>> So having such an extra workaround is pretty dangerous in a production environment
>>>>> where I think we need to be conservative which means we need to collect
>>>>> information instead of generating artificial event.
>>>>>
>>>>> And it doesn't help if the workaround can be triggered through the management API.
>>>> I agree with this.
>>>>
>>>> This depends on the administrator. This workaround should only be used in very
>>>> limited and special cases.
>>>>
>>>>>> In addition, the VFIO will benefit from it. We will be able to test whether
>>>>>> injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet)
>>>>>> developers blame the hang on loss of IRQ with KVM/VFIO. (It seems there is more
>>>>>> chance to lose an IRQ during CPU hotplug or when changing IRQ affinity).
>>>>>>
>>>>>>> I think we could not gain much from introducing a dedicated mechanism for
>>>>>>> such a corner case.
>>>>>> As replied by Dave for prior RFC, the QEMU already supports
>>>>>> hmp_ioport_write to
>>>>>> trigger an ioport write on purpose.
>>>>> If that applies, I would rather have a hmp_mem_write; then we can test both MSI
>>>>> and doorbell. But again, they are very dangerous to be used in a production
>>>>> environment.
>>>> This is just not convenient for a production env administrator. We will need to
>>>> first obtain the virtio pci info (e.g., via "lspci -vv"), and then prepare the
>>>> command after calculating the address of the doorbell.
>>>>
>>>> Something bad may happen if the developer did not calculate the address
>>>> correctly.
>>>
>>> It won't be worse than hmp_ioport_write I think?
>> I always believe it is worth adding hmp_mem_write().
>>
>>> While it won't be worse than hmp_ioport_write(), in my opinion, it is not as
>>> easy/convenient as writing to the eventfd.
>>
>>>
>>>> It should be much easier for the developer to just ask the administrator to
>>>> "call" queue X for a specific virtio device.
>>>
>>> We can have commands like "info virtio" which can show all the MSI/doorbell
>>> information for the user. Or limit its use to virtio and vfio devices only to
>>> avoid unexpected results.
>> So far the method by this patchset is to introduce "DeviceEvent event" to
>> "struct DeviceClass".
>>
>> Only virtio-pci-xxx and vfio (future work) will implement this interface.
>>
>>
>>>
>>>>>> The linux block layer also supports the below to kick the IO queue on
>>>>>> purpose.
>>>>>>
>>>>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>>>> This might be fine for a hardware device but not virtio. The device can choose
>>>>> to poll the virtqueue instead of depending on the doorbell to work. And for the
>>>>> networking subsystem, we don't have such stuff; instead, ethtool supports
>>>>> dumping rings and vendor-specific stuff which could be used for dumping
>>>>> virtqueue state in this case.
>>>> This is just another example to help explain the philosophy behind the
>>>> "kick/call" idea: sometimes triggering the event on purpose will help us narrow
>>>> down and verify our analysis of a kernel bug, especially a bug that is only
>>>> randomly reproducible in a production environment.
>>>>
>>>>
>>>> I understand it is possibly not proper to introduce such an interface to QEMU.
>>>> That's why I sent out the RFC first.
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>>
>>>>
>>>> In my opinion, this interface is pretty useful when the diagnosis involves many
>>>> kernel components, or when developers from different components are working on
>>>> the same bug, no matter whether the root cause is in virtio or not.
>>>
>>> So for virtio, it's not hard to trigger events without such an interface. E.g.,
>>> for networking you can generate some traffic and trace wherever you suspect
>>> the event (kick/call) could be blocked.
>> Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and
>> its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be
>> full as well. I do not have a way to generate traffic for the RX path in such a
>> situation.
> 
> 
> Right, but as discussed, we need an interface to dump virtqueue state; that
> would be a very easy place to start.
> 
> 
>>
>>> I still prefer hmp_mem_write()/read() which looks more generic; at the same
>>> time, we can add more debug information like:
>>>
>>> 1) statistics like eventfd counters
>>> 2) device information, register layout, doorbell/MSI-X information
>>> 3) irqchip information
>> Would you mind helping with the questions below?
>>
>> 1. Regardless of whether it is kick/call or hmp_mem_write(), is it safe to add
>> such interfaces? I think it is safe because:
>>
>> (1) This affects only specific VM (QEMU), not all/others.
>>
>> (2) It is dangerous only when sysadmin triggers the events on purpose. If this
>> interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
>> inject uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash kernel)
>> will be dangerous as well.
>>
>> (3) While this is implemented only for vhost-scsi-pci and vhost-vhost-pci, I do
>> not see an issue for the host kernel. It would be a security bug if
>> reading/writing an eventfd from userspace could crash the kernel.
>>
>> (4) We primarily use this interface when the VM is running into an issue
>> (unless we use it as a workaround).
> 
> 
> Besides the above, I think it's only "safe" if we clearly define the semantics
> of this command. E.g.:
> 
> 1) Does it work at EventNotifier (eventfd) level or virtio/vfio level?

I am still confused about the difference between the EventNotifier level and
the virtio/vfio level.

There is no global mechanism to track the EventNotifiers used by each device.
We will still need a per-device-type interface to dump the EventNotifiers of
each device.

Please see more below.

> 2) Can it bypass the masking?

This is a good question and please see below for the answer.

> 
> 
>>
>>
>> 2. Is it fine to add such an interface to upstream QEMU, or is such an
>> interface unsuitable for upstream, so that it should be added only when QEMU is
>> customized for a specific product's usage?
> 
> 
> We can listen to the comments from other experts on the list. But I think having
> a generic trigger at the EventNotifier level should be ok.

Would you mind confirming whether this is what you are looking for?

To dump the EventNotifiers for a specific device:

(qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 dump

... list all event-notifier related to this device ...


Write to a specific EventNotifier. The id is from the prior dump.

(qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 write <dump id>

... print which event-notifier is written to ...


This will answer your question of whether it can bypass the masking.

For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
is not able to bypass masking, because masking unregisters the eventfd, so the
write does not take effect.

However, it is possible to bypass masking for vhost-net, because vhost-net
registers a specific masked_notifier eventfd in order to mask the irq. Writing
to the original eventfd still takes effect.

We may leave it to the user to decide whether to write to 'masked_notifier' or
to the original 'guest_notifier' for vhost-net.
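 
For the "write" direction, a minimal sketch of what a per-queue handler could
do, assuming the monitor command has resolved the device to a VirtIODevice (the
function name is hypothetical; the accessors and event_notifier_set() are
existing QEMU APIs):

#include "qemu/osdep.h"
#include "hw/virtio/virtio.h"

static void debug_write_guest_notifier(VirtIODevice *vdev, int queue)
{
    VirtQueue *vq = virtio_get_queue(vdev, queue);

    /* Signals the irqfd ("call"). Per the analysis above, this cannot
     * bypass masking for devices that unregister the notifier on mask,
     * but it can for vhost-net's masked_notifier scheme. */
    event_notifier_set(virtio_queue_get_guest_notifier(vq));
}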

Thank you very much!

Dongli Zhang


> 
> Thanks
> 
> 
>>
>>
>> If it is fine to add such interfaces, we may then discuss the how, e.g.,
>> hmp_mem_write() vs. kick/call.
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-06  8:43               ` Dongli Zhang
@ 2021-04-06 23:27                 ` Dongli Zhang
  2021-04-07  2:20                   ` Jason Wang
  2021-04-07  2:18                 ` Jason Wang
  1 sibling, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-04-06 23:27 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz



On 4/6/21 1:43 AM, Dongli Zhang wrote:
> 
> 
> On 4/5/21 6:55 PM, Jason Wang wrote:
>>
>> On 2021/4/6 4:00 AM, Dongli Zhang wrote:
>>>
>>> On 4/1/21 8:47 PM, Jason Wang wrote:
>>>> On 2021/3/30 3:29 PM, Dongli Zhang wrote:
>>>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>>>> On 2021/3/27 5:16 AM, Dongli Zhang wrote:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>>>> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>>>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>>>
>>>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>>>
>>>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>>>>>> or legacy IRQ).
>>>>>>>>>
>>>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device
>>>>>>>>> (e.g.,
>>>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>>>
>>>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>>>
>>>>>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>>>>>> struct eventfd_ctx {
>>>>>>>>>       kref = {
>>>>>>>>>         refcount = {
>>>>>>>>>           refs = {
>>>>>>>>>             counter = 4
>>>>>>>>>           }
>>>>>>>>>         }
>>>>>>>>>       },
>>>>>>>>>       wqh = {
>>>>>>>>>         lock = {
>>>>>>>>>           {
>>>>>>>>>             rlock = {
>>>>>>>>>               raw_lock = {
>>>>>>>>>                 val = {
>>>>>>>>>                   counter = 0
>>>>>>>>>                 }
>>>>>>>>>               }
>>>>>>>>>             }
>>>>>>>>>           }
>>>>>>>>>         },
>>>>>>>>>         head = {
>>>>>>>>>           next = 0xffff8f841dc08e18,
>>>>>>>>>           prev = 0xffff8f841dc08e18
>>>>>>>>>         }
>>>>>>>>>       },
>>>>>>>>>       count = 15, ---> eventfd is 15 !!!
>>>>>>>>>       flags = 526336
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>>>>> with this interface.
>>>>>>>>>
>>>>>>>>> { "execute": "x-debug-device-event",
>>>>>>>>>       "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>>>>                      "event": "kick", "queue": 2 } }
>>>>>>>>>
>>>>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>>>>> indicates something bad is in software that the 'kick' is lost.
>>>>>>>> What do you mean by "software" here? And it looks to me you're testing whether
>>>>>>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>>>>>>> sure how much value we could gain from a dedicated debug interface like this,
>>>>>>>> considering there are a lot of existing general-purpose debugging methods like
>>>>>>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>>>>>>> event_notifier_set() is only a very small fraction of the process of virtqueue
>>>>>>>> kick, which is unlikely to be buggy. Considering the ioeventfd is usually
>>>>>>>> offloaded to KVM, there is more of a chance that something is wrong in setting
>>>>>>>> up the ioeventfd instead of here. Irq is even more complicated.
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> I am not testing whether event_notifier_set() is called by
>>>>>>> virtio_queue_notify().
>>>>>>>
>>>>>>> The 'software' indicates the data processing and event notification mechanism
>>>>>>> involved with virtio/vhost PV driver frontend. E.g., while VM is waiting
>>>>>>> for an
>>>>>>> extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event()
>>>>>>> erroneously returns false due to corrupted ring buffer status.
>>>>>> So there could be several factors that may block the notification:
>>>>>>
>>>>>> 1) eventfd bug (ioeventfd vs irqfd)
>>>>>> 2) wrong virtqueue state (either driver or device)
>>>>>> 3) missing barriers (either driver or device)
>>>>>> 4) Qemu bug (irqchip and routing)
>>>>>> ...
>>>>> This is not only about whether notification is blocked.
>>>>>
>>>>> It can also be used to help narrow down and understand if there is any
>>>>> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only
>>>>> drivers following virtio spec. It is closely related to many of other kernel
>>>>> components.
>>>>>
>>>>> Suppose IO was recovered after we inject an IRQ to vhost-scsi on purpose, we
>>>>> will be able to analyze what may happen along the IO completion path starting
>>>>> from when/where the IRQ is injected ... perhaps the root cause is not with
>>>>> virtio but blk-mq/scsi (this is just an example).
>>>>>
>>>>>
>>>>> In addition, this idea should help for vfio-pci. Suppose the developer for a
>>>>> specific device driver suspects IO/networking hangs because of loss of IRQ, we
>>>>> will be able to verify if that assumption is correct by injecting an IRQ on
>>>>> purpose.
>>>>>
>>>>> Therefore, this is not only about virtio PV driver (frontend/backend), but also
>>>>> used to help analyze the issue related to entire IO/networking/passthrough
>>>>> virtualization stacks, especially in production env where the issue can only be
>>>>> reproduced randomly.
>>>>
>>>> So it looks to me you'd better have things like this in the EventNotifier
>>>> layer and introduce qmp commands to write/read that, instead of starting with
>>>> virtio-specific commands. Or it might be even more helpful to start from some
>>>> counters for event notifiers which could be accessed via the monitor to help
>>>> with debugging, which is much safer in a production environment. Having
>>>> artificial events is always dangerous.
>>>
>>> The EventNotifier is just an fd used by different QEMU components. There is no
>>> way to track each EventNotifier used by a QEMU process, so I do not think we
>>> are able to track at the EventNotifier layer, unless we add extra code to track
>>> the init/uninit of each eventfd, or modify the kernel.
>>>
>>> That's why I introduced "DeviceEvent event" to "struct DeviceClass" so that each
>>> device type will be able to customize its own way to track its eventfd list.
>>>
>>>
>>> Would you prefer "write to a specific eventfd for a specific QEMU device",
>>> instead of "kick/call for a specific device"?
>>
>>
>> It might be better. But note that eventfd is Linux-specific; that's why we need
>> to do it at a higher level (the EventNotifier level, to make it work for e.g.
>> Windows).
>>
>> And it might be even better to start with reading the counters.
> 
> Is it possible to read from an eventfd without modifying the kernel?
> 
> QEMU has only event_notifier_test_and_clear(). According to the kernel code,
> reading from an eventfd will decrease ctx->count, as at line 190.
> 
> 185 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
> 186 {
> 187         lockdep_assert_held(&ctx->wqh.lock);
> 188
> 189         *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> 190         ctx->count -= *cnt;
> 191 }
> 
> Can I assume it is not appropriate to read from eventfd?
> 
>>
>>
>>>
>>>
>>>>
>>>>>> Considering we want to debug a virtio issue, only 2) or 3) is what we really care about.
>>>>>> So for kick you did (assume vhost is on):
>>>>>>
>>>>>> virtio_device_event_kick()
>>>>>>       virtio_queue_notify()
>>>>>>           event_notifier_set()
>>>>>>
>>>>>> It looks to me you're actually testing if ioeventfd is correctly set by Qemu.
>>>>>>
>>>>>> For call you did:
>>>>>>
>>>>>> virtio_device_event_call()
>>>>>>       event_notifier_set()
>>>>>>
>>>>>> A test of whether irqfd is correctly set by Qemu. So all of those are not
>>>>>> virtio-specific stuffs, but you introduce a virtio-specific command to debug
>>>>>> non-virtio functions.
>>>>>>
>>>>>> In the case of what you mentioned for vring_need_event(), what we really want
>>>>>> is to dump the virtqueue state from the guest. This might require some work on
>>>>>> extending the virtio spec (e.g. to dump device status like indices, event, wrap
>>>>>> counters).
>>>>> Suppose the issue is only randomly reproducible in production env, we should
>>>>> always take 4) into consideration because we will not be able to know where is
>>>>> the root cause at the very beginning of bug analysis.
>>>>
>>>> So if it turns out to be an issue of the irqchip, how will you do the debugging
>>>> then? I guess what's really helpful is a qmp command to dump irqchip
>>>> status/information.
>>> Thank you very much for suggestion. That will be a different problem and we may
>>> consider as future work.
>>>
>>> This patchset is about introducing changes/events to help narrow down where the
>>> root cause may be, in order to facilitate diagnosis (especially for prod env
>>> issues that are not easy to reproduce).
>>>
>>>>
>>>>>>> This was initially proposed for vhost only and I was going to export
>>>>>>> ioeventfd/irqfd from vhost to admin via sysfs. Finally, I realized I would
>>>>>>> better implement this at QEMU.
>>>>>>>
>>>>>>> The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
>>>>>>> KVM/vhost. The VM side sends requests to ring buffer and kicks the doorbell
>>>>>>> (via
>>>>>>> ioeventfd), while the backend vhost side sends responses back and calls
>>>>>>> the IRQ
>>>>>>> (via ioeventfd).
>>>>>>>
>>>>>>> Unfortunately, sometimes there is issue with virtio/vhost so that
>>>>>>> kick/call was
>>>>>>> missed/ignored, or even never triggered. The example mentioned in the
>>>>>>> patchset
>>>>>>> cover letter is with virtio-net (I assume vhost=on), where a kick to ioeventfd
>>>>>>> was ignored, due to pci-bridge/hotplug issue.
>>>>>> So this is not a good example since it was a chipset bug. You need to use
>>>>>> another tool to debug chipset code, don't you?
>>>>> As this issue is reproducible only randomly, we will not be able to realize it
>>>>> is a chipset bug at the very beginning.
>>>>>
>>>>> While the link is about vhost-net, it is applicable to vhost-scsi as well.
>>>>> Suppose DEBUG_UNASSIGNED is not enabled, the developer will need to investigate
>>>>> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
>>>>> and pci-hotplug, in order to identify the root cause.
>>>>>
>>>>> The "call/kick" interface is used to narrow down and verify the analysis,
>>>>> especially when many kernel components are involved.
>>>>>
>>>>>>> The hotplug window is very small but the IO hangs permanently. I did test
>>>>>>> that kicking the doorbell again helps recover the IO, so that I was able to
>>>>>>> conclude this was due to loss of the kick.
>>>>>>>
>>>>>>> The loss of irq/doorbell is painful especially in production environment
>>>>>>> where
>>>>>>> we are not able to attach to QEMU via gdb. While the patchset is only for
>>>>>>> QEMU,
>>>>>>> Xen PV driver used to experience loss of IRQ issue as well, e.g., linux
>>>>>>> kernel
>>>>>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU
>>>>>>> binding").
>>>>>> So looking at the git history we can see it is unlikely that the missed event
>>>>>> is due to virtio/vhost itself. So the commit you mention here is not a good
>>>>>> example either, since it's not a netfront/netbackend bug.
>>>>> I mentioned the xen issue just to explain that the loss of event/irq issue may
>>>>> happen and is very painful. Another xen example is (equivalent to KVM VFIO):
>>>>>
>>>>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7
>>>>>
>>>>
>>>>
>>>> Sorry, I can't figure out how this is related to VFIO or virtio. It should be
>>>> reproducible for devices without using eventfd?
>>>>
>>> Yes, although not involving eventfd, other drivers/virtualization may encounter
>>> the loss of irq/kick as well. There is no relation between xen and vfio/virtio.
>>>
>>> That's why a diagnostic interface is appreciated.
>>>
>>> In my opinion, the 'diagnostic' is not only to collect data,
>>
>>
>> Usually, collecting the data is the first step :)
>>
>>
>>>   but also to
>>> introduce event/change (e.g., inject IRQ) and then monitor/observe what will
>>> happen to the stalled VM.
>>
>>
>> It might be helpful yes, but it's also very dangerous.
>>
>>
>>>
>>>>> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
>>>>> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>>>>>
>>>>>> So for the case of event call, what you did is:
>>>>>>
>>>>>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>>>>>                                        Error **errp)
>>>>>> {
>>>>>> #ifdef DEBUG_VIRTIO_EVENT
>>>>>>       printf("The 'call' event is triggered for path=%s, queue=%d,
>>>>>> irqfd=%d.\n",
>>>>>>              object_get_canonical_path(OBJECT(vq->vdev)),
>>>>>>              vq->queue_index, eventfd);
>>>>>> #endif
>>>>>>
>>>>>>       if (eventfd) {
>>>>>>           virtio_set_isr(vq->vdev, 0x1);
>>>>>>           event_notifier_set(&vq->guest_notifier);
>>>>>>       } else {
>>>>>>           virtio_irq(vq);
>>>>>>       }
>>>>>> }
>>>>>>
>>>>>> This means, when eventfd is set, you bypass the MSI mask, which is very
>>>>>> dangerous to use in a production environment. And if you
>>>>>> check masking, it won't help a lot if the MSI is masked wrongly.
>>>>> You are right.
>>>>>
>>>>> Only vhost-net is dangerous because it masks a vector by registering an
>>>>> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi
>>>>> will un-register the guest notifier.
>>>>>
>>>>>>> This can help "narrow down" if the IO/networking hang is due to loss of
>>>>>>> IRQ/doorbell issue (or VM MSI-x is erroneously masked), especially in
>>>>>>> production
>>>>>>> env. This can also be used as a workaround so that VM owner will not need to
>>>>>>> reboot VM.
>>>>>> So having such an extra workaround is pretty dangerous in a production
>>>>>> environment, where I think we need to be conservative, which means we need to
>>>>>> collect information instead of generating artificial events.
>>>>>>
>>>>>> And it doesn't help if the workaround can be triggered through a management API.
>>>>> I agree with this.
>>>>>
>>>>> This depends on the administrator. This workaround should only be used in very
>>>>> limited and special cases.
>>>>>
>>>>>>> In addition, VFIO will benefit from it. We will be able to test whether
>>>>>>> injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe,
>>>>>>> Ethernet) developers blame the hang on loss of IRQ with KVM/VFIO. (It seems
>>>>>>> there is more chance to lose an IRQ during CPU hotplug or when changing IRQ
>>>>>>> affinity.)
>>>>>>>
>>>>>>>> I think we could not gain much by introducing a dedicated mechanism for
>>>>>>>> such a corner case.
>>>>>>> As Dave replied to the prior RFC, QEMU already supports hmp_ioport_write
>>>>>>> to trigger an ioport write on purpose.
>>>>>> If that applies, I would rather have a hmp_mem_write; then we can test both
>>>>>> MSI and doorbell. But again, they are very dangerous to use in a production
>>>>>> environment.
>>>>> This is just not convenient for a production env administrator. We would
>>>>> need to first obtain the virtio pci info (e.g., via "lspci -vv"), and then
>>>>> prepare the command after calculating the doorbell address.
>>>>>
>>>>> Something bad may happen if the developer did not calculate the address
>>>>> correctly.
>>>>
>>>> It won't be worse than hmp_ioport_write I think?
>>> I always believe it is worth adding hmp_mem_write().
>>>
>>> While it won't be worse than hmp_ioport_write(), in my opinion, it is not as
>>> easy/convenient as writing to an eventfd.
>>>
>>>>
>>>>> It should be much easier for a developer to just ask the administrator to
>>>>> "call" queue X for a specific virtio device.
>>>>
>>>> We can have commands like "info virtio" which can show all the MSI/doorbell
>>>> information for the user. Or limit its use to virtio and vfio devices only to
>>>> avoid unexpected results.
>>> So far the method used by this patchset is to introduce "DeviceEvent event" to
>>> "struct DeviceClass".
>>>
>>> Only virtio-pci-xxx and vfio (future work) will implement this interface.
>>>
>>>
>>>>
>>>>>>> The linux block layer also supports the below to kick the IO queue on
>>>>>>> purpose.
>>>>>>>
>>>>>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>>>>> This might be fine for a hardware device but not virtio. The device can
>>>>>> choose to poll the virtqueue instead of depending on the doorbell to work.
>>>>>> And for the networking subsystem, we don't have such stuff; instead, ethtool
>>>>>> supports dumping rings and vendor-specific state, which could be used for
>>>>>> dumping virtqueue state in this case.
>>>>> This is just another example to help explain the philosophy behind the
>>>>> "kick/call" idea: sometimes triggering the event on purpose helps us narrow
>>>>> down and verify our analysis of a kernel bug, especially a bug that is only
>>>>> randomly reproducible in a production environment.
>>>>>
>>>>>
>>>>> I understand it is possibly not proper to introduce such an interface into
>>>>> QEMU. That's why I sent out the RFC first.
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>>>
>>>>>
>>>>> In my opinion, this interface is pretty useful when the diagnosis involves
>>>>> many kernel components, or when developers from different components are
>>>>> working on the same bug, no matter whether the root cause is in virtio or not.
>>>>
>>>> So for virtio, it's not hard to trigger events without those interfaces. E.g.,
>>>> for networking you can generate some traffic and trace wherever you suspect
>>>> the event (kick/call) could be blocked.
>>> Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and
>>> its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be
>>> full as well. I do not have a way to generate traffic for the RX path in such a
>>> situation.
>>
>>
>> Right, but as discussed, we need an interface to dump virtqueue state; that
>> would be a very easy place to start.
>>
>>
>>>
>>>> I still prefer hmp_mem_write()/read(), which looks more generic; at the same
>>>> time, we can add more debug information like:
>>>>
>>>> 1) statistics like eventfd counters
>>>> 2) device information, register layout, doorbell/MSI-X information
>>>> 3) irqchip information
>>> Would you mind helping with the questions below?
>>>
>>> 1. Regardless of kick/call or hmp_mem_write(), is it safe to add such
>>> interfaces? I think it is safe because:
>>>
>>> (1) This affects only a specific VM (QEMU), not all/others.
>>>
>>> (2) It is dangerous only when the sysadmin triggers the events on purpose. If this
>>> interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
>>> inject uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash kernel)
>>> will be dangerous as well.
>>>
>>> (3) While this is implemented for only vhost-scsi-pci and vhost-vhost-pci, I do
>>> not see an issue for the host kernel. It would be a security bug if
>>> reading/writing an eventfd from userspace could crash kernel space.
>>>
>>> (4) We primarily use this interface when the VM is running into an issue
>>> (unless we use it as a workaround).
>>
>>
>> Besides the above, I think it's only "safe" if we clearly define the semantics of
>> this command. E.g:
>>
>> 1) Does it work at EventNotifier (eventfd) level or virtio/vfio level?
> 
> I am still confused about the difference between the EventNotifier level and
> the virtio/vfio level.
> 
> There is no global mechanism to track the EventNotifiers used by each device.
> We will still need a per-device-type interface to dump the EventNotifiers for
> each device.
> 
> Please see more below.
> 
>> 2) Can it bypass the masking?
> 
> This is a good question and please see below for the answer.
> 
>>
>>
>>>
>>>
>>> 2. Is it fine to add such an interface to QEMU upstream, or is such an
>>> interface not good for upstream so that it should be added only when QEMU is
>>> customized for specific products' usage?
>>
>>
>> We can listen to the comments from other experts on the list. But I think
>> having a generic trigger at the EventNotifier level should be ok.
> 
> Would you mind sharing and confirming whether this is what you are looking for?
> 
> To dump the EventNotifiers for each device.
> 
> (qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 dump
> 
> ... list all event-notifier related to this device ...
> 
> 
> Write to a specific EventNotifier. The id is from the prior dump.
> 
> (qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 write <dump id>
> 
> ... print which event-notifier is written to ...
> 
> 
> This will answer your question that "Can it bypass the masking?".
> 
> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
> cannot bypass masking because masking unregisters the eventfd. A write to the
> eventfd does not take effect.
> 
> However, it is possible to bypass masking for vhost-net because vhost-net
> registers a specific masked_notifier eventfd in order to mask the irq. A write
> to the original eventfd still takes effect.
> 
> We may leave it to the user to decide whether to write to the 'masked_notifier'
> or the original 'guest_notifier' for vhost-net.

My fault here. A write to the masked_notifier will always be masked :(

If it is at the EventNotifier level, we will not care whether the EventNotifier
is masked or not. It just provides an interface to write to the EventNotifier.
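
A minimal sketch of what such an EventNotifier-level write could boil down to,
reusing QEMU's existing event_notifier_set() helper (the function name and the
per-device lookup of the notifier are hypothetical, not part of this patchset):

static void debug_event_notifier_write(EventNotifier *notifier, Error **errp)
{
    /* Signal the notifier directly; any virtio/vfio-level masking that works
     * by unregistering the notifier is intentionally not consulted here. */
    if (event_notifier_set(notifier) < 0) {
        error_setg(errp, "failed to signal the event notifier");
    }
}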

Dumping the MSI-X table for both virtio and vfio will help confirm whether the
vector is masked.

Thank you very much!

Dongli Zhang

> 
> Thank you very much!
> 
> Dongli Zhang
> 
> 
>>
>> Thanks
>>
>>
>>>
>>>
>>> We may discuss the 'how', e.g., hmp_mem_write() vs. kick/call, if it is fine to add
>>> such interfaces.
>>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>>>
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-06  8:43               ` Dongli Zhang
  2021-04-06 23:27                 ` Dongli Zhang
@ 2021-04-07  2:18                 ` Jason Wang
  1 sibling, 0 replies; 22+ messages in thread
From: Jason Wang @ 2021-04-07  2:18 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


On 2021/4/6 4:43 PM, Dongli Zhang wrote:
>
> On 4/5/21 6:55 PM, Jason Wang wrote:
>> On 2021/4/6 4:00 AM, Dongli Zhang wrote:
>>> On 4/1/21 8:47 PM, Jason Wang wrote:
>>>> On 2021/3/30 3:29 PM, Dongli Zhang wrote:
>>>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>>>> On 2021/3/27 5:16 AM, Dongli Zhang wrote:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>>>> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>>>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>>>
>>>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>>>>>>
>>>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>>>
>>>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>>>>>>> or legacy IRQ).
>>>>>>>>>
>>>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device
>>>>>>>>> (e.g.,
>>>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>>>
>>>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>>>
>>>>>>>>> crash> eventfd_ctx ffff8f67f6bbe700
>>>>>>>>> struct eventfd_ctx {
>>>>>>>>>        kref = {
>>>>>>>>>          refcount = {
>>>>>>>>>            refs = {
>>>>>>>>>              counter = 4
>>>>>>>>>            }
>>>>>>>>>          }
>>>>>>>>>        },
>>>>>>>>>        wqh = {
>>>>>>>>>          lock = {
>>>>>>>>>            {
>>>>>>>>>              rlock = {
>>>>>>>>>                raw_lock = {
>>>>>>>>>                  val = {
>>>>>>>>>                    counter = 0
>>>>>>>>>                  }
>>>>>>>>>                }
>>>>>>>>>              }
>>>>>>>>>            }
>>>>>>>>>          },
>>>>>>>>>          head = {
>>>>>>>>>            next = 0xffff8f841dc08e18,
>>>>>>>>>            prev = 0xffff8f841dc08e18
>>>>>>>>>          }
>>>>>>>>>        },
>>>>>>>>>        count = 15, ---> eventfd is 15 !!!
>>>>>>>>>        flags = 526336
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>>>>> with this interface.
>>>>>>>>>
>>>>>>>>> { "execute": "x-debug-device-event",
>>>>>>>>>        "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>>>>                       "event": "kick", "queue": 2 } }
>>>>>>>>>
>>>>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>>>>> indicates something bad is in software that the 'kick' is lost.
>>>>>>>> What do you mean by "software" here? And it looks to me you're testing
>>>>>>>> whether event_notifier_set() is called by virtio_queue_notify() here. If
>>>>>>>> so, I'm not sure how much value we could gain from a dedicated debug
>>>>>>>> interface like this, considering there are a lot of existing
>>>>>>>> general-purpose debugging methods like tracing or gdb. I'd say the path
>>>>>>>> from virtio_queue_notify() to event_notifier_set() is only a very small
>>>>>>>> fraction of the process of a virtqueue kick, which is unlikely to be buggy.
>>>>>>>> Considering the ioeventfd will usually be offloaded to KVM, there's more of
>>>>>>>> a chance that something is wrong in setting up the ioeventfd instead of
>>>>>>>> here. Irq is even more complicated.
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> I am not testing whether event_notifier_set() is called by
>>>>>>> virtio_queue_notify().
>>>>>>>
>>>>>>> The 'software' indicates the data processing and event notification
>>>>>>> mechanism involved with the virtio/vhost PV driver frontend. E.g., while the
>>>>>>> VM is waiting for an extra IRQ, the vhost side did not trigger the IRQ;
>>>>>>> suppose vring_need_event() erroneously returns false due to corrupted ring
>>>>>>> buffer status.
>>>>>> So there could be several factors that may block the notification:
>>>>>>
>>>>>> 1) eventfd bug (ioeventfd vs irqfd)
>>>>>> 2) wrong virtqueue state (either driver or device)
>>>>>> 3) missing barriers (either driver or device)
>>>>>> 4) Qemu bug (irqchip and routing)
>>>>>> ...
>>>>> This is not only about whether notification is blocked.
>>>>>
>>>>> It can also be used to help narrow down and understand if there is any
>>>>> suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not just
>>>>> drivers following the virtio spec; they are closely related to many other
>>>>> kernel components.
>>>>>
>>>>> Suppose IO was recovered after we inject an IRQ to vhost-scsi on purpose, we
>>>>> will be able to analyze what may happen along the IO completion path starting
>>>>> from when /where the IRQ is injected ... perhaps the root cause is not with
>>>>> virtio but blk-mq/scsi (this is just an example).
>>>>>
>>>>>
>>>>> In addition, this idea should help for vfio-pci. Suppose the developer for a
>>>>> specific device driver suspects IO/networking hangs because of loss of IRQ, we
>>>>> will be able to verify if that assumption is correct by injecting an IRQ on
>>>>> purpose.
>>>>>
>>>>> Therefore, this is not only about the virtio PV driver (frontend/backend),
>>>>> but also used to help analyze issues related to the entire
>>>>> IO/networking/passthrough virtualization stacks, especially in a production
>>>>> env where the issue can only be reproduced randomly.
>>>> So it looks to me you'd better have things like this in the EventNotifier
>>>> layer and introduce qmp commands to write/read that, instead of starting with
>>>> virtio-specific commands. Or it might be even more helpful to start with some
>>>> counters for event notifiers which could be accessed via the monitor, which is
>>>> much safer in a production environment. Having artificial events is always
>>>> dangerous.
>>> The EventNotifier is just an fd used by different QEMU components. There is no
>>> way to track each EventNotifier used by a QEMU process, so I do not think we are
>>> able to track at the EventNotifier layer, unless we add extra code to track the
>>> init/uninit of each eventfd, or modify the kernel.
>>>
>>> That's why I introduced "DeviceEvent event" to "struct DeviceClass" so that each
>>> device type will be able to customize its own way to track its eventfd list.
>>>
>>>
>>> Would you prefer "write to a specific eventfd for a specific QEMU device",
>>> instead of "kick/call for a specific device"?
>>
>> It might be better. But note that eventfd is Linux-specific; that's why we need
>> to do it at a higher level (the EventNotifier level, to make it work for e.g. win).
>>
>> And it might be even better to start with reading the counters.
> Is it possible to read from an eventfd without modifying the kernel?
>
> QEMU has only event_notifier_test_and_clear(). According to the kernel code, a
> read from an eventfd will decrease ctx->count, as at line 190.
>
> 185 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
> 186 {
> 187         lockdep_assert_held(&ctx->wqh.lock);
> 188
> 189         *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> 190         ctx->count -= *cnt;
> 191 }
>
> Can I assume it is not appropriate to read from eventfd?


I don't get it here; this is how eventfd is designed (read-to-clear
behaviour if EFD_SEMAPHORE is not specified). When reading this counter
the user should know the semantics of an eventfd read.
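
For reference, a minimal userspace sketch of the read-to-clear behaviour
(illustrative only, not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    /* Without EFD_SEMAPHORE, a read returns the whole counter and resets it. */
    int fd = eventfd(0, 0);
    uint64_t val;

    eventfd_write(fd, 3);    /* counter becomes 3 */
    eventfd_read(fd, &val);  /* val == 3, counter reset to 0 */
    printf("read %" PRIu64 "\n", val);

    close(fd);
    return 0;
}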


>
>>
>>>
>>>>>> Considering we want to debug a virtio issue, only 2) or 3) is what we really care about.
>>>>>> So for kick you did (assume vhost is on):
>>>>>>
>>>>>> virtio_device_event_kick()
>>>>>>        virtio_queue_notify()
>>>>>>            event_notifier_set()
>>>>>>
>>>>>> It looks to me you're actually testing if the ioeventfd is correctly set by Qemu.
>>>>>>
>>>>>> For call you did:
>>>>>>
>>>>>> virtio_device_event_call()
>>>>>>        event_notifier_set()
>>>>>>
>>>>>> A test of whether the irqfd is correctly set by Qemu. So all of those are
>>>>>> not virtio-specific stuffs, but you introduce a virtio-specific command to
>>>>>> debug non-virtio functions.
>>>>>>
>>>>>> In the case of what you mentioned for vring_need_event(), what we really
>>>>>> want is to dump the virtqueue state from the guest. This might require some
>>>>>> work of extending the virtio spec (e.g., to dump device status like indices,
>>>>>> event, wrap counters).
>>>>> Suppose the issue is only randomly reproducible in production env, we should
>>>>> always take 4) into consideration because we will not be able to know where
>>>>> the root cause is at the very beginning of the bug analysis.
>>>> So if it turns out to be an issue of the irqchip, how will you do the debugging
>>>> then? I guess what's really helpful is a qmp command to dump irqchip
>>>> status/information.
>>> Thank you very much for the suggestion. That is a different problem and we may
>>> consider it as future work.
>>>
>>> This patchset is about introducing changes/events to help narrow down where the
>>> root cause may be, in order to facilitate diagnosis (especially for prod env
>>> issues that are not easy to reproduce).
>>>
>>>>>>> This was initially proposed for vhost only and I was going to export
>>>>>>> ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized I
>>>>>>> had better implement this in QEMU.
>>>>>>>
>>>>>>> QEMU inits the eventfds (ioeventfd and irqfd), and offloads them to
>>>>>>> KVM/vhost. The VM side sends requests to the ring buffer and kicks the
>>>>>>> doorbell (via ioeventfd), while the backend vhost side sends responses back
>>>>>>> and calls the IRQ (via irqfd).
>>>>>>>
>>>>>>> Unfortunately, sometimes there is an issue with virtio/vhost so that a
>>>>>>> kick/call was missed/ignored, or even never triggered. The example mentioned
>>>>>>> in the patchset cover letter is with virtio-net (I assume vhost=on), where a
>>>>>>> kick to the ioeventfd was ignored, due to a pci-bridge/hotplug issue.
>>>>>> So this is not a good example since it was a chipset bug. You need to use
>>>>>> other tools to debug chipset code, don't you?
>>>>> As this issue is reproducible only randomly, we will not be able to realize it
>>>>> is a chipset bug at the very beginning.
>>>>>
>>>>> While the link is about vhost-net, it is applicable to vhost-scsi as well.
>>>>> Suppose DEBUG_UNASSIGNED is not enabled, the developer will need to investigate
>>>>> all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
>>>>> and pci-hotplug, in order to identify the root cause.
>>>>>
>>>>> The "call/kick" interface is used to narrow down and verify the analysis,
>>>>> especially when many kernel components are involved.
>>>>>
>>>>>>> The hotplug window is very small but the IO hangs permanently. I did test
>>>>>>> that kicking the doorbell again helps recover the IO, so I was able to
>>>>>>> conclude this was due to a lost kick.
>>>>>>>
>>>>>>> The loss of irq/doorbell is painful especially in production environment
>>>>>>> where
>>>>>>> we are not able to attach to QEMU via gdb. While the patchset is only for
>>>>>>> QEMU,
>>>>>>> Xen PV driver used to experience loss of IRQ issue as well, e.g., linux
>>>>>>> kernel
>>>>>>> commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU
>>>>>>> binding").
>>>>>> So looking at the git history we can see there is little possibility that the
>>>>>> missed event is due to virtio/vhost itself. So the commit you mention here is
>>>>>> not a good example either since it's not a netfront/netbackend bug.
>>>>> I mentioned the xen issue just to explain that the loss of event/irq issue may
>>>>> happen and is very painful. Another xen example is (equivalent to KVM VFIO):
>>>>>
>>>>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7
>>>>>
>>>>
>>>> Sorry, I can't figure out how this is related to VFIO or virtio. It should be
>>>> reproducible for devices without using eventfd?
>>>>
>>> Yes, although not involving eventfd, other drivers/virtualization may encounter
>>> the loss of irq/kick as well. There is no relation between xen and vfio/virtio.
>>>
>>> That's why a diagnostic interface is appreciated.
>>>
>>> In my opinion, the 'diagnostic' is not only to collect data,
>>
>> Usually, collecting the data is the first step :)
>>
>>
>>> but also to
>>> introduce an event/change (e.g., inject an IRQ) and then monitor/observe what
>>> happens to the stalled VM.
>>
>> It might be helpful yes, but it's also very dangerous.
>>
>>
>>>>> That's why I mentioned this idea helps for VFIO (not only VFIO, but actually
>>>>> blk-mq+nvme+pci+kvm+vfio, assuming it is for nvme passthrough) as well.
>>>>>
>>>>>> So for the case of event call, what you did is:
>>>>>>
>>>>>> static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
>>>>>>                                         Error **errp)
>>>>>> {
>>>>>> #ifdef DEBUG_VIRTIO_EVENT
>>>>>>        printf("The 'call' event is triggered for path=%s, queue=%d,
>>>>>> irqfd=%d.\n",
>>>>>>               object_get_canonical_path(OBJECT(vq->vdev)),
>>>>>>               vq->queue_index, eventfd);
>>>>>> #endif
>>>>>>
>>>>>>        if (eventfd) {
>>>>>>            virtio_set_isr(vq->vdev, 0x1);
>>>>>>            event_notifier_set(&vq->guest_notifier);
>>>>>>        } else {
>>>>>>            virtio_irq(vq);
>>>>>>        }
>>>>>> }
>>>>>>
>>>>>> This means, when eventfd is set, you bypass the MSI mask, which is very
>>>>>> dangerous to use in a production environment. And if you
>>>>>> check masking, it won't help a lot if the MSI is masked wrongly.
>>>>> You are right.
>>>>>
>>>>> Only vhost-net is dangerous because it masks a vector by registering an
>>>>> alternative masked_notifier to KVM, while virtio-blk/vhost-scsi/virtio-scsi
>>>>> will un-register the guest notifier.
>>>>>
>>>>>>> This can help "narrow down" if the IO/networking hang is due to loss of
>>>>>>> IRQ/doorbell issue (or VM MSI-x is erroneously masked), especially in
>>>>>>> production
>>>>>>> env. This can also be used as a workaround so that VM owner will not need to
>>>>>>> reboot VM.
>>>>>> So having such an extra workaround is pretty dangerous in a production
>>>>>> environment, where I think we need to be conservative, which means we need to
>>>>>> collect information instead of generating artificial events.
>>>>>>
>>>>>> And it doesn't help if the workaround can be triggered through a management API.
>>>>> I agree with this.
>>>>>
>>>>> This depends on the administrator. This workaround should only be used in very
>>>>> limited and special cases.
>>>>>
>>>>>>> In addition, VFIO will benefit from it. We will be able to test whether
>>>>>>> injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe,
>>>>>>> Ethernet) developers blame the hang on loss of IRQ with KVM/VFIO. (It seems
>>>>>>> there is more chance to lose an IRQ during CPU hotplug or when changing IRQ
>>>>>>> affinity.)
>>>>>>>
>>>>>>>> I think we could not gain much by introducing a dedicated mechanism for
>>>>>>>> such a corner case.
>>>>>>> As Dave replied to the prior RFC, QEMU already supports hmp_ioport_write
>>>>>>> to trigger an ioport write on purpose.
>>>>>> If that applies, I would rather have a hmp_mem_write; then we can test both
>>>>>> MSI and doorbell. But again, they are very dangerous to use in a production
>>>>>> environment.
>>>>> This is just not convenient for a production env administrator. We would
>>>>> need to first obtain the virtio pci info (e.g., via "lspci -vv"), and then
>>>>> prepare the command after calculating the doorbell address.
>>>>>
>>>>> Something bad may happen if the developer did not calculate the address
>>>>> correctly.
>>>> It won't be worse than hmp_ioport_write I think?
>>> I always believe it is worth adding hmp_mem_write().
>>>
>>> While it won't be worse than hmp_ioport_write(), in my opinion, it is not as
>>> easy/convenient as writing to an eventfd.
>>>
>>>>> It should be much easier for a developer to just ask the administrator to
>>>>> "call" queue X for a specific virtio device.
>>>> We can have commands like "info virtio" which can show all the MSI/doorbell
>>>> information for the user. Or limit its use to virtio and vfio devices only to
>>>> avoid unexpected results.
>>> So far the method used by this patchset is to introduce "DeviceEvent event" to
>>> "struct DeviceClass".
>>>
>>> Only virtio-pci-xxx and vfio (future work) will implement this interface.
>>>
>>>
>>>>>>> The linux block layer also supports the below to kick the IO queue on
>>>>>>> purpose.
>>>>>>>
>>>>>>> echo "kick" > /sys/kernel/debug/block/sda/state
>>>>>> This might be fine for a hardware device but not virtio. The device can
>>>>>> choose to poll the virtqueue instead of depending on the doorbell to work.
>>>>>> And for the networking subsystem, we don't have such stuff; instead, ethtool
>>>>>> supports dumping rings and vendor-specific state, which could be used for
>>>>>> dumping virtqueue state in this case.
>>>>> This is just another example to help explain the philosophy behind the
>>>>> "kick/call" idea: sometimes triggering the event on purpose helps us narrow
>>>>> down and verify our analysis of a kernel bug, especially a bug that is only
>>>>> randomly reproducible in a production environment.
>>>>>
>>>>>
>>>>> I understand it is possibly not proper to introduce such an interface into
>>>>> QEMU. That's why I sent out the RFC first.
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html
>>>>>
>>>>>
>>>>> In my opinion, this interface is pretty useful when the diagnosis involves
>>>>> many kernel components, or when developers from different components are
>>>>> working on the same bug, no matter whether the root cause is in virtio or not.
>>>> So for virtio, it's not hard to trigger events without those interfaces. E.g.,
>>>> for networking you can generate some traffic and trace wherever you suspect
>>>> the event (kick/call) could be blocked.
>>> Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and
>>> its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be
>>> full as well. I do not have a way to generate traffic for the RX path in such a
>>> situation.
>>
>> Right, but as discussed, we need an interface to dump virtqueue state; that
>> would be a very easy place to start.
>>
>>
>>>> I still prefer hmp_mem_write()/read(), which looks more generic; at the same
>>>> time, we can add more debug information like:
>>>>
>>>> 1) statistics like eventfd counters
>>>> 2) device information, register layout, doorbell/MSI-X information
>>>> 3) irqchip information
>>> Would you mind helping with the questions below?
>>>
>>> 1. Regardless of kick/call or hmp_mem_write(), is it safe to add such
>>> interfaces? I think it is safe because:
>>>
>>> (1) This affects only a specific VM (QEMU), not all/others.
>>>
>>> (2) It is dangerous only when the sysadmin triggers the events on purpose. If this
>>> interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
>>> inject uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash kernel)
>>> will be dangerous as well.
>>>
>>> (3) While this is implemented for only vhost-scsi-pci and vhost-vhost-pci, I do
>>> not see an issue for the host kernel. It would be a security bug if
>>> reading/writing an eventfd from userspace could crash kernel space.
>>>
>>> (4) We primarily use this interface when the VM is running into an issue
>>> (unless we use it as a workaround).
>>
>> Besides the above, I think it's only "safe" if we clearly define the semantics of
>> this command. E.g:
>>
>> 1) Does it work at EventNotifier (eventfd) level or virtio/vfio level?
> I am still confused about the difference between the EventNotifier level and
> the virtio/vfio level.
>
> There is no global mechanism to track the EventNotifiers used by each device.
> We will still need a per-device-type interface to dump the EventNotifiers for
> each device.
>
> Please see more below.
>
>> 2) Can it bypass the masking?
> This is a good question and please see below for the answer.
>
>>
>>>
>>> 2. Is it fine to add such an interface to QEMU upstream, or is such an
>>> interface not good for upstream so that it should be added only when QEMU is
>>> customized for specific products' usage?
>>
>> We can listen to the comments from other experts on the list. But I think
>> having a generic trigger at the EventNotifier level should be ok.
> Would you mind sharing and confirming whether this is what you are looking for?
>
> To dump the EventNotifiers for each device.
>
> (qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 dump
>
> ... list all event-notifier related to this device ...
>
>
> Write to a specific EventNotifier. The id is from the prior dump.
>
> (qemu) x-debug-device-event-notifier /machine/peripheral/vscsi0 write <dump id>
>
> ... print which event-notifier is written to ...
>

Something like this, yes.


> This will answer your question that "Can it bypass the masking?".
>
> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
> cannot bypass masking because masking unregisters the eventfd. A write to the
> eventfd does not take effect.
>
> However, it is possible to bypass masking for vhost-net because vhost-net
> registers a specific masked_notifier eventfd in order to mask the irq. A write
> to the original eventfd still takes effect.


Right, using those commands assumes the user has a clear understanding
of how the EventNotifier is expected to work for different backends.


>
> We may leave it to the user to decide whether to write to the 'masked_notifier'
> or the original 'guest_notifier' for vhost-net.


Yes.

Thanks


>
> Thank you very much!
>
> Dongli Zhang
>
>
>> Thanks
>>
>>
>>>
>>> We may discuss the 'how', e.g., hmp_mem_write() vs. kick/call, if it is fine to add
>>> such interfaces.
>>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>>>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-06 23:27                 ` Dongli Zhang
@ 2021-04-07  2:20                   ` Jason Wang
  2021-04-08  5:51                     ` Dongli Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2021-04-07  2:20 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


On 2021/4/7 7:27 AM, Dongli Zhang wrote:
>> This will answer your question that "Can it bypass the masking?".
>>
>> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
>> cannot bypass masking because masking unregisters the eventfd. A write to the
>> eventfd does not take effect.
>>
>> However, it is possible to bypass masking for vhost-net because vhost-net
>> registers a specific masked_notifier eventfd in order to mask the irq. A write
>> to the original eventfd still takes effect.
>>
>> We may leave it to the user to decide whether to write to the 'masked_notifier'
>> or the original 'guest_notifier' for vhost-net.
> My fault here. A write to the masked_notifier will always be masked :(


Only when there's no bug in QEMU.


>
> If it is at the EventNotifier level, we will not care whether the EventNotifier
> is masked or not. It just provides an interface to write to the EventNotifier.


Yes.


>
> Dumping the MSI-X table for both virtio and vfio will help confirm whether the
> vector is masked.


That would be helpful as well. It's probably better to extend the "info pci"
command.

Thanks


>
> Thank you very much!
>
> Dongli Zhang
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event
  2021-03-26  5:44 ` [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event Dongli Zhang
@ 2021-04-07 13:40   ` Eduardo Habkost
  2021-04-08  5:49     ` Dongli Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Eduardo Habkost @ 2021-04-07 13:40 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: kwolf, fam, berrange, qemu-block, mst, armbru, jasowang, joe.jin,
	qemu-devel, dgilbert, stefanha, pbonzini, mreitz

On Thu, Mar 25, 2021 at 10:44:28PM -0700, Dongli Zhang wrote:
> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
> the loss of doorbell kick, e.g.,
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
> 
> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
> 
> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
> to help narrow down if the issue is due to loss of irq/kick. So far the new
> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
> or legacy IRQ).
> 
> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
> on purpose by admin at QEMU/host side for a specific device.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
[...]
> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
> index 605d57287a..c7795d4ba5 100644
> --- a/include/monitor/hmp.h
> +++ b/include/monitor/hmp.h
> @@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
>  void hmp_replay_break(Monitor *mon, const QDict *qdict);
>  void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
>  void hmp_replay_seek(Monitor *mon, const QDict *qdict);
> +void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
>  
>  #endif
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index b83178220b..711c4a297a 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -124,3 +124,33 @@
>  ##
>  { 'event': 'DEVICE_DELETED',
>    'data': { '*device': 'str', 'path': 'str' } }
> +
> +##
> +# @x-debug-device-event:
> +#
> +# Generate device event for a specific device queue
> +#
> +# @dev: device path
> +#
> +# @event: event (e.g., kick or call) to trigger

Any specific reason to not use an enum here?

In addition to making the QAPI schema and documentation more
descriptive, it would save you the work of manually defining the
DEVICE_EVENT_* constants and implementing get_device_event().
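
For example, something along these lines (a rough sketch; the enum name is
only a suggestion):

{ 'enum': 'DeviceEventType',
  'data': [ 'kick', 'call' ] }

{ 'command': 'x-debug-device-event',
  'data': {'dev': 'str', 'event': 'DeviceEventType', 'queue': 'int'} }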


> +#
> +# @queue: queue id
> +#
> +# Returns: Nothing on success
> +#
> +# Since: 6.1
> +#
> +# Notes: This is used to debug VM driver hang issue. The 'kick' event is to
> +#        send notification to QEMU/vhost while the 'call' event is to
> +#        interrupt VM on purpose.
> +#
> +# Example:
> +#
> +# -> { "execute": "x-debug-device-event",
> +#      "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
> +#                     "queue": 1 } }
> +# <- { "return": {} }
> +#
> +##
> +{ 'command': 'x-debug-device-event',
> +  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
[...]

-- 
Eduardo



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event
  2021-04-07 13:40   ` Eduardo Habkost
@ 2021-04-08  5:49     ` Dongli Zhang
  0 siblings, 0 replies; 22+ messages in thread
From: Dongli Zhang @ 2021-04-08  5:49 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: kwolf, fam, berrange, qemu-block, mst, jasowang, joe.jin, armbru,
	qemu-devel, stefanha, pbonzini, mreitz, dgilbert



On 4/7/21 6:40 AM, Eduardo Habkost wrote:
> On Thu, Mar 25, 2021 at 10:44:28PM -0700, Dongli Zhang wrote:
>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>> the loss of doorbell kick, e.g.,
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>
>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>
>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>> or legacy IRQ).
>>
>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>> on purpose by admin at QEMU/host side for a specific device.
>>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> [...]
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index 605d57287a..c7795d4ba5 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_seek(Monitor *mon, const QDict *qdict);
>> +void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
>>  
>>  #endif
>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>> index b83178220b..711c4a297a 100644
>> --- a/qapi/qdev.json
>> +++ b/qapi/qdev.json
>> @@ -124,3 +124,33 @@
>>  ##
>>  { 'event': 'DEVICE_DELETED',
>>    'data': { '*device': 'str', 'path': 'str' } }
>> +
>> +##
>> +# @x-debug-device-event:
>> +#
>> +# Generate device event for a specific device queue
>> +#
>> +# @dev: device path
>> +#
>> +# @event: event (e.g., kick or call) to trigger
> 
> Any specific reason to not use an enum here?
> 
> In addition to making the QAPI schema and documentation more
> descriptive, it would save you the work of manually defining the
> DEVICE_EVENT_* constants and implementing get_device_event().

Thank you very much for the suggestion!

I will use an enum in the json file.

Dongli Zhang

> 
> 
>> +#
>> +# @queue: queue id
>> +#
>> +# Returns: Nothing on success
>> +#
>> +# Since: 6.1
>> +#
>> +# Notes: This is used to debug VM driver hang issue. The 'kick' event is to
>> +#        send notification to QEMU/vhost while the 'call' event is to
>> +#        interrupt VM on purpose.
>> +#
>> +# Example:
>> +#
>> +# -> { "execute": "x-debug-device-event",
>> +#      "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
>> +#                     "queue": 1 } }
>> +# <- { "return": {} }
>> +#
>> +##
>> +{ 'command': 'x-debug-device-event',
>> +  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
> [...]
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-07  2:20                   ` Jason Wang
@ 2021-04-08  5:51                     ` Dongli Zhang
  2021-04-08  5:59                       ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Dongli Zhang @ 2021-04-08  5:51 UTC (permalink / raw)
  To: Jason Wang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz



On 4/6/21 7:20 PM, Jason Wang wrote:
> 
> On 2021/4/7 7:27 AM, Dongli Zhang wrote:
>>> This will answer your question that "Can it bypass the masking?".
>>>
>>> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
>>> cannot bypass masking because masking unregisters the eventfd. A write to the
>>> eventfd does not take effect.
>>>
>>> However, it is possible to bypass masking for vhost-net because vhost-net
>>> registers a specific masked_notifier eventfd in order to mask the irq. A write
>>> to the original eventfd still takes effect.
>>>
>>> We may leave it to the user to decide whether to write to the 'masked_notifier'
>>> or the original 'guest_notifier' for vhost-net.
>> My fault here. A write to the masked_notifier will always be masked :(
> 
> 
> Only when there's no bug in QEMU.
> 
> 
>>
>> If it is at the EventNotifier level, we will not care whether the EventNotifier
>> is masked or not. It just provides an interface to write to the EventNotifier.
> 
> 
> Yes.
> 
> 
>>
>> Dumping the MSI-X table for both virtio and vfio will help confirm whether the
>> vector is masked.
> 
> 
> That would be helpful as well. It's probably better to extend the "info pci" command.
> 
> Thanks

I will try whether to add it to "info pci" (introducing a new arg option to
"info pci"), or to introduce a new command.

About the EventNotifiers, I will classify them as guest notifiers or host
notifiers so that it will be much easier for the user to tell whether the
eventfd is for injecting an IRQ or kicking the doorbell.

Thank you very much for all suggestions!

Dongli Zhang

> 
> 
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] Add debug interface to kick/call on purpose
  2021-04-08  5:51                     ` Dongli Zhang
@ 2021-04-08  5:59                       ` Jason Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Jason Wang @ 2021-04-08  5:59 UTC (permalink / raw)
  To: Dongli Zhang, qemu-block, qemu-devel
  Cc: kwolf, fam, berrange, ehabkost, mst, joe.jin, armbru, dgilbert,
	stefanha, pbonzini, mreitz


On 2021/4/8 1:51 PM, Dongli Zhang wrote:
>
> On 4/6/21 7:20 PM, Jason Wang wrote:
>> On 2021/4/7 7:27 AM, Dongli Zhang wrote:
>>>> This will answer your question that "Can it bypass the masking?".
>>>>
>>>> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the eventfd
>>>> cannot bypass masking because masking unregisters the eventfd. A write to the
>>>> eventfd does not take effect.
>>>>
>>>> However, it is possible to bypass masking for vhost-net because vhost-net
>>>> registers a specific masked_notifier eventfd in order to mask the irq. A write
>>>> to the original eventfd still takes effect.
>>>>
>>>> We may leave it to the user to decide whether to write to the 'masked_notifier'
>>>> or the original 'guest_notifier' for vhost-net.
>>> My fault here. A write to the masked_notifier will always be masked :(
>>
>> Only when there's no bug in QEMU.
>>
>>
>>> If it is at the EventNotifier level, we will not care whether the EventNotifier
>>> is masked or not. It just provides an interface to write to the EventNotifier.
>>
>> Yes.
>>
>>
>>> Dumping the MSI-X table for both virtio and vfio will help confirm whether the
>>> vector is masked.
>>
>> That would be helpful as well. It's probably better to extend the "info pci" command.
>>
>> Thanks
> I will try whether to add it to "info pci" (introducing a new arg option to
> "info pci"), or to introduce a new command.


It's better to just reuse "info pci".


>
> About the EventNotifiers, I will classify them as guest notifiers or host
> notifiers so that it will be much easier for the user to tell whether the
> eventfd is for injecting an IRQ or kicking the doorbell.


Sounds good.


>
> Thank you very much for all suggestions!
>
> Dongli Zhang


Thanks


>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>>>



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-04-08  6:01 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26  5:44 [PATCH 0/6] Add debug interface to kick/call on purpose Dongli Zhang
2021-03-26  5:44 ` [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event Dongli Zhang
2021-04-07 13:40   ` Eduardo Habkost
2021-04-08  5:49     ` Dongli Zhang
2021-03-26  5:44 ` [PATCH 2/6] virtio: introduce helper function for kick/call device event Dongli Zhang
2021-03-26  5:44 ` [PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call Dongli Zhang
2021-03-26  5:44 ` [PATCH 4/6] virtio-scsi-pci: " Dongli Zhang
2021-03-26  5:44 ` [PATCH 5/6] vhost-scsi-pci: " Dongli Zhang
2021-03-26  5:44 ` [PATCH 6/6] virtio-net-pci: " Dongli Zhang
2021-03-26  7:24 ` [PATCH 0/6] Add debug interface to kick/call on purpose Jason Wang
2021-03-26 21:16   ` Dongli Zhang
2021-03-29  3:56     ` Jason Wang
2021-03-30  7:29       ` Dongli Zhang
2021-04-02  3:47         ` Jason Wang
2021-04-05 20:00           ` Dongli Zhang
2021-04-06  1:55             ` Jason Wang
2021-04-06  8:43               ` Dongli Zhang
2021-04-06 23:27                 ` Dongli Zhang
2021-04-07  2:20                   ` Jason Wang
2021-04-08  5:51                     ` Dongli Zhang
2021-04-08  5:59                       ` Jason Wang
2021-04-07  2:18                 ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).