All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
@ 2022-05-19 14:19 Konstantin Khlebnikov
  2022-05-19 14:19 ` [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error() Konstantin Khlebnikov
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2022-05-19 14:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: yc-core

This event represents device runtime errors to give time and
reason why device is broken.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 hw/core/qdev.c         |    7 +++++++
 include/hw/qdev-core.h |    1 +
 qapi/qdev.json         |   26 ++++++++++++++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 84f3019440..e95ceb071b 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -347,6 +347,13 @@ void qdev_unrealize(DeviceState *dev)
     object_property_set_bool(OBJECT(dev), "realized", false, &error_abort);
 }
 
+void qdev_report_runtime_error(DeviceState *dev, const Error *err)
+{
+    qapi_event_send_device_runtime_error(!!dev->id, dev->id,
+                                         dev->canonical_path,
+                                         error_get_pretty(err));
+}
+
 static int qdev_assert_realized_properly_cb(Object *obj, void *opaque)
 {
     DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE));
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 92c3d65208..9ced2e0f09 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -396,6 +396,7 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
  * the life of the simulation and should not be unrealized and freed.
  */
 void qdev_unrealize(DeviceState *dev);
+void qdev_report_runtime_error(DeviceState *dev, const Error *err);
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
                                  int required_for_version);
 HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev);
diff --git a/qapi/qdev.json b/qapi/qdev.json
index 26cd10106b..89ef32eef7 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -159,3 +159,29 @@
 ##
 { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @DEVICE_RUNTIME_ERROR:
+#
+# Emitted when a device fails in runtime.
+#
+# @device: the device's ID if it has one
+#
+# @path: the device's QOM path
+#
+# @reason: human readable description
+#
+# Since: 7.1
+#
+# Example:
+#
+# <- { "event": "DEVICE_RUNTIME_ERROR"
+#      "data": { "device": "virtio-net-pci-0",
+#                "path": "/machine/peripheral/virtio-net-pci-0",
+#                "reason": "virtio-net header incorrect" },
+#      },
+#      "timestamp": { "seconds": 1615570772, "microseconds": 202844 } }
+#
+##
+{ 'event': 'DEVICE_RUNTIME_ERROR',
+        'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error()
  2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
@ 2022-05-19 14:19 ` Konstantin Khlebnikov
  2022-05-24 19:25   ` Vladimir Sementsov-Ogievskiy
  2022-05-19 14:19 ` [PATCH 3/4] vhost: add method vhost_set_vring_err Konstantin Khlebnikov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2022-05-19 14:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: yc-core

Repalce virtio_error() with macro which forms structured Error and
reports it as device runtime-error in addition to present actions.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 hw/virtio/virtio.c         |    9 +++------
 include/hw/virtio/virtio.h |   10 +++++++++-
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5d607aeaa0..638d779bf2 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3642,13 +3642,10 @@ void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name)
     vdev->bus_name = g_strdup(bus_name);
 }
 
-void G_GNUC_PRINTF(2, 3) virtio_error(VirtIODevice *vdev, const char *fmt, ...)
+void virtio_fatal_error(VirtIODevice *vdev, Error *err)
 {
-    va_list ap;
-
-    va_start(ap, fmt);
-    error_vreport(fmt, ap);
-    va_end(ap);
+    qdev_report_runtime_error(&vdev->parent_obj, err);
+    error_report_err(err);
 
     if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
         vdev->status = vdev->status | VIRTIO_CONFIG_S_NEEDS_RESET;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index db1c0ddf6b..a165e35b0b 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -16,6 +16,7 @@
 
 #include "exec/memory.h"
 #include "hw/qdev-core.h"
+#include "qapi/error.h"
 #include "net/net.h"
 #include "migration/vmstate.h"
 #include "qemu/event_notifier.h"
@@ -172,7 +173,14 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size);
 
 void virtio_cleanup(VirtIODevice *vdev);
 
-void virtio_error(VirtIODevice *vdev, const char *fmt, ...) G_GNUC_PRINTF(2, 3);
+#define virtio_error(vdev, fmt, ...) {              \
+    Error *_err = NULL;                             \
+    error_setg(&_err, (fmt), ## __VA_ARGS__);       \
+    virtio_fatal_error(vdev, _err);                 \
+} while (0)
+
+/* Reports and frees error, breaks device */
+void virtio_fatal_error(VirtIODevice *vdev, Error *err);
 
 /* Set the child bus name. */
 void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] vhost: add method vhost_set_vring_err
  2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
  2022-05-19 14:19 ` [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error() Konstantin Khlebnikov
@ 2022-05-19 14:19 ` Konstantin Khlebnikov
  2022-05-19 14:19 ` [PATCH 4/4] vhost: forward vring errors into virtio device Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2022-05-19 14:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: yc-core

Kernel and user vhost may report virtqueue errors via eventfd.
This is only reliable way to get notification about protocol error.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 hw/virtio/vhost-backend.c         |    7 +++++++
 hw/virtio/vhost-user.c            |    6 ++++++
 include/hw/virtio/vhost-backend.h |    3 +++
 3 files changed, 16 insertions(+)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 4de8b6b3b0..8e581575c9 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -146,6 +146,12 @@ static int vhost_kernel_set_vring_call(struct vhost_dev *dev,
     return vhost_kernel_call(dev, VHOST_SET_VRING_CALL, file);
 }
 
+static int vhost_kernel_set_vring_err(struct vhost_dev *dev,
+                                      struct vhost_vring_file *file)
+{
+    return vhost_kernel_call(dev, VHOST_SET_VRING_ERR, file);
+}
+
 static int vhost_kernel_set_vring_busyloop_timeout(struct vhost_dev *dev,
                                                    struct vhost_vring_state *s)
 {
@@ -309,6 +315,7 @@ const VhostOps kernel_ops = {
         .vhost_get_vring_base = vhost_kernel_get_vring_base,
         .vhost_set_vring_kick = vhost_kernel_set_vring_kick,
         .vhost_set_vring_call = vhost_kernel_set_vring_call,
+        .vhost_set_vring_err = vhost_kernel_set_vring_err,
         .vhost_set_vring_busyloop_timeout =
                                 vhost_kernel_set_vring_busyloop_timeout,
         .vhost_set_features = vhost_kernel_set_features,
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index b040c1ad2b..37f8d4ab6f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1313,6 +1313,11 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev,
     return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
 }
 
+static int vhost_user_set_vring_err(struct vhost_dev *dev,
+                                    struct vhost_vring_file *file)
+{
+    return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_ERR, file);
+}
 
 static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
 {
@@ -2618,6 +2623,7 @@ const VhostOps user_ops = {
         .vhost_get_vring_base = vhost_user_get_vring_base,
         .vhost_set_vring_kick = vhost_user_set_vring_kick,
         .vhost_set_vring_call = vhost_user_set_vring_call,
+        .vhost_set_vring_err = vhost_user_set_vring_err,
         .vhost_set_features = vhost_user_set_features,
         .vhost_get_features = vhost_user_get_features,
         .vhost_set_owner = vhost_user_set_owner,
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 81bf3109f8..eab46d7f0b 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -69,6 +69,8 @@ typedef int (*vhost_set_vring_kick_op)(struct vhost_dev *dev,
                                        struct vhost_vring_file *file);
 typedef int (*vhost_set_vring_call_op)(struct vhost_dev *dev,
                                        struct vhost_vring_file *file);
+typedef int (*vhost_set_vring_err_op)(struct vhost_dev *dev,
+                                      struct vhost_vring_file *file);
 typedef int (*vhost_set_vring_busyloop_timeout_op)(struct vhost_dev *dev,
                                                    struct vhost_vring_state *r);
 typedef int (*vhost_set_features_op)(struct vhost_dev *dev,
@@ -145,6 +147,7 @@ typedef struct VhostOps {
     vhost_get_vring_base_op vhost_get_vring_base;
     vhost_set_vring_kick_op vhost_set_vring_kick;
     vhost_set_vring_call_op vhost_set_vring_call;
+    vhost_set_vring_err_op vhost_set_vring_err;
     vhost_set_vring_busyloop_timeout_op vhost_set_vring_busyloop_timeout;
     vhost_set_features_op vhost_set_features;
     vhost_get_features_op vhost_get_features;



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] vhost: forward vring errors into virtio device
  2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
  2022-05-19 14:19 ` [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error() Konstantin Khlebnikov
  2022-05-19 14:19 ` [PATCH 3/4] vhost: add method vhost_set_vring_err Konstantin Khlebnikov
@ 2022-05-19 14:19 ` Konstantin Khlebnikov
  2022-05-24 19:04 ` [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Vladimir Sementsov-Ogievskiy
  2022-05-25 10:54 ` Markus Armbruster
  4 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2022-05-19 14:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: yc-core

Setup eventfd for vring error notifications.
Add eventfd for each virt-queue to detect which queue faced error.

For example vhost-net in kernel silently stop working at first error.
Now we'll see message and qmp event if guest driver did something wrong.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 hw/virtio/vhost.c         |   37 +++++++++++++++++++++++++++++++++++++
 include/hw/virtio/vhost.h |    1 +
 2 files changed, 38 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index dd3263df56..5015e1fd81 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1274,6 +1274,19 @@ static int vhost_virtqueue_set_busyloop_timeout(struct vhost_dev *dev,
     return 0;
 }
 
+static void vhost_virtqueue_error_notifier(EventNotifier *n)
+{
+    struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
+                                              error_notifier);
+    struct vhost_dev *dev = vq->dev;
+    int index = vq - dev->vqs;
+
+    if (event_notifier_test_and_clear(n) && dev->vdev) {
+        virtio_error(dev->vdev, "vhost vring error in virtqueue %d",
+                     dev->vq_index + index);
+    }
+}
+
 static int vhost_virtqueue_init(struct vhost_dev *dev,
                                 struct vhost_virtqueue *vq, int n)
 {
@@ -1295,7 +1308,27 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 
     vq->dev = dev;
 
+    if (dev->vhost_ops->vhost_set_vring_err) {
+        r = event_notifier_init(&vq->error_notifier, 0);
+        if (r < 0) {
+            goto fail_call;
+        }
+
+        file.fd = event_notifier_get_fd(&vq->error_notifier);
+        r = dev->vhost_ops->vhost_set_vring_err(dev, &file);
+        if (r) {
+            VHOST_OPS_DEBUG(r, "vhost_set_vring_err failed");
+            goto fail_err;
+        }
+
+        event_notifier_set_handler(&vq->error_notifier,
+                                   vhost_virtqueue_error_notifier);
+    }
+
     return 0;
+
+fail_err:
+    event_notifier_cleanup(&vq->error_notifier);
 fail_call:
     event_notifier_cleanup(&vq->masked_notifier);
     return r;
@@ -1304,6 +1337,10 @@ fail_call:
 static void vhost_virtqueue_cleanup(struct vhost_virtqueue *vq)
 {
     event_notifier_cleanup(&vq->masked_notifier);
+    if (vq->dev->vhost_ops->vhost_set_vring_err) {
+        event_notifier_set_handler(&vq->error_notifier, NULL);
+        event_notifier_cleanup(&vq->error_notifier);
+    }
 }
 
 int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index b291fe4e24..1e7cbd9a10 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -29,6 +29,7 @@ struct vhost_virtqueue {
     unsigned long long used_phys;
     unsigned used_size;
     EventNotifier masked_notifier;
+    EventNotifier error_notifier;
     struct vhost_dev *dev;
 };
 



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2022-05-19 14:19 ` [PATCH 4/4] vhost: forward vring errors into virtio device Konstantin Khlebnikov
@ 2022-05-24 19:04 ` Vladimir Sementsov-Ogievskiy
  2022-05-25  8:26   ` Konstantin Khlebnikov
  2022-05-25 10:54 ` Markus Armbruster
  4 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-05-24 19:04 UTC (permalink / raw)
  To: Konstantin Khlebnikov, qemu-devel
  Cc: yc-core, eblake, Markus Armbruster, mst, Denis Plotnikov

First, cover letter is absent. Konstantin, could you please provide a description what the whole series does?

Second, add maintainers to CC:
+Micheal
+Eric
+Markus

On 5/19/22 17:19, Konstantin Khlebnikov wrote:
> This event represents device runtime errors to give time and
> reason why device is broken.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---

The patch itself seems good to me:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error()
  2022-05-19 14:19 ` [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error() Konstantin Khlebnikov
@ 2022-05-24 19:25   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-05-24 19:25 UTC (permalink / raw)
  To: Konstantin Khlebnikov, qemu-devel; +Cc: yc-core

On 5/19/22 17:19, Konstantin Khlebnikov wrote:
> Repalce virtio_error() with macro which forms structured Error and
> reports it as device runtime-error in addition to present actions.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   hw/virtio/virtio.c         |    9 +++------
>   include/hw/virtio/virtio.h |   10 +++++++++-
>   2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 5d607aeaa0..638d779bf2 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -3642,13 +3642,10 @@ void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name)
>       vdev->bus_name = g_strdup(bus_name);
>   }
>   
> -void G_GNUC_PRINTF(2, 3) virtio_error(VirtIODevice *vdev, const char *fmt, ...)
> +void virtio_fatal_error(VirtIODevice *vdev, Error *err)
>   {
> -    va_list ap;
> -
> -    va_start(ap, fmt);
> -    error_vreport(fmt, ap);
> -    va_end(ap);
> +    qdev_report_runtime_error(&vdev->parent_obj, err);
> +    error_report_err(err);
>   
>       if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>           vdev->status = vdev->status | VIRTIO_CONFIG_S_NEEDS_RESET;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index db1c0ddf6b..a165e35b0b 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -16,6 +16,7 @@
>   
>   #include "exec/memory.h"
>   #include "hw/qdev-core.h"
> +#include "qapi/error.h"
>   #include "net/net.h"
>   #include "migration/vmstate.h"
>   #include "qemu/event_notifier.h"
> @@ -172,7 +173,14 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size);
>   
>   void virtio_cleanup(VirtIODevice *vdev);
>   
> -void virtio_error(VirtIODevice *vdev, const char *fmt, ...) G_GNUC_PRINTF(2, 3);
> +#define virtio_error(vdev, fmt, ...) {              \
> +    Error *_err = NULL;                             \
> +    error_setg(&_err, (fmt), ## __VA_ARGS__);       \
> +    virtio_fatal_error(vdev, _err);                 \
> +} while (0)
> +
> +/* Reports and frees error, breaks device */
> +void virtio_fatal_error(VirtIODevice *vdev, Error *err);
>   
>   /* Set the child bus name. */
>   void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name);
> 

Hmm. So we create temporary Error object just to pass it to qdev_report_runtime_error..

I think we can avoid introducing this intermediate Error object together with new macro:
just convert argument list to string with help of g_strdup_vprintf() in original virtio_error(), error_report this string and pass to qdev_report_runtime_error() (which should be simplified to get just a string).

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-24 19:04 ` [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Vladimir Sementsov-Ogievskiy
@ 2022-05-25  8:26   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2022-05-25  8:26 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: yc-core, eblake, Markus Armbruster, mst, Denis Plotnikov

[-- Attachment #1: Type: text/html, Size: 2668 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
                   ` (3 preceding siblings ...)
  2022-05-24 19:04 ` [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Vladimir Sementsov-Ogievskiy
@ 2022-05-25 10:54 ` Markus Armbruster
  2022-05-27 12:49   ` Roman Kagan
  4 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2022-05-25 10:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: qemu-devel, yc-core, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

I almost missed this series.  Please cc: maintainers.
scripts/get_maintainer.pl can help you find them, but do use your
judgement to maybe trim its output.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> This event represents device runtime errors to give time and
> reason why device is broken.

Can you give an or more examples of the "device runtime errors" you have
in mind?

>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  hw/core/qdev.c         |    7 +++++++
>  include/hw/qdev-core.h |    1 +
>  qapi/qdev.json         |   26 ++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+)
>
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index 84f3019440..e95ceb071b 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -347,6 +347,13 @@ void qdev_unrealize(DeviceState *dev)
>      object_property_set_bool(OBJECT(dev), "realized", false, &error_abort);
>  }
>  
> +void qdev_report_runtime_error(DeviceState *dev, const Error *err)
> +{
> +    qapi_event_send_device_runtime_error(!!dev->id, dev->id,
> +                                         dev->canonical_path,
> +                                         error_get_pretty(err));
> +}
> +
>  static int qdev_assert_realized_properly_cb(Object *obj, void *opaque)
>  {
>      DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE));
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 92c3d65208..9ced2e0f09 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -396,6 +396,7 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
>   * the life of the simulation and should not be unrealized and freed.
>   */
>  void qdev_unrealize(DeviceState *dev);
> +void qdev_report_runtime_error(DeviceState *dev, const Error *err);
>  void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
>                                   int required_for_version);
>  HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev);
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 26cd10106b..89ef32eef7 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -159,3 +159,29 @@
>  ##
>  { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
>    'data': { '*device': 'str', 'path': 'str' } }
> +
> +##
> +# @DEVICE_RUNTIME_ERROR:
> +#
> +# Emitted when a device fails in runtime.

at runtime

I figure there are plenty of device state transitions in the tree that
are failures.  Better express that here.  Unless you intend to hunt them
all down so you can add this event :)

> +#
> +# @device: the device's ID if it has one
> +#
> +# @path: the device's QOM path
> +#
> +# @reason: human readable description
> +#
> +# Since: 7.1
> +#
> +# Example:
> +#
> +# <- { "event": "DEVICE_RUNTIME_ERROR"
> +#      "data": { "device": "virtio-net-pci-0",
> +#                "path": "/machine/peripheral/virtio-net-pci-0",
> +#                "reason": "virtio-net header incorrect" },
> +#      },
> +#      "timestamp": { "seconds": 1615570772, "microseconds": 202844 } }
> +#
> +##
> +{ 'event': 'DEVICE_RUNTIME_ERROR',
> +        'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-25 10:54 ` Markus Armbruster
@ 2022-05-27 12:49   ` Roman Kagan
  2022-05-30 11:28     ` Markus Armbruster
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Kagan @ 2022-05-27 12:49 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> 
> > This event represents device runtime errors to give time and
> > reason why device is broken.
> 
> Can you give an or more examples of the "device runtime errors" you have
> in mind?

Initially we wanted to address a situation when a vhost device
discovered an inconsistency during virtqueue processing and silently
stopped the virtqueue.  This resulted in device stall (partial for
multiqueue devices) and we were the last to notice that.

The solution appeared to be to employ errfd and, upon receiving a
notification through it, to emit a QMP event which is actionable in the
management layer or further up the stack.

Then we observed that virtio (non-vhost) devices suffer from the same
issue: they only log the error but don't signal it to the management
layer.  The case was very similar so we thought it would make sense to
share the infrastructure and the QMP event between virtio and vhost.

Then Konstantin went a bit further and generalized the concept into
generic "device runtime error".  I'm personally not completely convinced
this generalization is appropriate here; we'd appreciate the opinions
from the community on the matter.

HTH,
Roman.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-27 12:49   ` Roman Kagan
@ 2022-05-30 11:28     ` Markus Armbruster
  2022-05-30 15:04       ` Roman Kagan
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2022-05-30 11:28 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

Roman Kagan <rvkagan@yandex-team.ru> writes:

> On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
>> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>> 
>> > This event represents device runtime errors to give time and
>> > reason why device is broken.
>> 
>> Can you give an or more examples of the "device runtime errors" you have
>> in mind?
>
> Initially we wanted to address a situation when a vhost device
> discovered an inconsistency during virtqueue processing and silently
> stopped the virtqueue.  This resulted in device stall (partial for
> multiqueue devices) and we were the last to notice that.
>
> The solution appeared to be to employ errfd and, upon receiving a
> notification through it, to emit a QMP event which is actionable in the
> management layer or further up the stack.
>
> Then we observed that virtio (non-vhost) devices suffer from the same
> issue: they only log the error but don't signal it to the management
> layer.  The case was very similar so we thought it would make sense to
> share the infrastructure and the QMP event between virtio and vhost.
>
> Then Konstantin went a bit further and generalized the concept into
> generic "device runtime error".  I'm personally not completely convinced
> this generalization is appropriate here; we'd appreciate the opinions
> from the community on the matter.

"Device emulation sending an even on entering certain error states, so
that a management application can do something about it" feels
reasonable enough to me as a general concept.

The key point is of course "can do something": the event needs to be
actionable.  Can you describe possible actions for the cases you
implement?

Once we all have a better idea of the event's purpose, usage, and
limitations, we should revisit its documentation.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-30 11:28     ` Markus Armbruster
@ 2022-05-30 15:04       ` Roman Kagan
  2022-06-20 13:49         ` Roman Kagan
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Kagan @ 2022-05-30 15:04 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvkagan@yandex-team.ru> writes:
> 
> > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> >> 
> >> > This event represents device runtime errors to give time and
> >> > reason why device is broken.
> >> 
> >> Can you give an or more examples of the "device runtime errors" you have
> >> in mind?
> >
> > Initially we wanted to address a situation when a vhost device
> > discovered an inconsistency during virtqueue processing and silently
> > stopped the virtqueue.  This resulted in device stall (partial for
> > multiqueue devices) and we were the last to notice that.
> >
> > The solution appeared to be to employ errfd and, upon receiving a
> > notification through it, to emit a QMP event which is actionable in the
> > management layer or further up the stack.
> >
> > Then we observed that virtio (non-vhost) devices suffer from the same
> > issue: they only log the error but don't signal it to the management
> > layer.  The case was very similar so we thought it would make sense to
> > share the infrastructure and the QMP event between virtio and vhost.
> >
> > Then Konstantin went a bit further and generalized the concept into
> > generic "device runtime error".  I'm personally not completely convinced
> > this generalization is appropriate here; we'd appreciate the opinions
> > from the community on the matter.
> 
> "Device emulation sending an even on entering certain error states, so
> that a management application can do something about it" feels
> reasonable enough to me as a general concept.
> 
> The key point is of course "can do something": the event needs to be
> actionable.  Can you describe possible actions for the cases you
> implement?

The first one that we had in mind was informational, like triggering an
alert in the monitoring system and/or painting the VM as malfunctioning
in the owner's UI.

There can be more advanced scenarios like autorecovery by resetting the
faulty VM, or fencing it if it's a cluster member.

Thanks,
Roman.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-05-30 15:04       ` Roman Kagan
@ 2022-06-20 13:49         ` Roman Kagan
  2022-06-21 11:55           ` Markus Armbruster
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Kagan @ 2022-06-20 13:49 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> > Roman Kagan <rvkagan@yandex-team.ru> writes:
> > 
> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> > >> 
> > >> > This event represents device runtime errors to give time and
> > >> > reason why device is broken.
> > >> 
> > >> Can you give an or more examples of the "device runtime errors" you have
> > >> in mind?
> > >
> > > Initially we wanted to address a situation when a vhost device
> > > discovered an inconsistency during virtqueue processing and silently
> > > stopped the virtqueue.  This resulted in device stall (partial for
> > > multiqueue devices) and we were the last to notice that.
> > >
> > > The solution appeared to be to employ errfd and, upon receiving a
> > > notification through it, to emit a QMP event which is actionable in the
> > > management layer or further up the stack.
> > >
> > > Then we observed that virtio (non-vhost) devices suffer from the same
> > > issue: they only log the error but don't signal it to the management
> > > layer.  The case was very similar so we thought it would make sense to
> > > share the infrastructure and the QMP event between virtio and vhost.
> > >
> > > Then Konstantin went a bit further and generalized the concept into
> > > generic "device runtime error".  I'm personally not completely convinced
> > > this generalization is appropriate here; we'd appreciate the opinions
> > > from the community on the matter.
> > 
> > "Device emulation sending an even on entering certain error states, so
> > that a management application can do something about it" feels
> > reasonable enough to me as a general concept.
> > 
> > The key point is of course "can do something": the event needs to be
> > actionable.  Can you describe possible actions for the cases you
> > implement?
> 
> The first one that we had in mind was informational, like triggering an
> alert in the monitoring system and/or painting the VM as malfunctioning
> in the owner's UI.
> 
> There can be more advanced scenarios like autorecovery by resetting the
> faulty VM, or fencing it if it's a cluster member.

The discussion kind of stalled here.  Do you think the approach makes
sense or not?  Should we try and resubmit the series with a proper cover
letter and possibly other improvements or is it a dead end?

Thanks,
Roman.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-06-20 13:49         ` Roman Kagan
@ 2022-06-21 11:55           ` Markus Armbruster
  2022-06-21 12:02             ` Roman Kagan
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2022-06-21 11:55 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

Roman Kagan <rvkagan@yandex-team.ru> writes:

> On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
>> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
>> > Roman Kagan <rvkagan@yandex-team.ru> writes:
>> > 
>> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
>> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>> > >> 
>> > >> > This event represents device runtime errors to give time and
>> > >> > reason why device is broken.
>> > >> 
>> > >> Can you give an or more examples of the "device runtime errors" you have
>> > >> in mind?
>> > >
>> > > Initially we wanted to address a situation when a vhost device
>> > > discovered an inconsistency during virtqueue processing and silently
>> > > stopped the virtqueue.  This resulted in device stall (partial for
>> > > multiqueue devices) and we were the last to notice that.
>> > >
>> > > The solution appeared to be to employ errfd and, upon receiving a
>> > > notification through it, to emit a QMP event which is actionable in the
>> > > management layer or further up the stack.
>> > >
>> > > Then we observed that virtio (non-vhost) devices suffer from the same
>> > > issue: they only log the error but don't signal it to the management
>> > > layer.  The case was very similar so we thought it would make sense to
>> > > share the infrastructure and the QMP event between virtio and vhost.
>> > >
>> > > Then Konstantin went a bit further and generalized the concept into
>> > > generic "device runtime error".  I'm personally not completely convinced
>> > > this generalization is appropriate here; we'd appreciate the opinions
>> > > from the community on the matter.
>> > 
>> > "Device emulation sending an even on entering certain error states, so
>> > that a management application can do something about it" feels
>> > reasonable enough to me as a general concept.
>> > 
>> > The key point is of course "can do something": the event needs to be
>> > actionable.  Can you describe possible actions for the cases you
>> > implement?
>> 
>> The first one that we had in mind was informational, like triggering an
>> alert in the monitoring system and/or painting the VM as malfunctioning
>> in the owner's UI.
>> 
>> There can be more advanced scenarios like autorecovery by resetting the
>> faulty VM, or fencing it if it's a cluster member.
>
> The discussion kind of stalled here.

My apologies...

>                                       Do you think the approach makes
> sense or not?  Should we try and resubmit the series with a proper cover
> letter and possibly other improvements or is it a dead end?

As QAPI schema maintainer, my concern is interface design.  To sell this
interface to me (so to speak), you have to show it's useful and
reasonably general.  Reasonably general, because we don't want to
accumulate one-offs, even if they have their uses.

I think this is mostly a matter of commit message(s) and documentation
here.  Explain your intended use cases.  Maybe hand-wave at other use
cases you can think of.  Document that you're implementing the event
only for the specific errors you need, but that it could be implemented
more widely as needed.  "Complete" feels impractical, though.

Makes sense?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event
  2022-06-21 11:55           ` Markus Armbruster
@ 2022-06-21 12:02             ` Roman Kagan
  0 siblings, 0 replies; 14+ messages in thread
From: Roman Kagan @ 2022-06-21 12:02 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Konstantin Khlebnikov, qemu-devel, yc-core, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake

On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvkagan@yandex-team.ru> writes:
> 
> > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> >> > Roman Kagan <rvkagan@yandex-team.ru> writes:
> >> > 
> >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> >> > >> 
> >> > >> > This event represents device runtime errors to give time and
> >> > >> > reason why device is broken.
> >> > >> 
> >> > >> Can you give an or more examples of the "device runtime errors" you have
> >> > >> in mind?
> >> > >
> >> > > Initially we wanted to address a situation when a vhost device
> >> > > discovered an inconsistency during virtqueue processing and silently
> >> > > stopped the virtqueue.  This resulted in device stall (partial for
> >> > > multiqueue devices) and we were the last to notice that.
> >> > >
> >> > > The solution appeared to be to employ errfd and, upon receiving a
> >> > > notification through it, to emit a QMP event which is actionable in the
> >> > > management layer or further up the stack.
> >> > >
> >> > > Then we observed that virtio (non-vhost) devices suffer from the same
> >> > > issue: they only log the error but don't signal it to the management
> >> > > layer.  The case was very similar so we thought it would make sense to
> >> > > share the infrastructure and the QMP event between virtio and vhost.
> >> > >
> >> > > Then Konstantin went a bit further and generalized the concept into
> >> > > generic "device runtime error".  I'm personally not completely convinced
> >> > > this generalization is appropriate here; we'd appreciate the opinions
> >> > > from the community on the matter.
> >> > 
> >> > "Device emulation sending an even on entering certain error states, so
> >> > that a management application can do something about it" feels
> >> > reasonable enough to me as a general concept.
> >> > 
> >> > The key point is of course "can do something": the event needs to be
> >> > actionable.  Can you describe possible actions for the cases you
> >> > implement?
> >> 
> >> The first one that we had in mind was informational, like triggering an
> >> alert in the monitoring system and/or painting the VM as malfunctioning
> >> in the owner's UI.
> >> 
> >> There can be more advanced scenarios like autorecovery by resetting the
> >> faulty VM, or fencing it if it's a cluster member.
> >
> > The discussion kind of stalled here.
> 
> My apologies...
> 
> >                                       Do you think the approach makes
> > sense or not?  Should we try and resubmit the series with a proper cover
> > letter and possibly other improvements or is it a dead end?
> 
> As QAPI schema maintainer, my concern is interface design.  To sell this
> interface to me (so to speak), you have to show it's useful and
> reasonably general.  Reasonably general, because we don't want to
> accumulate one-offs, even if they have their uses.
> 
> I think this is mostly a matter of commit message(s) and documentation
> here.  Explain your intended use cases.  Maybe hand-wave at other use
> cases you can think of.  Document that you're implementing the event
> only for the specific errors you need, but that it could be implemented
> more widely as needed.  "Complete" feels impractical, though.
> 
> Makes sense?

Absolutely.  We'll rework and resubmit the series addressing the issues
you've noted, and we'll see how it goes.

Thanks,
Roman.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-06-21 12:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19 14:19 [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Konstantin Khlebnikov
2022-05-19 14:19 ` [PATCH 2/4] virtio: forward errors into qdev_report_runtime_error() Konstantin Khlebnikov
2022-05-24 19:25   ` Vladimir Sementsov-Ogievskiy
2022-05-19 14:19 ` [PATCH 3/4] vhost: add method vhost_set_vring_err Konstantin Khlebnikov
2022-05-19 14:19 ` [PATCH 4/4] vhost: forward vring errors into virtio device Konstantin Khlebnikov
2022-05-24 19:04 ` [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event Vladimir Sementsov-Ogievskiy
2022-05-25  8:26   ` Konstantin Khlebnikov
2022-05-25 10:54 ` Markus Armbruster
2022-05-27 12:49   ` Roman Kagan
2022-05-30 11:28     ` Markus Armbruster
2022-05-30 15:04       ` Roman Kagan
2022-06-20 13:49         ` Roman Kagan
2022-06-21 11:55           ` Markus Armbruster
2022-06-21 12:02             ` Roman Kagan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.