QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
@ 2021-03-12 20:07 Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event Daniel Henrique Barboza
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].

Patches 1 and 3 are independent of the ppc patches and can be applied
separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
are dependent on the QAPI patches.


[1] https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg01900.html

Daniel Henrique Barboza (4):
  qapi/qdev.json: add DEVICE_NOT_DELETED event
  spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout
  qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event
  spapr.c: use DEVICE_UNPLUG_ERROR event in
    spapr_memory_unplug_rollback()

 hw/ppc/spapr.c     |  2 +-
 hw/ppc/spapr_drc.c |  8 ++++++++
 qapi/machine.json  | 23 +++++++++++++++++++++++
 qapi/qdev.json     | 28 ++++++++++++++++++++++++++++
 4 files changed, 60 insertions(+), 1 deletion(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
@ 2021-03-12 20:07 ` Daniel Henrique Barboza
  2021-03-23 18:00   ` Eric Blake
  2021-03-12 20:07 ` [PATCH 2/4] spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout Daniel Henrique Barboza
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, Daniel Henrique Barboza, qemu-ppc, groug, david

This new event informs QAPI listeners that a previously issued
'device_del' command failed to delete the device from the machine.

Note that no assertion can be made about the failure reason. The goal of
this event is to inform management that QEMU is not able to assess
whether the hotunplug is taking too long to complete or failed in the
guest and, as result, the device is not removed from QOM. When receiving
this event, users/management must check inside the guest to verify the
result of the hotunplug operation.

This scenario happens with architectures where the guest does not have
an official way to report the hotunplug error back to the hypervisor,
such as PowerPC and the pseries machine type.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 qapi/qdev.json | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/qapi/qdev.json b/qapi/qdev.json
index b83178220b..df9a1b9e67 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -124,3 +124,31 @@
 ##
 { 'event': 'DEVICE_DELETED',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @DEVICE_NOT_DELETED:
+#
+# Emitted whenever the device removal process expired and the device
+# still exists in QOM. This indicates that the guest took too long
+# to acknowlege the device removal, and we can not be sure of whether
+# the process will be completed in the guest later on or a guest
+# side error occurred.
+#
+# It is not safe to reuse the specified device ID.
+#
+# @device: device name
+#
+# @path: device path
+#
+# Since: 6.0
+#
+# Example:
+#
+# <- { "event": "DEVICE_NOT_DELETED",
+#      "data": { "device": "core1",
+#                "path": "/machine/peripheral/core1" },
+#      "timestamp": { "seconds": 1615570254, "microseconds": 573986 } }
+#
+##
+{ 'event': 'DEVICE_NOT_DELETED',
+  'data': { '*device': 'str', 'path': 'str' } }
\ No newline at end of file
-- 
2.29.2



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 2/4] spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event Daniel Henrique Barboza
@ 2021-03-12 20:07 ` Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 3/4] qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, Daniel Henrique Barboza, qemu-ppc, groug, david

The newly added DEVICE_NOT_DELETED QAPI event is adequate to be
sent when the hotunplug timeout expires, letting users know that
something happened inside the guest and the pseries machine didn't
delete the device from QOM.

After this patch, if an user try to hotunplug the last online CPU of
the guest, the "DEIVCE_NOT_DELETED" event will be issued when the
hotunplug timeout for 'core1' is expired:

{"execute": "device_del", "arguments": {"id": "core1"} }
{"return": {}}

{"execute": "device_del", "arguments": {"id": "core1"} }
{"error": {"class": "GenericError",
           "desc": "core-id 1 unplug is still pending, 12 seconds timeout remaining"}}

{"execute": "device_del", "arguments": {"id": "core1"} }
{"error": {"class": "GenericError",
           "desc": "core-id 1 unplug is still pending, 5 seconds timeout remaining"}}

{"timestamp": {"seconds": 1615570254, "microseconds": 573986},
  "event": "DEVICE_NOT_DELETED",
  "data": {"device": "core1", "path": "/machine/peripheral/core1"}}

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_drc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
index 8a71b03800..14f39cec71 100644
--- a/hw/ppc/spapr_drc.c
+++ b/hw/ppc/spapr_drc.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qnull.h"
+#include "qapi/qapi-events-qdev.h"
 #include "cpu.h"
 #include "qemu/cutils.h"
 #include "hw/ppc/spapr_drc.h"
@@ -529,9 +530,16 @@ static const VMStateDescription vmstate_spapr_drc = {
 static void drc_unplug_timeout_cb(void *opaque)
 {
     SpaprDrc *drc = opaque;
+    DeviceState *dev = drc->dev;
 
     if (drc->unplug_requested) {
         drc->unplug_requested = false;
+
+        if (dev) {
+            qapi_event_send_device_not_deleted(!!dev->id,
+                                               dev->id,
+                                               dev->canonical_path);
+        }
     }
 }
 
-- 
2.29.2



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 3/4] qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 2/4] spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout Daniel Henrique Barboza
@ 2021-03-12 20:07 ` Daniel Henrique Barboza
  2021-03-12 20:07 ` [PATCH 4/4] spapr.c: use DEVICE_UNPLUG_ERROR event in spapr_memory_unplug_rollback() Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, Daniel Henrique Barboza, qemu-ppc, groug, david

At this moment we only provide one event to report a hotunplug error,
MEM_UNPLUG_ERROR. There will be other device types that are going to be
throwing unplug errors in the future though.

Instead of creating a (device_type)_UNPLUG_ERROR for each new device,
create a generic DEVICE_UNPLUG_ERROR event that can be used by all
unplug errors in the future.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 qapi/machine.json | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/qapi/machine.json b/qapi/machine.json
index 330189efe3..9b2c93aad3 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1471,3 +1471,26 @@
 ##
 { 'event': 'MEM_UNPLUG_ERROR',
   'data': { 'device': 'str', 'msg': 'str' } }
+
+##
+# @DEVICE_UNPLUG_ERROR:
+#
+# Emitted when a device hot unplug error occurs.
+#
+# @device: device name
+#
+# @msg: Informative message
+#
+# Since: 6.0
+#
+# Example:
+#
+# <- { "event": "DEVICE_UNPLUG_ERROR"
+#      "data": { "device": "dimm1",
+#                "msg": "Memory hotunplug rejected by the guest for device dimm1"
+#      },
+#      "timestamp": { "seconds": 1615570772, "microseconds": 202844 } }
+#
+##
+{ 'event': 'DEVICE_UNPLUG_ERROR',
+  'data': { 'device': 'str', 'msg': 'str' } }
\ No newline at end of file
-- 
2.29.2



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 4/4] spapr.c: use DEVICE_UNPLUG_ERROR event in spapr_memory_unplug_rollback()
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-03-12 20:07 ` [PATCH 3/4] qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event Daniel Henrique Barboza
@ 2021-03-12 20:07 ` Daniel Henrique Barboza
  2021-03-23  1:12 ` [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events David Gibson
  2021-04-20 17:11 ` Daniel Henrique Barboza
  5 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, Daniel Henrique Barboza, qemu-ppc, groug, david

Other device types in the pseries machine will use DEVICE_UNPLUG_ERROR
to report hotunplug errors. Use it to report the memory hotunplug error
in spapr_memory_unplug_rollback() as well.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..697664e72f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3624,7 +3624,7 @@ void spapr_memory_unplug_rollback(SpaprMachineState *spapr, DeviceState *dev)
      */
     qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
                                  "for device %s", dev->id);
-    qapi_event_send_mem_unplug_error(dev->id, qapi_error);
+    qapi_event_send_device_unplug_error(dev->id, qapi_error);
 }
 
 /* Callback to be called during DRC release. */
-- 
2.29.2



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2021-03-12 20:07 ` [PATCH 4/4] spapr.c: use DEVICE_UNPLUG_ERROR event in spapr_memory_unplug_rollback() Daniel Henrique Barboza
@ 2021-03-23  1:12 ` David Gibson
  2021-03-23 17:10   ` Daniel Henrique Barboza
  2021-04-20 17:11 ` Daniel Henrique Barboza
  5 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2021-03-23  1:12 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: armbru, qemu-ppc, qemu-devel, groug


[-- Attachment #1: Type: text/plain, Size: 1056 bytes --]

On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
> Hi,
> 
> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> 
> Patches 1 and 3 are independent of the ppc patches and can be applied
> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> are dependent on the QAPI patches.

Implementation looks fine, but I think there's a bit more to discuss
before we can apply.

I think it would make sense to re-order this and put UNPLUG_ERROR
first.  Its semantics are clearer, and I think there's a stronger case
for it.

I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
it really tell the user/management anything useful beyond what
receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-23  1:12 ` [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events David Gibson
@ 2021-03-23 17:10   ` Daniel Henrique Barboza
  2021-03-24  1:40     ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-23 17:10 UTC (permalink / raw)
  To: David Gibson; +Cc: armbru, qemu-ppc, qemu-devel, groug



On 3/22/21 10:12 PM, David Gibson wrote:
> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
>> Hi,
>>
>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
>>
>> Patches 1 and 3 are independent of the ppc patches and can be applied
>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
>> are dependent on the QAPI patches.
> 
> Implementation looks fine, but I think there's a bit more to discuss
> before we can apply.
> 
> I think it would make sense to re-order this and put UNPLUG_ERROR
> first.  Its semantics are clearer, and I think there's a stronger case
> for it.

Alright

> 
> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> it really tell the user/management anything useful beyond what
> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?


It informs that the hotunplug operation exceed the timeout that QEMU
internals considers adequate, but QEMU can't assert that it was caused
by an error or an unexpected delay. The end result is that the device
is not going to be deleted from QMP, so DEVICE_NOT_DELETED.

Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
event.



Thanks,


DHB


> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event
  2021-03-12 20:07 ` [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event Daniel Henrique Barboza
@ 2021-03-23 18:00   ` Eric Blake
  2021-03-23 18:12     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Blake @ 2021-03-23 18:00 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel; +Cc: david, qemu-ppc, armbru, groug

On 3/12/21 2:07 PM, Daniel Henrique Barboza wrote:
> This new event informs QAPI listeners that a previously issued
> 'device_del' command failed to delete the device from the machine.
> 
> Note that no assertion can be made about the failure reason. The goal of
> this event is to inform management that QEMU is not able to assess
> whether the hotunplug is taking too long to complete or failed in the
> guest and, as result, the device is not removed from QOM. When receiving
> this event, users/management must check inside the guest to verify the
> result of the hotunplug operation.
> 
> This scenario happens with architectures where the guest does not have
> an official way to report the hotunplug error back to the hypervisor,
> such as PowerPC and the pseries machine type.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  qapi/qdev.json | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index b83178220b..df9a1b9e67 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -124,3 +124,31 @@
>  ##
>  { 'event': 'DEVICE_DELETED',
>    'data': { '*device': 'str', 'path': 'str' } }
> +
> +##
> +# @DEVICE_NOT_DELETED:
> +#
> +# Emitted whenever the device removal process expired and the device
> +# still exists in QOM. This indicates that the guest took too long
> +# to acknowlege the device removal, and we can not be sure of whether

acknowledge

> +# the process will be completed in the guest later on or a guest
> +# side error occurred.
> +#
> +# It is not safe to reuse the specified device ID.
> +#
> +# @device: device name
> +#
> +# @path: device path
> +#
> +# Since: 6.0

This is a new event, and we've missed feature freeze; is this fixing a
bug that was not present in 5.2 (in which case it is fine for -rc1), or
is this a long-standing problem where one more release without the
mechanism won't make life any worse?

> +#
> +# Example:
> +#
> +# <- { "event": "DEVICE_NOT_DELETED",
> +#      "data": { "device": "core1",
> +#                "path": "/machine/peripheral/core1" },
> +#      "timestamp": { "seconds": 1615570254, "microseconds": 573986 } }
> +#
> +##
> +{ 'event': 'DEVICE_NOT_DELETED',
> +  'data': { '*device': 'str', 'path': 'str' } }
> \ No newline at end of file

You'll want to fix that.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event
  2021-03-23 18:00   ` Eric Blake
@ 2021-03-23 18:12     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-23 18:12 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david, qemu-ppc, armbru, groug



On 3/23/21 3:00 PM, Eric Blake wrote:
> On 3/12/21 2:07 PM, Daniel Henrique Barboza wrote:
>> This new event informs QAPI listeners that a previously issued
>> 'device_del' command failed to delete the device from the machine.
>>
>> Note that no assertion can be made about the failure reason. The goal of
>> this event is to inform management that QEMU is not able to assess
>> whether the hotunplug is taking too long to complete or failed in the
>> guest and, as result, the device is not removed from QOM. When receiving
>> this event, users/management must check inside the guest to verify the
>> result of the hotunplug operation.
>>
>> This scenario happens with architectures where the guest does not have
>> an official way to report the hotunplug error back to the hypervisor,
>> such as PowerPC and the pseries machine type.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   qapi/qdev.json | 28 ++++++++++++++++++++++++++++
>>   1 file changed, 28 insertions(+)
>>
>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>> index b83178220b..df9a1b9e67 100644
>> --- a/qapi/qdev.json
>> +++ b/qapi/qdev.json
>> @@ -124,3 +124,31 @@
>>   ##
>>   { 'event': 'DEVICE_DELETED',
>>     'data': { '*device': 'str', 'path': 'str' } }
>> +
>> +##
>> +# @DEVICE_NOT_DELETED:
>> +#
>> +# Emitted whenever the device removal process expired and the device
>> +# still exists in QOM. This indicates that the guest took too long
>> +# to acknowlege the device removal, and we can not be sure of whether
> 
> acknowledge
> 
>> +# the process will be completed in the guest later on or a guest
>> +# side error occurred.
>> +#
>> +# It is not safe to reuse the specified device ID.
>> +#
>> +# @device: device name
>> +#
>> +# @path: device path
>> +#
>> +# Since: 6.0
> 
> This is a new event, and we've missed feature freeze; is this fixing a
> bug that was not present in 5.2 (in which case it is fine for -rc1), or
> is this a long-standing problem where one more release without the
> mechanism won't make life any worse?

I believe these events I'm trying to add can be postponed to the next release.



> 
>> +#
>> +# Example:
>> +#
>> +# <- { "event": "DEVICE_NOT_DELETED",
>> +#      "data": { "device": "core1",
>> +#                "path": "/machine/peripheral/core1" },
>> +#      "timestamp": { "seconds": 1615570254, "microseconds": 573986 } }
>> +#
>> +##
>> +{ 'event': 'DEVICE_NOT_DELETED',
>> +  'data': { '*device': 'str', 'path': 'str' } }
>> \ No newline at end of file
> 
> You'll want to fix that.


Yep, in both patches (1 and 3).


Thanks,


DHB


> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-23 17:10   ` Daniel Henrique Barboza
@ 2021-03-24  1:40     ` David Gibson
  2021-03-24 19:09       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2021-03-24  1:40 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: armbru, qemu-ppc, qemu-devel, groug


[-- Attachment #1: Type: text/plain, Size: 2629 bytes --]

On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/22/21 10:12 PM, David Gibson wrote:
> > On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
> > > Hi,
> > > 
> > > This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > > DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > > 
> > > Patches 1 and 3 are independent of the ppc patches and can be applied
> > > separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > > are dependent on the QAPI patches.
> > 
> > Implementation looks fine, but I think there's a bit more to discuss
> > before we can apply.
> > 
> > I think it would make sense to re-order this and put UNPLUG_ERROR
> > first.  Its semantics are clearer, and I think there's a stronger case
> > for it.
> 
> Alright
> 
> > 
> > I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > it really tell the user/management anything useful beyond what
> > receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
> 
> 
> It informs that the hotunplug operation exceed the timeout that QEMU
> internals considers adequate, but QEMU can't assert that it was caused
> by an error or an unexpected delay. The end result is that the device
> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.

Is it, though?  I mean, it is with this implementation for papr:
because we clear the unplug_requested flag, even if the guest later
tries to complete the unplug, it will fail.

But if I understand what Markus was saying correctly, that might not
be possible for all hotplug systems.  I believe Markus was suggesting
that DEVICE_NOT_DELETED could just mean that we haven't deleted the
device yet, but it could still happen later.

And in that case, I'm not yet sold on the value of a message that
essentially just means "Ayup, still dunno what's happening, sorry".

> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> event.

Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
be "guest rejected hotplug", or something more specific, in the rare
case that the guest has a way of signalling something more specific,
or "timeout" - but the later *only* to be sent in cases where on the
timeout we're able to block any later completion of the unplug (as we
can on papr).

Thoughs, Markus?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-24  1:40     ` David Gibson
@ 2021-03-24 19:09       ` Daniel Henrique Barboza
  2021-03-25  1:32         ` David Gibson
  2021-03-29 23:28         ` Igor Mammedov
  0 siblings, 2 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-24 19:09 UTC (permalink / raw)
  To: David Gibson; +Cc: armbru, qemu-ppc, qemu-devel, groug



On 3/23/21 10:40 PM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/22/21 10:12 PM, David Gibson wrote:
>>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
>>>> Hi,
>>>>
>>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
>>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
>>>>
>>>> Patches 1 and 3 are independent of the ppc patches and can be applied
>>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
>>>> are dependent on the QAPI patches.
>>>
>>> Implementation looks fine, but I think there's a bit more to discuss
>>> before we can apply.
>>>
>>> I think it would make sense to re-order this and put UNPLUG_ERROR
>>> first.  Its semantics are clearer, and I think there's a stronger case
>>> for it.
>>
>> Alright
>>
>>>
>>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
>>> it really tell the user/management anything useful beyond what
>>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
>>
>>
>> It informs that the hotunplug operation exceed the timeout that QEMU
>> internals considers adequate, but QEMU can't assert that it was caused
>> by an error or an unexpected delay. The end result is that the device
>> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.
> 
> Is it, though?  I mean, it is with this implementation for papr:
> because we clear the unplug_requested flag, even if the guest later
> tries to complete the unplug, it will fail.
> 
> But if I understand what Markus was saying correctly, that might not
> be possible for all hotplug systems.  I believe Markus was suggesting
> that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> device yet, but it could still happen later.
> 
> And in that case, I'm not yet sold on the value of a message that
> essentially just means "Ayup, still dunno what's happening, sorry".
> 
>> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
>> event.
> 
> Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> be "guest rejected hotplug", or something more specific, in the rare
> case that the guest has a way of signalling something more specific,
> or "timeout" - but the later *only* to be sent in cases where on the
> timeout we're able to block any later completion of the unplug (as we
> can on papr).

I believe that's already covered by the existing API:


+# @DEVICE_UNPLUG_ERROR:
+#
+# Emitted when a device hot unplug error occurs.
+#
+# @device: device name
+#
+# @msg: Informative message

The 'informative message' would be the reason the event occurred. In patch
4/4, for the memory hotunplug refused by the guest, it is being set as:

      qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
                                   "for device %s", dev->id);
      qapi_event_send_device_unplug_error(dev->id, qapi_error);



We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
case (currently on patch 2/4) by just changing 'msg', e.g.:


      qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
      qapi_event_send_device_unplug_error(dev->id, qapi_error);


Thanks,

DHB


> 
> Thoughs, Markus?
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-24 19:09       ` Daniel Henrique Barboza
@ 2021-03-25  1:32         ` David Gibson
  2021-03-29 23:28         ` Igor Mammedov
  1 sibling, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-03-25  1:32 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: armbru, qemu-ppc, qemu-devel, groug


[-- Attachment #1: Type: text/plain, Size: 3994 bytes --]

On Wed, Mar 24, 2021 at 04:09:59PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/23/21 10:40 PM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
> > > 
> > > 
> > > On 3/22/21 10:12 PM, David Gibson wrote:
> > > > On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
> > > > > Hi,
> > > > > 
> > > > > This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > > > > DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > > > > 
> > > > > Patches 1 and 3 are independent of the ppc patches and can be applied
> > > > > separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > > > > are dependent on the QAPI patches.
> > > > 
> > > > Implementation looks fine, but I think there's a bit more to discuss
> > > > before we can apply.
> > > > 
> > > > I think it would make sense to re-order this and put UNPLUG_ERROR
> > > > first.  Its semantics are clearer, and I think there's a stronger case
> > > > for it.
> > > 
> > > Alright
> > > 
> > > > 
> > > > I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > > > it really tell the user/management anything useful beyond what
> > > > receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
> > > 
> > > 
> > > It informs that the hotunplug operation exceed the timeout that QEMU
> > > internals considers adequate, but QEMU can't assert that it was caused
> > > by an error or an unexpected delay. The end result is that the device
> > > is not going to be deleted from QMP, so DEVICE_NOT_DELETED.
> > 
> > Is it, though?  I mean, it is with this implementation for papr:
> > because we clear the unplug_requested flag, even if the guest later
> > tries to complete the unplug, it will fail.
> > 
> > But if I understand what Markus was saying correctly, that might not
> > be possible for all hotplug systems.  I believe Markus was suggesting
> > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > device yet, but it could still happen later.
> > 
> > And in that case, I'm not yet sold on the value of a message that
> > essentially just means "Ayup, still dunno what's happening, sorry".
> > 
> > > Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> > > event.
> > 
> > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > be "guest rejected hotplug", or something more specific, in the rare
> > case that the guest has a way of signalling something more specific,
> > or "timeout" - but the later *only* to be sent in cases where on the
> > timeout we're able to block any later completion of the unplug (as we
> > can on papr).
> 
> I believe that's already covered by the existing API:
> 
> 
> +# @DEVICE_UNPLUG_ERROR:
> +#
> +# Emitted when a device hot unplug error occurs.
> +#
> +# @device: device name
> +#
> +# @msg: Informative message

Oh, sorry, I missed that

> The 'informative message' would be the reason the event occurred. In patch
> 4/4, for the memory hotunplug refused by the guest, it is being set as:
> 
>      qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
>                                   "for device %s", dev->id);
>      qapi_event_send_device_unplug_error(dev->id, qapi_error);
> 
> 
> 
> We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
> case (currently on patch 2/4) by just changing 'msg', e.g.:
> 
> 
>      qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
>      qapi_event_send_device_unplug_error(dev->id, qapi_error);

I think that makes sense for the cases on papr.  Less sure about the
cases Markus has mentioned.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-24 19:09       ` Daniel Henrique Barboza
  2021-03-25  1:32         ` David Gibson
@ 2021-03-29 23:28         ` Igor Mammedov
  2021-03-30 23:46           ` David Gibson
  2021-03-31 19:40           ` Daniel Henrique Barboza
  1 sibling, 2 replies; 20+ messages in thread
From: Igor Mammedov @ 2021-03-29 23:28 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-devel, groug, qemu-ppc, armbru, David Gibson

On Wed, 24 Mar 2021 16:09:59 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> On 3/23/21 10:40 PM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:  
> >>
> >>
> >> On 3/22/21 10:12 PM, David Gibson wrote:  
> >>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:  
> >>>> Hi,
> >>>>
> >>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> >>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> >>>>
> >>>> Patches 1 and 3 are independent of the ppc patches and can be applied
> >>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> >>>> are dependent on the QAPI patches.  
> >>>
> >>> Implementation looks fine, but I think there's a bit more to discuss
> >>> before we can apply.
> >>>
> >>> I think it would make sense to re-order this and put UNPLUG_ERROR
> >>> first.  Its semantics are clearer, and I think there's a stronger case
> >>> for it.  
> >>
> >> Alright
> >>  
> >>>
> >>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> >>> it really tell the user/management anything useful beyond what
> >>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?  
> >>
> >>
> >> It informs that the hotunplug operation exceed the timeout that QEMU
> >> internals considers adequate, but QEMU can't assert that it was caused
> >> by an error or an unexpected delay. The end result is that the device
> >> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.  
> > 
> > Is it, though?  I mean, it is with this implementation for papr:
> > because we clear the unplug_requested flag, even if the guest later
> > tries to complete the unplug, it will fail.
> > 
> > But if I understand what Markus was saying correctly, that might not
> > be possible for all hotplug systems.  I believe Markus was suggesting
> > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > device yet, but it could still happen later.
> > 
> > And in that case, I'm not yet sold on the value of a message that
> > essentially just means "Ayup, still dunno what's happening, sorry".
> >   
> >> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> >> event.  
> > 
> > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > be "guest rejected hotplug", or something more specific, in the rare
> > case that the guest has a way of signalling something more specific,
> > or "timeout" - but the later *only* to be sent in cases where on the
> > timeout we're able to block any later completion of the unplug (as we
> > can on papr).

Is canceling unplug on timeout documented somewhere (like some spec)?

If not it might (theoretically) confuse guest when it tries to unplug
after timeout and leave guest in some unexpected state. 

> 
> I believe that's already covered by the existing API:
> 
> 
> +# @DEVICE_UNPLUG_ERROR:
> +#
> +# Emitted when a device hot unplug error occurs.
> +#
> +# @device: device name
> +#
> +# @msg: Informative message
> 
> The 'informative message' would be the reason the event occurred. In patch
> 4/4, for the memory hotunplug refused by the guest, it is being set as:
> 
>       qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
>                                    "for device %s", dev->id);
>       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> 
> 
> 
> We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
> case (currently on patch 2/4) by just changing 'msg', e.g.:
> 
> 
>       qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
>       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> 

lets make everything support ACPI (just kidding).

maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
which sort of does the same thing (and more) but instead of strings uses status codes
defined by spec.

Idea similar to DEVICE_UNPLUG_ERROR was considered back then, but instead of QEMU being
a poor translator of status codes to non machine-readable strings we went with
exposing well documented status codes to user. This way user can implement
specific reactions to particular errors just looking at JSON + spec.

> Thanks,
> 
> DHB
> 
> 
> > 
> > Thoughs, Markus?
> >   
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-29 23:28         ` Igor Mammedov
@ 2021-03-30 23:46           ` David Gibson
  2021-03-31  9:49             ` Igor Mammedov
  2021-03-31 19:47             ` Daniel Henrique Barboza
  2021-03-31 19:40           ` Daniel Henrique Barboza
  1 sibling, 2 replies; 20+ messages in thread
From: David Gibson @ 2021-03-30 23:46 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: groug, Daniel Henrique Barboza, qemu-ppc, armbru, qemu-devel


[-- Attachment #1: Type: text/plain, Size: 5922 bytes --]

On Tue, Mar 30, 2021 at 01:28:31AM +0200, Igor Mammedov wrote:
> On Wed, 24 Mar 2021 16:09:59 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
> > On 3/23/21 10:40 PM, David Gibson wrote:
> > > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:  
> > >>
> > >>
> > >> On 3/22/21 10:12 PM, David Gibson wrote:  
> > >>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:  
> > >>>> Hi,
> > >>>>
> > >>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > >>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > >>>>
> > >>>> Patches 1 and 3 are independent of the ppc patches and can be applied
> > >>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > >>>> are dependent on the QAPI patches.  
> > >>>
> > >>> Implementation looks fine, but I think there's a bit more to discuss
> > >>> before we can apply.
> > >>>
> > >>> I think it would make sense to re-order this and put UNPLUG_ERROR
> > >>> first.  Its semantics are clearer, and I think there's a stronger case
> > >>> for it.  
> > >>
> > >> Alright
> > >>  
> > >>>
> > >>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > >>> it really tell the user/management anything useful beyond what
> > >>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?  
> > >>
> > >>
> > >> It informs that the hotunplug operation exceed the timeout that QEMU
> > >> internals considers adequate, but QEMU can't assert that it was caused
> > >> by an error or an unexpected delay. The end result is that the device
> > >> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.  
> > > 
> > > Is it, though?  I mean, it is with this implementation for papr:
> > > because we clear the unplug_requested flag, even if the guest later
> > > tries to complete the unplug, it will fail.
> > > 
> > > But if I understand what Markus was saying correctly, that might not
> > > be possible for all hotplug systems.  I believe Markus was suggesting
> > > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > > device yet, but it could still happen later.
> > > 
> > > And in that case, I'm not yet sold on the value of a message that
> > > essentially just means "Ayup, still dunno what's happening, sorry".
> > >   
> > >> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> > >> event.  
> > > 
> > > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > > be "guest rejected hotplug", or something more specific, in the rare
> > > case that the guest has a way of signalling something more specific,
> > > or "timeout" - but the later *only* to be sent in cases where on the
> > > timeout we're able to block any later completion of the unplug (as we
> > > can on papr).
> 
> Is canceling unplug on timeout documented somewhere (like some spec)?

Uh.. not as such.  In the PAPR model, hotplugs and unplugs are mostly
guest directed, so the question doesn't really arise.

> If not it might (theoretically) confuse guest when it tries to unplug
> after timeout and leave guest in some unexpected state.

Possible, but probably not that likely.  The mechanism we use to
"cancel" the hotplugs is that we just fail the hypercalls that the
guest will need to call to actually complete the hotplug.  We also
fail those in some other situations, and that seems to work.

That said, I no longer think this cancelling on timeout is a good
idea, since it mismatches what happens on other platforms more than I
think we need to.

My now preferred approach is to revert the timeout changes, but
instead allow retries of unplugs to be issued.  I think that's just a
matter of resending the unplug message to the guest, while making it
otherwise a no-op on the qemu side.

> > I believe that's already covered by the existing API:
> > 
> > 
> > +# @DEVICE_UNPLUG_ERROR:
> > +#
> > +# Emitted when a device hot unplug error occurs.
> > +#
> > +# @device: device name
> > +#
> > +# @msg: Informative message
> > 
> > The 'informative message' would be the reason the event occurred. In patch
> > 4/4, for the memory hotunplug refused by the guest, it is being set as:
> > 
> >       qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
> >                                    "for device %s", dev->id);
> >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > 
> > 
> > 
> > We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
> > case (currently on patch 2/4) by just changing 'msg', e.g.:
> > 
> > 
> >       qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
> >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > 
> 
> lets make everything support ACPI (just kidding).

Heh.  If nothing else, doesn't help us with existing guests.

> maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
> which sort of does the same thing (and more) but instead of strings uses status codes
> defined by spec.

Hmm.  I'm a bit dubious about issuing ACPI messages for a non ACPI
guest, but maybe that could work.

> Idea similar to DEVICE_UNPLUG_ERROR was considered back then, but instead of QEMU being
> a poor translator of status codes to non machine-readable strings we went with
> exposing well documented status codes to user. This way user can implement
> specific reactions to particular errors just looking at JSON + spec.
> 
> > Thanks,
> > 
> > DHB
> > 
> > 
> > > 
> > > Thoughs, Markus?
> > >   
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-30 23:46           ` David Gibson
@ 2021-03-31  9:49             ` Igor Mammedov
  2021-04-01  1:31               ` David Gibson
  2021-03-31 19:47             ` Daniel Henrique Barboza
  1 sibling, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2021-03-31  9:49 UTC (permalink / raw)
  To: David Gibson
  Cc: Daniel P. Berrangé,
	qemu-devel, Daniel Henrique Barboza, groug, armbru, qemu-ppc

On Wed, 31 Mar 2021 10:46:49 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Mar 30, 2021 at 01:28:31AM +0200, Igor Mammedov wrote:
> > On Wed, 24 Mar 2021 16:09:59 -0300
> > Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> >   
> > > On 3/23/21 10:40 PM, David Gibson wrote:  
> > > > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:    
> > > >>
> > > >>
> > > >> On 3/22/21 10:12 PM, David Gibson wrote:    
> > > >>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:    
> > > >>>> Hi,
> > > >>>>
> > > >>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > > >>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > > >>>>
> > > >>>> Patches 1 and 3 are independent of the ppc patches and can be applied
> > > >>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > > >>>> are dependent on the QAPI patches.    
> > > >>>
> > > >>> Implementation looks fine, but I think there's a bit more to discuss
> > > >>> before we can apply.
> > > >>>
> > > >>> I think it would make sense to re-order this and put UNPLUG_ERROR
> > > >>> first.  Its semantics are clearer, and I think there's a stronger case
> > > >>> for it.    
> > > >>
> > > >> Alright
> > > >>    
> > > >>>
> > > >>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > > >>> it really tell the user/management anything useful beyond what
> > > >>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?    
> > > >>
> > > >>
> > > >> It informs that the hotunplug operation exceed the timeout that QEMU
> > > >> internals considers adequate, but QEMU can't assert that it was caused
> > > >> by an error or an unexpected delay. The end result is that the device
> > > >> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.    
> > > > 
> > > > Is it, though?  I mean, it is with this implementation for papr:
> > > > because we clear the unplug_requested flag, even if the guest later
> > > > tries to complete the unplug, it will fail.
> > > > 
> > > > But if I understand what Markus was saying correctly, that might not
> > > > be possible for all hotplug systems.  I believe Markus was suggesting
> > > > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > > > device yet, but it could still happen later.
> > > > 
> > > > And in that case, I'm not yet sold on the value of a message that
> > > > essentially just means "Ayup, still dunno what's happening, sorry".
> > > >     
> > > >> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> > > >> event.    
> > > > 
> > > > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > > > be "guest rejected hotplug", or something more specific, in the rare
> > > > case that the guest has a way of signalling something more specific,
> > > > or "timeout" - but the later *only* to be sent in cases where on the
> > > > timeout we're able to block any later completion of the unplug (as we
> > > > can on papr).  
> > 
> > Is canceling unplug on timeout documented somewhere (like some spec)?  
> 
> Uh.. not as such.  In the PAPR model, hotplugs and unplugs are mostly
> guest directed, so the question doesn't really arise.
> 
> > If not it might (theoretically) confuse guest when it tries to unplug
> > after timeout and leave guest in some unexpected state.  
> 
> Possible, but probably not that likely.  The mechanism we use to
> "cancel" the hotplugs is that we just fail the hypercalls that the
> guest will need to call to actually complete the hotplug.  We also
> fail those in some other situations, and that seems to work.
> 
> That said, I no longer think this cancelling on timeout is a good
> idea, since it mismatches what happens on other platforms more than I
> think we need to.
> 
> My now preferred approach is to revert the timeout changes, but
> instead allow retries of unplugs to be issued.  I think that's just a
> matter of resending the unplug message to the guest, while making it
> otherwise a no-op on the qemu side.

Yep, all we need to do is notify QEMU user, so it knows that unplug
has failed. Then It can decide on it's own what to do with it and also when.

> > > I believe that's already covered by the existing API:
> > > 
> > > 
> > > +# @DEVICE_UNPLUG_ERROR:
> > > +#
> > > +# Emitted when a device hot unplug error occurs.
> > > +#
> > > +# @device: device name
> > > +#
> > > +# @msg: Informative message
> > > 
> > > The 'informative message' would be the reason the event occurred. In patch
> > > 4/4, for the memory hotunplug refused by the guest, it is being set as:
> > > 
> > >       qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
> > >                                    "for device %s", dev->id);
> > >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > > 
> > > 
> > > 
> > > We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
> > > case (currently on patch 2/4) by just changing 'msg', e.g.:
> > > 
> > > 
> > >       qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
> > >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > >   
> > 
> > lets make everything support ACPI (just kidding).  
> 
> Heh.  If nothing else, doesn't help us with existing guests.
> 
> > maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
> > which sort of does the same thing (and more) but instead of strings uses status codes
> > defined by spec.  
> 
> Hmm.  I'm a bit dubious about issuing ACPI messages for a non ACPI
> guest, but maybe that could work.

May be we can rename it to be ACPI agnostic (though I'm not sure how renaming
QAPI interfaces should be done (it might upset libvirt for example)).

(My point was that ACPI spec had already gone through all the trouble defining
status of completion and documenting it. Also libvirt supports this notification.
It looks like worthwhile thing to consider if can somehow reuse it outside of
x86 world)


> > Idea similar to DEVICE_UNPLUG_ERROR was considered back then, but instead of QEMU being
> > a poor translator of status codes to non machine-readable strings we went with
> > exposing well documented status codes to user. This way user can implement
> > specific reactions to particular errors just looking at JSON + spec.
> >   
> > > Thanks,
> > > 
> > > DHB
> > > 
> > >   
> > > > 
> > > > Thoughs, Markus?
> > > >     
> > >   
> >   
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-29 23:28         ` Igor Mammedov
  2021-03-30 23:46           ` David Gibson
@ 2021-03-31 19:40           ` Daniel Henrique Barboza
  1 sibling, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-31 19:40 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: qemu-devel, groug, qemu-ppc, armbru, David Gibson



On 3/29/21 8:28 PM, Igor Mammedov wrote:
> On Wed, 24 Mar 2021 16:09:59 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> On 3/23/21 10:40 PM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/22/21 10:12 PM, David Gibson wrote:
>>>>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
>>>>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
>>>>>>
>>>>>> Patches 1 and 3 are independent of the ppc patches and can be applied
>>>>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
>>>>>> are dependent on the QAPI patches.
>>>>>
>>>>> Implementation looks fine, but I think there's a bit more to discuss
>>>>> before we can apply.
>>>>>
>>>>> I think it would make sense to re-order this and put UNPLUG_ERROR
>>>>> first.  Its semantics are clearer, and I think there's a stronger case
>>>>> for it.
>>>>
>>>> Alright
>>>>   
>>>>>
>>>>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
>>>>> it really tell the user/management anything useful beyond what
>>>>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
>>>>
>>>>
>>>> It informs that the hotunplug operation exceed the timeout that QEMU
>>>> internals considers adequate, but QEMU can't assert that it was caused
>>>> by an error or an unexpected delay. The end result is that the device
>>>> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.
>>>
>>> Is it, though?  I mean, it is with this implementation for papr:
>>> because we clear the unplug_requested flag, even if the guest later
>>> tries to complete the unplug, it will fail.
>>>
>>> But if I understand what Markus was saying correctly, that might not
>>> be possible for all hotplug systems.  I believe Markus was suggesting
>>> that DEVICE_NOT_DELETED could just mean that we haven't deleted the
>>> device yet, but it could still happen later.
>>>
>>> And in that case, I'm not yet sold on the value of a message that
>>> essentially just means "Ayup, still dunno what's happening, sorry".
>>>    
>>>> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
>>>> event.
>>>
>>> Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
>>> be "guest rejected hotplug", or something more specific, in the rare
>>> case that the guest has a way of signalling something more specific,
>>> or "timeout" - but the later *only* to be sent in cases where on the
>>> timeout we're able to block any later completion of the unplug (as we
>>> can on papr).
> 
> Is canceling unplug on timeout documented somewhere (like some spec)?

Nope.

> 
> If not it might (theoretically) confuse guest when it tries to unplug
> after timeout and leave guest in some unexpected state.

I consider this a remote possibility due to the generous timeout we're
using in the machine, but it is a possibility nevertheless.

> 
>>
>> I believe that's already covered by the existing API:
>>
>>
>> +# @DEVICE_UNPLUG_ERROR:
>> +#
>> +# Emitted when a device hot unplug error occurs.
>> +#
>> +# @device: device name
>> +#
>> +# @msg: Informative message
>>
>> The 'informative message' would be the reason the event occurred. In patch
>> 4/4, for the memory hotunplug refused by the guest, it is being set as:
>>
>>        qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
>>                                     "for device %s", dev->id);
>>        qapi_event_send_device_unplug_error(dev->id, qapi_error);
>>
>>
>>
>> We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
>> case (currently on patch 2/4) by just changing 'msg', e.g.:
>>
>>
>>        qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
>>        qapi_event_send_device_unplug_error(dev->id, qapi_error);
>>
> 
> lets make everything support ACPI (just kidding).

I would love to make PAPR more ACPI and less .... PAPR.

> 
> maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
> which sort of does the same thing (and more) but instead of strings uses status codes
> defined by spec.
> 
> Idea similar to DEVICE_UNPLUG_ERROR was considered back then, but instead of QEMU being
> a poor translator of status codes to non machine-readable strings we went with
> exposing well documented status codes to user. This way user can implement
> specific reactions to particular errors just looking at JSON + spec.


This is not a bad idea. Problem is that we don't have all ACPI_DEVICE_OST
fields to fill in because, well, both the timeout and the cancell mechanism
aren't specified there.


Thanks,


DHB

> 
>> Thanks,
>>
>> DHB
>>
>>
>>>
>>> Thoughs, Markus?
>>>    
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-30 23:46           ` David Gibson
  2021-03-31  9:49             ` Igor Mammedov
@ 2021-03-31 19:47             ` Daniel Henrique Barboza
  2021-04-01  1:36               ` David Gibson
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-31 19:47 UTC (permalink / raw)
  To: David Gibson, Igor Mammedov; +Cc: groug, qemu-ppc, armbru, qemu-devel



On 3/30/21 8:46 PM, David Gibson wrote:
> On Tue, Mar 30, 2021 at 01:28:31AM +0200, Igor Mammedov wrote:
>> On Wed, 24 Mar 2021 16:09:59 -0300
>> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>>
>>> On 3/23/21 10:40 PM, David Gibson wrote:
>>>> On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/22/21 10:12 PM, David Gibson wrote:
>>>>>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
>>>>>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
>>>>>>>
>>>>>>> Patches 1 and 3 are independent of the ppc patches and can be applied
>>>>>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
>>>>>>> are dependent on the QAPI patches.
>>>>>>
>>>>>> Implementation looks fine, but I think there's a bit more to discuss
>>>>>> before we can apply.
>>>>>>
>>>>>> I think it would make sense to re-order this and put UNPLUG_ERROR
>>>>>> first.  Its semantics are clearer, and I think there's a stronger case
>>>>>> for it.
>>>>>
>>>>> Alright
>>>>>   
>>>>>>
>>>>>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
>>>>>> it really tell the user/management anything useful beyond what
>>>>>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
>>>>>
>>>>>
>>>>> It informs that the hotunplug operation exceed the timeout that QEMU
>>>>> internals considers adequate, but QEMU can't assert that it was caused
>>>>> by an error or an unexpected delay. The end result is that the device
>>>>> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.
>>>>
>>>> Is it, though?  I mean, it is with this implementation for papr:
>>>> because we clear the unplug_requested flag, even if the guest later
>>>> tries to complete the unplug, it will fail.
>>>>
>>>> But if I understand what Markus was saying correctly, that might not
>>>> be possible for all hotplug systems.  I believe Markus was suggesting
>>>> that DEVICE_NOT_DELETED could just mean that we haven't deleted the
>>>> device yet, but it could still happen later.
>>>>
>>>> And in that case, I'm not yet sold on the value of a message that
>>>> essentially just means "Ayup, still dunno what's happening, sorry".
>>>>    
>>>>> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
>>>>> event.
>>>>
>>>> Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
>>>> be "guest rejected hotplug", or something more specific, in the rare
>>>> case that the guest has a way of signalling something more specific,
>>>> or "timeout" - but the later *only* to be sent in cases where on the
>>>> timeout we're able to block any later completion of the unplug (as we
>>>> can on papr).
>>
>> Is canceling unplug on timeout documented somewhere (like some spec)?
> 
> Uh.. not as such.  In the PAPR model, hotplugs and unplugs are mostly
> guest directed, so the question doesn't really arise.
> 
>> If not it might (theoretically) confuse guest when it tries to unplug
>> after timeout and leave guest in some unexpected state.
> 
> Possible, but probably not that likely.  The mechanism we use to
> "cancel" the hotplugs is that we just fail the hypercalls that the
> guest will need to call to actually complete the hotplug.  We also
> fail those in some other situations, and that seems to work.
> 
> That said, I no longer think this cancelling on timeout is a good
> idea, since it mismatches what happens on other platforms more than I
> think we need to.
> 
> My now preferred approach is to revert the timeout changes, but
> instead allow retries of unplugs to be issued.  I think that's just a
> matter of resending the unplug message to the guest, while making it
> otherwise a no-op on the qemu side.

I used this approach in a patch I sent back in January:

"[PATCH v2 1/1] spapr.c: always pulse guest IRQ in spapr_core_unplug_request()"

https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg04399.html


Let me know and I'll revert the timeout mechanism and re-post this one.
I guess there's still time to make this change in the 6.0.0 window, avoiding
releasing a mechanism we're not happy with.



Thanks,


DHB

> 
>>> I believe that's already covered by the existing API:
>>>
>>>
>>> +# @DEVICE_UNPLUG_ERROR:
>>> +#
>>> +# Emitted when a device hot unplug error occurs.
>>> +#
>>> +# @device: device name
>>> +#
>>> +# @msg: Informative message
>>>
>>> The 'informative message' would be the reason the event occurred. In patch
>>> 4/4, for the memory hotunplug refused by the guest, it is being set as:
>>>
>>>        qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
>>>                                     "for device %s", dev->id);
>>>        qapi_event_send_device_unplug_error(dev->id, qapi_error);
>>>
>>>
>>>
>>> We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
>>> case (currently on patch 2/4) by just changing 'msg', e.g.:
>>>
>>>
>>>        qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
>>>        qapi_event_send_device_unplug_error(dev->id, qapi_error);
>>>
>>
>> lets make everything support ACPI (just kidding).
> 
> Heh.  If nothing else, doesn't help us with existing guests.
> 
>> maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
>> which sort of does the same thing (and more) but instead of strings uses status codes
>> defined by spec.
> 
> Hmm.  I'm a bit dubious about issuing ACPI messages for a non ACPI
> guest, but maybe that could work.
> 
>> Idea similar to DEVICE_UNPLUG_ERROR was considered back then, but instead of QEMU being
>> a poor translator of status codes to non machine-readable strings we went with
>> exposing well documented status codes to user. This way user can implement
>> specific reactions to particular errors just looking at JSON + spec.
>>
>>> Thanks,
>>>
>>> DHB
>>>
>>>
>>>>
>>>> Thoughs, Markus?
>>>>    
>>>
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-31  9:49             ` Igor Mammedov
@ 2021-04-01  1:31               ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-04-01  1:31 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P. Berrangé,
	qemu-devel, Daniel Henrique Barboza, groug, armbru, qemu-ppc


[-- Attachment #1: Type: text/plain, Size: 7218 bytes --]

On Wed, Mar 31, 2021 at 11:49:14AM +0200, Igor Mammedov wrote:
> On Wed, 31 Mar 2021 10:46:49 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Mar 30, 2021 at 01:28:31AM +0200, Igor Mammedov wrote:
> > > On Wed, 24 Mar 2021 16:09:59 -0300
> > > Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> > >   
> > > > On 3/23/21 10:40 PM, David Gibson wrote:  
> > > > > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:    
> > > > >>
> > > > >>
> > > > >> On 3/22/21 10:12 PM, David Gibson wrote:    
> > > > >>> On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:    
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > > > >>>> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > > > >>>>
> > > > >>>> Patches 1 and 3 are independent of the ppc patches and can be applied
> > > > >>>> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > > > >>>> are dependent on the QAPI patches.    
> > > > >>>
> > > > >>> Implementation looks fine, but I think there's a bit more to discuss
> > > > >>> before we can apply.
> > > > >>>
> > > > >>> I think it would make sense to re-order this and put UNPLUG_ERROR
> > > > >>> first.  Its semantics are clearer, and I think there's a stronger case
> > > > >>> for it.    
> > > > >>
> > > > >> Alright
> > > > >>    
> > > > >>>
> > > > >>> I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > > > >>> it really tell the user/management anything useful beyond what
> > > > >>> receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?    
> > > > >>
> > > > >>
> > > > >> It informs that the hotunplug operation exceed the timeout that QEMU
> > > > >> internals considers adequate, but QEMU can't assert that it was caused
> > > > >> by an error or an unexpected delay. The end result is that the device
> > > > >> is not going to be deleted from QMP, so DEVICE_NOT_DELETED.    
> > > > > 
> > > > > Is it, though?  I mean, it is with this implementation for papr:
> > > > > because we clear the unplug_requested flag, even if the guest later
> > > > > tries to complete the unplug, it will fail.
> > > > > 
> > > > > But if I understand what Markus was saying correctly, that might not
> > > > > be possible for all hotplug systems.  I believe Markus was suggesting
> > > > > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > > > > device yet, but it could still happen later.
> > > > > 
> > > > > And in that case, I'm not yet sold on the value of a message that
> > > > > essentially just means "Ayup, still dunno what's happening, sorry".
> > > > >     
> > > > >> Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> > > > >> event.    
> > > > > 
> > > > > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > > > > be "guest rejected hotplug", or something more specific, in the rare
> > > > > case that the guest has a way of signalling something more specific,
> > > > > or "timeout" - but the later *only* to be sent in cases where on the
> > > > > timeout we're able to block any later completion of the unplug (as we
> > > > > can on papr).  
> > > 
> > > Is canceling unplug on timeout documented somewhere (like some spec)?  
> > 
> > Uh.. not as such.  In the PAPR model, hotplugs and unplugs are mostly
> > guest directed, so the question doesn't really arise.
> > 
> > > If not it might (theoretically) confuse guest when it tries to unplug
> > > after timeout and leave guest in some unexpected state.  
> > 
> > Possible, but probably not that likely.  The mechanism we use to
> > "cancel" the hotplugs is that we just fail the hypercalls that the
> > guest will need to call to actually complete the hotplug.  We also
> > fail those in some other situations, and that seems to work.
> > 
> > That said, I no longer think this cancelling on timeout is a good
> > idea, since it mismatches what happens on other platforms more than I
> > think we need to.
> > 
> > My now preferred approach is to revert the timeout changes, but
> > instead allow retries of unplugs to be issued.  I think that's just a
> > matter of resending the unplug message to the guest, while making it
> > otherwise a no-op on the qemu side.
> 
> Yep, all we need to do is notify QEMU user, so it knows that unplug
> has failed. Then It can decide on it's own what to do with it and also when.

I'm not sure even that makes sense.  I mean in the cases that the
guest specifically signals failure, then yes, we should definitely
notify the user.  But for the cases the timeout was covering, I'm not
convinced a notification is useful: we *don't* know the unplug has
failed, we only suspect it, and I don't see that qemu really has any
more information than the user about what the expected time for an
unplug should be.

> > > > I believe that's already covered by the existing API:
> > > > 
> > > > 
> > > > +# @DEVICE_UNPLUG_ERROR:
> > > > +#
> > > > +# Emitted when a device hot unplug error occurs.
> > > > +#
> > > > +# @device: device name
> > > > +#
> > > > +# @msg: Informative message
> > > > 
> > > > The 'informative message' would be the reason the event occurred. In patch
> > > > 4/4, for the memory hotunplug refused by the guest, it is being set as:
> > > > 
> > > >       qapi_error = g_strdup_printf("Memory hotunplug rejected by the guest "
> > > >                                    "for device %s", dev->id);
> > > >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > > > 
> > > > 
> > > > 
> > > > We could use the same DEVICE_UNPLUG_ERROR event in the CPU hotunplug timeout
> > > > case (currently on patch 2/4) by just changing 'msg', e.g.:
> > > > 
> > > > 
> > > >       qapi_error = g_strdup_printf("CPU hotunplug timeout for device %s", dev->id);
> > > >       qapi_event_send_device_unplug_error(dev->id, qapi_error);
> > > >   
> > > 
> > > lets make everything support ACPI (just kidding).  
> > 
> > Heh.  If nothing else, doesn't help us with existing guests.
> > 
> > > maybe we can reuse already existing ACPI_DEVICE_OST instead of DEVICE_UNPLUG_ERROR
> > > which sort of does the same thing (and more) but instead of strings uses status codes
> > > defined by spec.  
> > 
> > Hmm.  I'm a bit dubious about issuing ACPI messages for a non ACPI
> > guest, but maybe that could work.
> 
> May be we can rename it to be ACPI agnostic (though I'm not sure how renaming
> QAPI interfaces should be done (it might upset libvirt for example)).
> 
> (My point was that ACPI spec had already gone through all the trouble defining
> status of completion and documenting it. Also libvirt supports this notification.
> It looks like worthwhile thing to consider if can somehow reuse it outside of
> x86 world)

Yeah, that's a fair point.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-31 19:47             ` Daniel Henrique Barboza
@ 2021-04-01  1:36               ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-04-01  1:36 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Igor Mammedov, groug, qemu-ppc, armbru, qemu-devel


[-- Attachment #1: Type: text/plain, Size: 5883 bytes --]

On Wed, Mar 31, 2021 at 04:47:14PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/30/21 8:46 PM, David Gibson wrote:
> > On Tue, Mar 30, 2021 at 01:28:31AM +0200, Igor Mammedov wrote:
> > > On Wed, 24 Mar 2021 16:09:59 -0300
> > > Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> > > 
> > > > On 3/23/21 10:40 PM, David Gibson wrote:
> > > > > On Tue, Mar 23, 2021 at 02:10:22PM -0300, Daniel Henrique Barboza wrote:
> > > > > > 
> > > > > > 
> > > > > > On 3/22/21 10:12 PM, David Gibson wrote:
> > > > > > > On Fri, Mar 12, 2021 at 05:07:36PM -0300, Daniel Henrique Barboza wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> > > > > > > > DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> > > > > > > > 
> > > > > > > > Patches 1 and 3 are independent of the ppc patches and can be applied
> > > > > > > > separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> > > > > > > > are dependent on the QAPI patches.
> > > > > > > 
> > > > > > > Implementation looks fine, but I think there's a bit more to discuss
> > > > > > > before we can apply.
> > > > > > > 
> > > > > > > I think it would make sense to re-order this and put UNPLUG_ERROR
> > > > > > > first.  Its semantics are clearer, and I think there's a stronger case
> > > > > > > for it.
> > > > > > 
> > > > > > Alright
> > > > > > > 
> > > > > > > I'm a bit less sold on DEVICE_NOT_DELETED, after consideration.  Does
> > > > > > > it really tell the user/management anything useful beyond what
> > > > > > > receiving neither a DEVICE_DELETED nor a DEVICE_UNPLUG_ERROR does?
> > > > > > 
> > > > > > 
> > > > > > It informs that the hotunplug operation exceed the timeout that QEMU
> > > > > > internals considers adequate, but QEMU can't assert that it was caused
> > > > > > by an error or an unexpected delay. The end result is that the device
> > > > > > is not going to be deleted from QMP, so DEVICE_NOT_DELETED.
> > > > > 
> > > > > Is it, though?  I mean, it is with this implementation for papr:
> > > > > because we clear the unplug_requested flag, even if the guest later
> > > > > tries to complete the unplug, it will fail.
> > > > > 
> > > > > But if I understand what Markus was saying correctly, that might not
> > > > > be possible for all hotplug systems.  I believe Markus was suggesting
> > > > > that DEVICE_NOT_DELETED could just mean that we haven't deleted the
> > > > > device yet, but it could still happen later.
> > > > > 
> > > > > And in that case, I'm not yet sold on the value of a message that
> > > > > essentially just means "Ayup, still dunno what's happening, sorry".
> > > > > > Perhaps we should just be straightforward and create a DEVICE_UNPLUG_TIMEOUT
> > > > > > event.
> > > > > 
> > > > > Hm... what if we added a "reason" field to UNPLUG_ERROR.  That could
> > > > > be "guest rejected hotplug", or something more specific, in the rare
> > > > > case that the guest has a way of signalling something more specific,
> > > > > or "timeout" - but the later *only* to be sent in cases where on the
> > > > > timeout we're able to block any later completion of the unplug (as we
> > > > > can on papr).
> > > 
> > > Is canceling unplug on timeout documented somewhere (like some spec)?
> > 
> > Uh.. not as such.  In the PAPR model, hotplugs and unplugs are mostly
> > guest directed, so the question doesn't really arise.
> > 
> > > If not it might (theoretically) confuse guest when it tries to unplug
> > > after timeout and leave guest in some unexpected state.
> > 
> > Possible, but probably not that likely.  The mechanism we use to
> > "cancel" the hotplugs is that we just fail the hypercalls that the
> > guest will need to call to actually complete the hotplug.  We also
> > fail those in some other situations, and that seems to work.
> > 
> > That said, I no longer think this cancelling on timeout is a good
> > idea, since it mismatches what happens on other platforms more than I
> > think we need to.
> > 
> > My now preferred approach is to revert the timeout changes, but
> > instead allow retries of unplugs to be issued.  I think that's just a
> > matter of resending the unplug message to the guest, while making it
> > otherwise a no-op on the qemu side.
> 
> I used this approach in a patch I sent back in January:

Yes, you did.  I rejected it at the time, but the discussion since has
convinced me that I made a mistake.  In particular, the point that a
really loaded host could pretty much arbitrarily extend the time for
the guest to process the request is convincing to me.

> "[PATCH v2 1/1] spapr.c: always pulse guest IRQ in spapr_core_unplug_request()"

Although.. I think we should actually do a full resend of the unplug
request message to the queue, not just pulse the irq.  AFAICT an
unplug request on a DRC that is already unplugged should be safe
(remove-LMB-by-count-only requests might have to be an exception).

> https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg04399.html
> 
> 
> Let me know and I'll revert the timeout mechanism and re-post this one.
> I guess there's still time to make this change in the 6.0.0 window, avoiding
> releasing a mechanism we're not happy with.

Yes, please, I think this is the way to go.

1) Revert timeouts (6.0)
2) Allow retries (6.0)
3) Add notifications when the guest positively signals failure (6.1)

I think this gives us the best mix of a) user experience, b) not
allowing nasty edge cases on loaded systems, c) matching x86 behaviour
where possible.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events
  2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
                   ` (4 preceding siblings ...)
  2021-03-23  1:12 ` [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events David Gibson
@ 2021-04-20 17:11 ` Daniel Henrique Barboza
  5 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-20 17:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: armbru, qemu-ppc, groug, david

Hello,

Quick update about this work:

- the hotunplug timeout was reverted in the pSeries machine for 6.0.0.
This means that we have no use for a DEVICE_NOT_DELETED event or similar.
I'll drop it for the next version of the series.

- there is a good chance that the pSeries kernel will introduce a way to
report hotunplug errors in v5.13 ([1] for more info). This would make
the DEVICE_UNPLUG_ERROR event relevant again (otherwise it would just be
a rebranding of the existing mem_unplug_error) since we'll be able to
properly report guest side unplug errors for all devices in the pSeries
machine, starting with CPUs.


Thanks,


DHB


[1] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210416210216.380291-3-danielhb413@gmail.com/




On 3/12/21 5:07 PM, Daniel Henrique Barboza wrote:
> Hi,
> 
> This series adds 2 new QAPI events, DEVICE_NOT_DELETED and
> DEVICE_UNPLUG_ERROR. They were (and are still being) discussed in [1].
> 
> Patches 1 and 3 are independent of the ppc patches and can be applied
> separately. Patches 2 and 4 are based on David's ppc-for-6.0 branch and
> are dependent on the QAPI patches.
> 
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg01900.html
> 
> Daniel Henrique Barboza (4):
>    qapi/qdev.json: add DEVICE_NOT_DELETED event
>    spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout
>    qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event
>    spapr.c: use DEVICE_UNPLUG_ERROR event in
>      spapr_memory_unplug_rollback()
> 
>   hw/ppc/spapr.c     |  2 +-
>   hw/ppc/spapr_drc.c |  8 ++++++++
>   qapi/machine.json  | 23 +++++++++++++++++++++++
>   qapi/qdev.json     | 28 ++++++++++++++++++++++++++++
>   4 files changed, 60 insertions(+), 1 deletion(-)
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12 20:07 [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events Daniel Henrique Barboza
2021-03-12 20:07 ` [PATCH 1/4] qapi/qdev.json: add DEVICE_NOT_DELETED event Daniel Henrique Barboza
2021-03-23 18:00   ` Eric Blake
2021-03-23 18:12     ` Daniel Henrique Barboza
2021-03-12 20:07 ` [PATCH 2/4] spapr_drc.c: send DEVICE_NOT_DELETED event on unplug timeout Daniel Henrique Barboza
2021-03-12 20:07 ` [PATCH 3/4] qapi/machine.json: add DEVICE_UNPLUG_ERROR QAPI event Daniel Henrique Barboza
2021-03-12 20:07 ` [PATCH 4/4] spapr.c: use DEVICE_UNPLUG_ERROR event in spapr_memory_unplug_rollback() Daniel Henrique Barboza
2021-03-23  1:12 ` [PATCH 0/4] DEVICE_NOT_DELETED/DEVICE_UNPLUG_ERROR QAPI events David Gibson
2021-03-23 17:10   ` Daniel Henrique Barboza
2021-03-24  1:40     ` David Gibson
2021-03-24 19:09       ` Daniel Henrique Barboza
2021-03-25  1:32         ` David Gibson
2021-03-29 23:28         ` Igor Mammedov
2021-03-30 23:46           ` David Gibson
2021-03-31  9:49             ` Igor Mammedov
2021-04-01  1:31               ` David Gibson
2021-03-31 19:47             ` Daniel Henrique Barboza
2021-04-01  1:36               ` David Gibson
2021-03-31 19:40           ` Daniel Henrique Barboza
2021-04-20 17:11 ` Daniel Henrique Barboza

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git
	git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git