* [RFC] adding a generic QAPI event for failed device hotunplug
@ 2021-03-05 18:16 Daniel Henrique Barboza
  2021-03-06  6:57 ` Markus Armbruster
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-05 18:16 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc; +Cc: michael.roth, Markus Armbruster, David Gibson

Hi,

Recent changes in pseries code (not yet pushed, available at David's
ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
hotunplug errors in the pseries machine.

The pseries machine is also using a timeout to cancel CPU hotunplugs that
take too long to finish (in which case we're assuming a guest-side error),
and it would be desirable to send a QAPI event for this case as well.

At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
need it). Before sending patches to implement this new event I had a talk
with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
used by the pseries machine in both error scenarios (MEM and CPU).
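
A generic event along those lines might look roughly like this in the QAPI
schema (purely a hypothetical sketch: the member names are modeled on the
existing MEM_UNPLUG_ERROR event and are not a settled proposal):

```
##
# @DEVICE_UNPLUG_ERROR:
#
# Emitted when a device hot unplug error occurs and the unplug is
# rolled back.
#
# @device: the device's ID, if it has one
#
# @msg: information about the failure
##
{ 'event': 'DEVICE_UNPLUG_ERROR',
  'data': { '*device': 'str', 'msg': 'str' } }
```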

This could be used by x86 as well, although I believe the use of
MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.


Any suggestions/comments?



Thanks,


DHB


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-05 18:16 [RFC] adding a generic QAPI event for failed device hotunplug Daniel Henrique Barboza
@ 2021-03-06  6:57 ` Markus Armbruster
  2021-03-08 14:22   ` Daniel Henrique Barboza
  0 siblings, 1 reply; 19+ messages in thread
From: Markus Armbruster @ 2021-03-06  6:57 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Michael S. Tsirkin, michael.roth, qemu-devel, qemu-ppc,
	Igor Mammedov, David Gibson

Cc: ACPI maintainers for additional expertise.

Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> Hi,
>
> Recent changes in pseries code (not yet pushed, available at David's
> ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
> hotunplug errors in the pseries machine.
>
> The pseries machine is also using a timeout to cancel CPU hotunplugs that
> take too long to finish (in which case we're assuming a guest-side error),
> and it would be desirable to send a QAPI event for this case as well.
>
> At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
> need it).

I see two interpretations of "ACPI doesn't need":

1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?

2. Management applications haven't needed to know badly enough to
implement an event.

>           Before sending patches to implement this new event I had a talk
> with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
> event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
> used by the pseries machine in both error scenarios (MEM and CPU).

Good point.  One general event is better than two special ones that
could easily grow siblings.

> This could be used by x86 as well, although I believe the use of
> MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.

Yes.  Our rules for interface deprecation apply.

> Any suggestions/comments?

We should document the event's reliability.  Are there unplug operations
where we can't detect failure?  Are there unplug operations where we
could, but haven't implemented the event?

The fewer exceptions, the better, of course.




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-06  6:57 ` Markus Armbruster
@ 2021-03-08 14:22   ` Daniel Henrique Barboza
  2021-03-08 17:04     ` Markus Armbruster
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-08 14:22 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Michael S. Tsirkin, michael.roth, qemu-devel, qemu-ppc,
	Igor Mammedov, David Gibson



On 3/6/21 3:57 AM, Markus Armbruster wrote:
> Cc: ACPI maintainers for additional expertise.
> 
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> Hi,
>>
>> Recent changes in pseries code (not yet pushed, available at David's
>> ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
>> hotunplug errors in the pseries machine.
>>
>> The pseries machine is also using a timeout to cancel CPU hotunplugs that
>> take too long to finish (in which case we're assuming a guest-side error),
>> and it would be desirable to send a QAPI event for this case as well.
>>
>> At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
>> need it).
> 
> I see two interpretations of "ACPI doesn't need":
> 
> 1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?
> 
> 2. Management applications haven't needed to know badly enough to
> implement an event.
> 
>>            Before sending patches to implement this new event I had a talk
>> with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
>> event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
>> used by the pseries machine in both error scenarios (MEM and CPU).
> 
> Good point.  One general event is better than two special ones that
> could easily grow siblings.
> 
>> This could be used by x86 as well, although I believe the use of
>> MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.
> 
> Yes.  Our rules for interface deprecation apply.
> 
>> Any suggestions/comments?
> 
> We should document the event's reliability.  Are there unplug operations
> where we can't detect failure?  Are there unplug operations where we
> could, but haven't implemented the event?

The current version of the PowerPC spec that the pseries machine implements
(LOPAR) does not provide a way for the virtual machine to report a hotunplug
error back to the hypervisor. If something fails, the hypervisor is left
in the dark.

What happened in the 6.0.0 dev cycle is that we faced a reliable way of
making CPU hotunplug fail in the guest (trying to hotunplug the last online
CPU) and the pseries machine was unprepared for dealing with it. We ended up
implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
that the CPU hotunplug failed in the guest. This is the first scenario we have
today where we want to send a QAPI event informing the CPU hotunplug error,
but this event does not exist in QEMU ATM.

The second scenario is a memory hotunplug error. I found out that the pseries
guest kernel does a reconfiguration step when re-attaching the DIMM right
after refusing the hotunplug, and this reconfiguration is visible to QEMU.
I proceeded to make the pseries machine detect this error case, rollback the
unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
well instead of the MEM specific one.

This investigation and work in the mem hotunplug error path triggered a
discussion in qemu-ppc, where we're considering whether we should do the same
signalling the kernel does for the DIMM hotunplug error for all other device
hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
and it's currently a no-op. We would make a LOPAR spec change to make this an
official hotunplug error report mechanism, and all pseries hotunplug operations,
for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.

Granted, the spec change + Kernel change is not something that we will be able
to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
already in place would make it easier for the future as well.
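
The timeout-based detection described above can be sketched as follows (a
minimal illustration only; the function names and timeout value are invented
for the example and are not the pseries machine code):

```python
# Sketch of timeout-based CPU hotunplug error detection.  All names and
# the timeout value are illustrative, not QEMU's implementation.
# Timestamps are passed in so the logic is deterministic and testable.

UNPLUG_TIMEOUT = 15.0  # seconds; an assumption for this sketch


def check_unplug(guest_acked, requested_at, now):
    """Classify a pending hotunplug request.

    guest_acked: True once the guest signalled unplug completion.
    Returns 'success', 'pending', or 'assumed-failed'.  LOPAR gives the
    hypervisor no positive failure report, so after the timeout we can
    only *assume* the guest refused the unplug.
    """
    if guest_acked:
        return 'success'
    if now - requested_at >= UNPLUG_TIMEOUT:
        return 'assumed-failed'
    return 'pending'
```

On 'assumed-failed' the machine would roll back the unplug and emit the
proposed DEVICE_UNPLUG_ERROR event.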


I have a doc draft of these changes/infos that I forgot to post. I would post
it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
information there as well. Does that work for you as far as documentation
goes?



DHB


> 
> The fewer exceptions, the better, of course.
> 



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-08 14:22   ` Daniel Henrique Barboza
@ 2021-03-08 17:04     ` Markus Armbruster
  2021-03-08 18:01       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 19+ messages in thread
From: Markus Armbruster @ 2021-03-08 17:04 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Michael S. Tsirkin, michael.roth, qemu-devel, qemu-ppc,
	Igor Mammedov, David Gibson

Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 3/6/21 3:57 AM, Markus Armbruster wrote:
>> Cc: ACPI maintainers for additional expertise.
>> 
>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> 
>>> Hi,
>>>
>>> Recent changes in pseries code (not yet pushed, available at David's
>>> ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
>>> hotunplug errors in the pseries machine.
>>>
>>> The pseries machine is also using a timeout to cancel CPU hotunplugs that
>>> take too long to finish (in which case we're assuming a guest-side error),
>>> and it would be desirable to send a QAPI event for this case as well.
>>>
>>> At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
>>> need it).
>> 
>> I see two interpretations of "ACPI doesn't need":
>> 
>> 1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?
>> 
>> 2. Management applications haven't needed to know badly enough to
>> implement an event.
>> 
>>>            Before sending patches to implement this new event I had a talk
>>> with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
>>> event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
>>> used by the pseries machine in both error scenarios (MEM and CPU).
>> 
>> Good point.  One general event is better than two special ones that
>> could easily grow siblings.
>> 
>>> This could be used by x86 as well, although I believe the use of
>>> MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.
>> 
>> Yes.  Our rules for interface deprecation apply.
>> 
>>> Any suggestions/comments?
>> 
>> We should document the event's reliability.  Are there unplug operations
>> where we can't detect failure?  Are there unplug operations where we
>> could, but haven't implemented the event?
>
> The current version of the PowerPC spec that the pseries machine implements
> (LOPAR) does not provide a way for the virtual machine to report a hotunplug
> error back to the hypervisor. If something fails, the hypervisor is left
> in the dark.
>
> What happened in the 6.0.0 dev cycle is that we faced a reliable way of

Do you mean "unreliable way"?

> making CPU hotunplug fail in the guest (trying to hotunplug the last online
> CPU) and the pseries machine was unprepared for dealing with it. We ended up
> implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> that the CPU hotunplug failed in the guest. This is the first scenario we have
> today where we want to send a QAPI event informing the CPU hotunplug error,
> but this event does not exist in QEMU ATM.

When the timeout kicks in, how can you know the operation failed?  You
better be sure when you send an error event.  In other words: what
prevents the scenario where the operation is much slower than you
expect, the timeout expires, the error event is sent, and then the
operation completes successfully?

> The second scenario is a memory hotunplug error. I found out that the pseries
> guest kernel does a reconfiguration step when re-attaching the DIMM right
> after refusing the hotunplug, and this reconfiguration is visible to QEMU.
> I proceeded to make the pseries machine detect this error case, rollback the
> unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
> by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
> well instead of the MEM specific one.
>
> This investigation and work in the mem hotunplug error path triggered a
> discussion in qemu-ppc, where we're considering whether we should do the same
> signalling the kernel does for the DIMM hotunplug error for all other device
> hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
> and it's currently a no-op. We would make a LOPAR spec change to make this an
> official hotunplug error report mechanism, and all pseries hotunplug operations,
> for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>
> Granted, the spec change + Kernel change is not something that we will be able
> to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
> already in place would make it easier for the future as well.
>
>
> I have a doc draft of these changes/infos that I forgot to post. I would post
> it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
> information there as well. Does that work for you as far as documentation
> goes?

A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
and device.

I'm not asking you to implement it for all machines and devices.  But
I want its design to support growth towards that goal, and its
documentation reflect its current state.

In particular, the doc comment in the QAPI schema should list the
limitation.  If the event is only implemented for LOPAR for now, say so.
If it's only implemented for certain devices, say so.

Questions?




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-08 17:04     ` Markus Armbruster
@ 2021-03-08 18:01       ` Daniel Henrique Barboza
  2021-03-09  3:22         ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-08 18:01 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Michael S. Tsirkin, michael.roth, qemu-devel, qemu-ppc,
	Igor Mammedov, David Gibson



On 3/8/21 2:04 PM, Markus Armbruster wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> On 3/6/21 3:57 AM, Markus Armbruster wrote:
>>> Cc: ACPI maintainers for additional expertise.
>>>
>>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>>
>>>> Hi,
>>>>
>>>> Recent changes in pseries code (not yet pushed, available at David's
>>>> ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
>>>> hotunplug errors in the pseries machine.
>>>>
>>>> The pseries machine is also using a timeout to cancel CPU hotunplugs that
>>>> take too long to finish (in which case we're assuming a guest-side error),
>>>> and it would be desirable to send a QAPI event for this case as well.
>>>>
>>>> At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
>>>> need it).
>>>
>>> I see two interpretations of "ACPI doesn't need":
>>>
>>> 1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?
>>>
>>> 2. Management applications haven't needed to know badly enough to
>>> implement an event.
>>>
>>>>             Before sending patches to implement this new event I had a talk
>>>> with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
>>>> event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
>>>> used by the pseries machine in both error scenarios (MEM and CPU).
>>>
>>> Good point.  One general event is better than two special ones that
>>> could easily grow siblings.
>>>
>>>> This could be used by x86 as well, although I believe the use of
>>>> MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.
>>>
>>> Yes.  Our rules for interface deprecation apply.
>>>
>>>> Any suggestions/comments?
>>>
>>> We should document the event's reliability.  Are there unplug operations
>>> where we can't detect failure?  Are there unplug operations where we
>>> could, but haven't implemented the event?
>>
>> The current version of the PowerPC spec that the pseries machine implements
>> (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>> error back to the hypervisor. If something fails, the hypervisor is left
>> in the dark.
>>
>> What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> 
> Do you mean "unreliable way"?

I guess a better word would be 'reproducible', as in we discovered a reproducible
way of getting the guest kernel to refuse the CPU hotunplug.

> 
>> making CPU hotunplug fail in the guest (trying to hotunplug the last online
>> CPU) and the pseries machine was unprepared for dealing with it. We ended up
>> implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>> that the CPU hotunplug failed in the guest. This is the first scenario we have
>> today where we want to send a QAPI event informing the CPU hotunplug error,
>> but this event does not exist in QEMU ATM.
> 
> When the timeout kicks in, how can you know the operation failed?  You
> better be sure when you send an error event.  In other words: what
> prevents the scenario where the operation is much slower than you
> expect, the timeout expires, the error event is sent, and then the
> operation completes successfully?

A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
if the guest is under heavy load. The timeout is set to 15 seconds.

I'm aware that there's always that special use case, one we haven't seen yet,
where this assumption is no longer valid. The plan is to change the spec and the
guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
lands. For now, timing out the CPU hotunplug process when we're almost certain
(but not 100%) that it failed in the guest is the best we can do.

For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
guest visible changes when we change how we're detecting the unplug errors.
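
To illustrate the compatibility point, a management application's QMP event
handler could key on the generic event while still accepting the legacy one
during the deprecation period (a sketch; the DEVICE_UNPLUG_ERROR payload
shape is an assumption, since that event does not exist yet):

```python
def classify_unplug_event(event):
    """Map a QMP event dict to an unplug-error notification (sketch).

    The 'device'/'msg' members mirror MEM_UNPLUG_ERROR; the payload of
    the hypothetical DEVICE_UNPLUG_ERROR is assumed to match it.
    """
    name = event.get('event')
    if name in ('DEVICE_UNPLUG_ERROR', 'MEM_UNPLUG_ERROR'):
        data = event.get('data', {})
        return {'failed-device': data.get('device'),
                'reason': data.get('msg')}
    return None  # not an unplug error
```

Handling both names in one place means nothing changes for the client when
the machine switches from the MEM-specific event to the generic one.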

> 
>> The second scenario is a memory hotunplug error. I found out that the pseries
>> guest kernel does a reconfiguration step when re-attaching the DIMM right
>> after refusing the hotunplug, and this reconfiguration is visible to QEMU.
>> I proceeded to make the pseries machine detect this error case, rollback the
>> unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
>> by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
>> well instead of the MEM specific one.
>>
>> This investigation and work in the mem hotunplug error path triggered a
>> discussion in qemu-ppc, where we're considering whether we should do the same
>> signalling the kernel does for the DIMM hotunplug error for all other device
>> hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
>> and it's currently a no-op. We would make a LOPAR spec change to make this an
>> official hotunplug error report mechanism, and all pseries hotunplug operations,
>> for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>>
>> Granted, the spec change + Kernel change is not something that we will be able
>> to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
>> already in place would make it easier for the future as well.
>>
>>
>> I have a doc draft of these changes/infos that I forgot to post. I would post
>> it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
>> information there as well. Does that work for you as far as documentation
>> goes?
> 
> A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
> and device.
> 
> I'm not asking you to implement it for all machines and devices.  But
> I want its design to support growth towards that goal, and its
> documentation reflect its current state.
> 
> In particular, the doc comment in the QAPI schema should list the
> limitation.  If the event is only implemented for LOPAR for now, say so.
> If it's only implemented for certain devices, say so.
> 
> Questions?


Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
in a way that it can be used by other machines in the future, and document
where the event is being used ATM.


Thanks,


DHB


> 



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-08 18:01       ` Daniel Henrique Barboza
@ 2021-03-09  3:22         ` David Gibson
  2021-03-09  6:22           ` Markus Armbruster
  0 siblings, 1 reply; 19+ messages in thread
From: David Gibson @ 2021-03-09  3:22 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Michael S. Tsirkin, michael.roth, qemu-devel, Markus Armbruster,
	qemu-ppc, Igor Mammedov


On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/8/21 2:04 PM, Markus Armbruster wrote:
> > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > 
> > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> > > > Cc: ACPI maintainers for additional expertise.
> > > > 
> > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > Recent changes in pseries code (not yet pushed, available at David's
> > > > > ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
> > > > > hotunplug errors in the pseries machine.
> > > > > 
> > > > > The pseries machine is also using a timeout to cancel CPU hotunplugs that
> > > > > take too long to finish (in which case we're assuming a guest-side error),
> > > > > and it would be desirable to send a QAPI event for this case as well.
> > > > > 
> > > > > At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
> > > > > need it).
> > > > 
> > > > I see two interpretations of "ACPI doesn't need":
> > > > 
> > > > 1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?
> > > > 
> > > > 2. Management applications haven't needed to know badly enough to
> > > > implement an event.
> > > > 
> > > > >             Before sending patches to implement this new event I had a talk
> > > > > with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
> > > > > event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
> > > > > used by the pseries machine in both error scenarios (MEM and CPU).
> > > > 
> > > > Good point.  One general event is better than two special ones that
> > > > could easily grow siblings.
> > > > 
> > > > > This could be used by x86 as well, although I believe the use of
> > > > > MEM_UNPLUG_ERROR would need to be kept for a while to avoid breaking ABI.
> > > > 
> > > > Yes.  Our rules for interface deprecation apply.
> > > > 
> > > > > Any suggestions/comments?
> > > > 
> > > > We should document the event's reliability.  Are there unplug operations
> > > > where we can't detect failure?  Are there unplug operations where we
> > > > could, but haven't implemented the event?
> > > 
> > > The current version of the PowerPC spec that the pseries machine implements
> > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
> > > error back to the hypervisor. If something fails, the hypervisor is left
> > > in the dark.
> > > 
> > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> > 
> > Do you mean "unreliable way"?
> 
> I guess a better word would be 'reproducible', as in we discovered a reproducible
> way of getting the guest kernel to refuse the CPU hotunplug.

Right.  It's worth noting here that in the PAPR model, there are no
"forced" unplugs.  Unplugs always consist of a request to the guest,
which is then responsible for offlining the device and signalling back
to the hypervisor that it's done with it.

> > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
> > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
> > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> > > that the CPU hotunplug failed in the guest. This is the first scenario we have
> > > today where we want to send a QAPI event informing the CPU hotunplug error,
> > > but this event does not exist in QEMU ATM.
> > 
> > When the timeout kicks in, how can you know the operation failed?  You
> > better be sure when you send an error event.  In other words: what
> > prevents the scenario where the operation is much slower than you
> > expect, the timeout expires, the error event is sent, and then the
> > operation completes successfully?
> 
> A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
> if the guest is under heavy load. The timeout is set to 15 seconds.

Right.  We're well aware that a timeout is an ugly hack, since it's
not really possible to distinguish it from a guest that's just being
really slow.

It was the best thing we could come up with in the short term though.
Without the changes we're suggesting here, the guest can positively
assert the unplug is complete, but it has no way to signal failure.
So, without a timeout qemu is stuck waiting indefinitely (or at least
until the next machine reset) on the guest for an unplug that might
never complete.

It's particularly nasty if the guest does really fail the hotunplug
internally, but then conditions change so that the same unplug could
now succeed.  The unplug request is just a message - there's no guest
visible persistent state saying we want the device unplugged, so if
conditions change the guest won't then re-attempt the unplug.
Meanwhile the user can't re-attempt the device_del, because on the
qemu side it's still stuck in a pending unplug waiting on the guest.
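
This stuck-pending situation can be sketched as a tiny state machine
(names are invented for the illustration; this is not QEMU's DeviceState):

```python
class UnpluggableDevice:
    """Sketch of the guest-driven unplug lifecycle described above.

    There is no guest-visible persistent 'please unplug' state and no
    failure report, so once a request is pending the user cannot
    re-attempt device_del and QEMU waits indefinitely for a guest ack.
    """

    def __init__(self):
        self.state = 'plugged'

    def device_del(self):
        if self.state == 'unplug-pending':
            # The unplug request is just a message; re-issuing is refused
            # while one is outstanding, even if the guest silently failed.
            return 'error: unplug already in progress'
        self.state = 'unplug-pending'
        return 'unplug requested'

    def guest_ack(self):
        # The only way forward: the guest positively asserts completion.
        self.state = 'unplugged'
```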

As we're discussing we think we have a better way, but it relies on
guest changes, so we can't assume we already have that on the qemu
side.

> I'm aware that there's always that special use case, that we haven't seen yet,
> where this assumption is no longer valid. The plan is to change the spec and the
> guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
> lands. For now, timing out the CPU hotunplug process when we're almost certain
> (but not 100%) that it failed in the guest is the best we can do.
> 
> For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
> guest visible changes when we change how we're detecting the unplug errors.
> 
> > > The second scenario is a memory hotunplug error. I found out that the pseries
> > > guest kernel does a reconfiguration step when re-attaching the DIMM right
> > > after refusing the hotunplug, and this reconfiguration is visible to QEMU.
> > > I proceeded to make the pseries machine detect this error case, rollback the
> > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
> > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
> > > well instead of the MEM specific one.
> > > 
> > > This investigation and work in the mem hotunplug error path triggered a
> > > discussion in qemu-ppc, where we're considering whether we should do the same
> > > signalling the kernel does for the DIMM hotunplug error for all other device
> > > hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
> > > and it's currently a no-op. We would make a LOPAR spec change to make this an
> > > official hotunplug error report mechanism, and all pseries hotunplug operations,
> > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
> > > 
> > > Granted, the spec change + Kernel change is not something that we will be able
> > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
> > > already in place would make it easier for the future as well.
> > > 
> > > 
> > > I have a doc draft of these changes/infos that I forgot to post. I would post
> > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
> > > information there as well. Does that work for you as far as documentation
> > > goes?
> > 
> > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
> > and device.

Ack.  Fwiw, I don't think this can ever be more than a "best effort"
notification.  Even with a machine and OS that should support it, a
guest bug or hang could mean that it never acks *or* nacks an unplug
request.

> > I'm not asking you to implement it for all machines and devices.  But
> > I want its design to support growth towards that goal, and its
> > documentation reflect its current state.
> > 
> > In particular, the doc comment in the QAPI schema should list the
> > limitation.  If the event is only implemented for LOPAR for now, say so.
> > If it's only implemented for certain devices, say so.
> > 
> > Questions?
> 
> 
> Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
> in a way that it can be used by other machines in the future, and document
> where the event is being used ATM.
> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-09  3:22         ` David Gibson
@ 2021-03-09  6:22           ` Markus Armbruster
  2021-03-11 20:50             ` Daniel Henrique Barboza
  0 siblings, 1 reply; 19+ messages in thread
From: Markus Armbruster @ 2021-03-09  6:22 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, qemu-ppc, Laine Stump, Paolo Bonzini,
	Igor Mammedov

Cc: Paolo and Julia in addition to Igor, because the thread is wandering
towards DeviceState member pending_deleted_event.

Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
please loop in the right person.

David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>> 
>> 
>> On 3/8/21 2:04 PM, Markus Armbruster wrote:
>> > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> > 
>> > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
[...]
>> > > > We should document the event's reliability.  Are there unplug operations
>> > > > where we can't detect failure?  Are there unplug operations where we
>> > > > could, but haven't implemented the event?
>> > > 
>> > > The current version of the PowerPC spec that the pseries machine implements
>> > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>> > > error back to the hypervisor. If something fails, the hypervisor is left
>> > > in the dark.
>> > > 
>> > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>> > 
>> > Do you mean "unreliable way"?
>> 
>> I guess a better word would be 'reproducible', as in we discovered a reproducible
>> way of getting the guest kernel to refuse the CPU hotunplug.
>
> Right.  It's worth noting here that in the PAPR model, there are no
> "forced" unplugs.  Unplugs always consist of a request to the guest,
> which is then responsible for offlining the device and signalling back
> to the hypervisor that it's done with it.
>
>> > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
>> > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
>> > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>> > > that the CPU hotunplug failed in the guest. This is the first scenario we have
>> > > today where we want to send a QAPI event informing the CPU hotunplug error,
>> > > but this event does not exist in QEMU ATM.
>> > 
>> > When the timeout kicks in, how can you know the operation failed?  You
>> > better be sure when you send an error event.  In other words: what
>> > prevents the scenario where the operation is much slower than you
>> > expect, the timeout expires, the error event is sent, and then the
>> > operation completes successfully?
>> 
>> A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
>> if the guest is under heavy load. The timeout is set to 15 seconds.
>
> Right.  We're well aware that a timeout is an ugly hack, since it's
> not really possible to distinguish it from a guest that's just being
> really slow.

As long as unplug failure cannot be detected reliably, we need a timeout
*somewhere*.  I vaguely recall libvirt has one.  Laine?

Putting the timeout in QEMU might be better.  QEMU might be in a better
position to pick a suitable timeout.

> It was the best thing we could come up with in the short term though.
> Without the changes we're suggesting here, the guest can positively
> assert the unplug is complete, but it has no way to signal failure.
> So, without a timeout qemu is stuck waiting indefinitely (or at least
> until the next machine reset) on the guest for an unplug that might
> never complete.
>
> It's particularly nasty if the guest does really fail the hotunplug
> internally, but then conditions change so that the same unplug could
> now succeed.  The unplug request is just a message - there's no guest
> visible persistent state saying we want the device unplugged, so if
> conditions change the guest won't then re-attempt the unplug.
> Meanwhile the user can't re-attempt the device_del, because on the
> qemu side it's still stuck in a pending unplug waiting on the guest.

"Can't re-attempt" feels like a bug.  Let me explain.

Depending on the device, device_del can complete the unplug, or merely
initiate it.  Documentation:

# Notes: When this command completes, the device may not be removed from the
#        guest.  Hot removal is an operation that requires guest cooperation.
#        This command merely requests that the guest begin the hot removal
#        process.  Completion of the device removal process is signaled with a
#        DEVICE_DELETED event. Guest reset will automatically complete removal
#        for all devices.

This is not as accurate as it could be.  Behavior depends on the device.

For some kinds of devices, the command completes the unplug before it
sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
event gets sent with a slight delay because device cleanup uses RCU.

For others, notably PCI devices, the command only pokes the guest to do
its bit.  QEMU reacts to the guest's actions, which eventually leads to
the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
play ball to the end, the event doesn't get sent.

The "can't retry unplug" behavior is new.  Another device_del used to
simply poke the guest again.  I think this regressed in commit
cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
Feels like a must-fix for 6.0.
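From a management client's point of view, the two-phase semantics above boil down to: issue device_del, then wait for DEVICE_DELETED, with some deadline after which the unplug is presumed stuck.  A minimal sketch of that wait follows; the helper name and the (timestamp, event) stream shape are illustrative, not a real libvirt or QEMU API, though DEVICE_DELETED itself is the real QMP event:

```python
def saw_device_deleted(events, device_id, timeout):
    """Scan QMP events for DEVICE_DELETED for a given device.

    `events` is an iterable of (t, event) pairs, where `t` is seconds
    elapsed since the device_del command and `event` is a decoded QMP
    event dict.  Returns True if the unplug completed within `timeout`
    seconds; False if the deadline passed or the stream ended first.
    A False result is not proof of failure: the guest may still
    finish the unplug later.
    """
    for t, event in events:
        if t > timeout:
            return False  # deadline passed before completion
        if (event.get("event") == "DEVICE_DELETED"
                and event.get("data", {}).get("device") == device_id):
            return True
    return False  # stream ended with the unplug still pending
```

With PAPR's request-only model, a deadline like this is the only failure signal a client has today, which is exactly the ambiguity a failure event would remove.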

> As we're discussing we think we have a better way, but it relies on
> guest changes, so we can't assume we already have that on the qemu
> side.
>
>> I'm aware that there's always that special use case, that we haven't seen yet,
>> where this assumption is no longer valid. The plan is to change the spec and the
>> guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
>> lands. For now, timing out the CPU hotunplug process when we're almost certain
>> (but not 100%) that it failed in the guest is the best we can do.
>> 
>> For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
>> guest visible changes when we change how we're detecting the unplug errors.
>> 
>> > > The second scenario is a memory hotunplug error. I found out that the pseries
>> > > guest kernel does a reconfiguration step when re-attaching the DIMM right
>> > > after refusing the hotunplug, and this reconfiguration is visible to QEMU.
>> > > I proceeded to make the pseries machine detect this error case, rollback the
>> > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
>> > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
>> > > well instead of the MEM specific one.
>> > > 
>> > > This investigation and work in the mem hotunplug error path triggered a
>> > > discussion in qemu-ppc, where we're considering whether we should do the same
>> > > signalling the kernel does for the DIMM hotunplug error for all other device
>> > > hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
>> > > and it's currently a no-op. We would make a LOPAR spec change to make this an
>> > > official hotunplug error report mechanism, and all pseries hotunplug operations,
>> > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>> > > 
>> > > Granted, the spec change + Kernel change is not something that we will be able
>> > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
>> > > already in place would make it easier for the future as well.
>> > > 
>> > > 
>> > > I have a doc draft of these changes/infos that I forgot to post. I would post
>> > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
>> > > information there as well. Does that work for you as far as documentation
>> > > goes?
>> > 
>> > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
>> > and device.
>
> Ack.  Fwiw, I don't think this can ever be more than a "best effort"
> notification.  Even with a machine and OS that should support it, a
> guest bug or hang could mean that it never acks *or* nacks an unplug
> request.

Since the success event is named DEVICE_DELETED, I'd name the
probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
statement of fact "still not deleted" instead of an error (that just
might not be one).
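As a sketch only, such an event could be modelled on the existing DEVICE_DELETED schema entry, e.g. (this is a proposal-shaped illustration based on this thread, not a schema QEMU actually merged; the field names follow DEVICE_DELETED's @device/@path convention):

```
##
# @DEVICE_NOT_DELETED:
#
# Emitted when a requested device removal did not complete, for
# example because the guest rejected it or did not respond in time.
# Best effort only: the device may still be removed later, in which
# case DEVICE_DELETED is emitted as usual.
#
# @device: the device's ID if it has one
#
# @path: the device's QOM path
##
{ 'event': 'DEVICE_NOT_DELETED',
  'data': { '*device': 'str', 'path': 'str' } }
```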

>> > I'm not asking you to implement it for all machines and devices.  But
>> > I want its design to support growth towards that goal, and its
>> > documentation to reflect its current state.
>> > 
>> > In particular, the doc comment in the QAPI schema should list the
>> > limitation.  If the event is only implemented for LOPAR for now, say so.
>> > If it's only implemented for certain devices, say so.
>> > 
>> > Questions?
>> 
>> 
>> Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
>> in a way that it can be used by other machines in the future, and documenting
>> where the event is being used ATM.
>> 
>> 
>> Thanks,
>> 
>> 
>> DHB
>> 
>> 
>> > 
>> 




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-09  6:22           ` Markus Armbruster
@ 2021-03-11 20:50             ` Daniel Henrique Barboza
  2021-03-12  1:19               ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-11 20:50 UTC (permalink / raw)
  To: Markus Armbruster, David Gibson
  Cc: Michael S. Tsirkin, michael.roth, Julia Suvorova, qemu-devel,
	qemu-ppc, Laine Stump, Paolo Bonzini, Igor Mammedov



On 3/9/21 3:22 AM, Markus Armbruster wrote:
> Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> towards DeviceState member pending_deleted_event.
> 
> Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> please loop in the right person.
> 
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
>> On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/8/21 2:04 PM, Markus Armbruster wrote:
>>>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>>>
>>>>> On 3/6/21 3:57 AM, Markus Armbruster wrote:
> [...]
>>>>>> We should document the event's reliability.  Are there unplug operations
>>>>>> where we can't detect failure?  Are there unplug operations where we
>>>>>> could, but haven't implemented the event?
>>>>>
>>>>> The current version of the PowerPC spec that the pseries machine implements
>>>>> (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>>>>> error back to the hypervisor. If something fails, the hypervisor is left
>>>>> in the dark.
>>>>>
>>>>> What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>>>>
>>>> Do you mean "unreliable way"?
>>>
>>> I guess a better word would be 'reproducible', as in we discovered a reproducible
>>> way of getting the guest kernel to refuse the CPU hotunplug.
>>
>> Right.  It's worth noting here that in the PAPR model, there are no
>> "forced" unplugs.  Unplugs always consist of a request to the guest,
>> which is then responsible for offlining the device and signalling back
>> to the hypervisor that it's done with it.
>>
>>>>> making CPU hotunplug fail in the guest (trying to hotunplug the last online
>>>>> CPU) and the pseries machine was unprepared for dealing with it. We ended up
>>>>> implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>>>>> that the CPU hotunplug failed in the guest. This is the first scenario we have
>>>>> today where we want to send a QAPI event informing the CPU hotunplug error,
>>>>> but this event does not exist in QEMU ATM.
>>>>
>>>> When the timeout kicks in, how can you know the operation failed?  You
>>>> better be sure when you send an error event.  In other words: what
>>>> prevents the scenario where the operation is much slower than you
>>>> expect, the timeout expires, the error event is sent, and then the
>>>> operation completes successfully?
>>>
>>> A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
>>> if the guest is under heavy load. The timeout is set to 15 seconds.
>>
>> Right.  We're well aware that a timeout is an ugly hack, since it's
>> not really possible to distinguish it from a guest that's just being
>> really slow.
> 
> As long as unplug failure cannot be detected reliably, we need a timeout
> *somewhere*.  I vaguely recall libvirt has one.  Laine?

Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
the timeout makes more sense since it has more information about the
conditions/mechanics involved.

At this moment, the only case I know of hotunplug operations timing out in
QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
an unplug timeout is a desirable mechanism for other machines/device types.

> 
> Putting the timeout in QEMU might be better.  QEMU might be in a better
> position to pick a suitable timeout.
> 
>> It was the best thing we could come up with in the short term though.
>> Without the changes we're suggesting here, the guest can positively
>> assert the unplug is complete, but it has no way to signal failure.
>> So, without a timeout qemu is stuck waiting indefinitely (or at least
>> until the next machine reset) on the guest for an unplug that might
>> never complete.
>>
>> It's particularly nasty if the guest does really fail the hotunplug
>> internally, but then conditions change so that the same unplug could
>> now succeed.  The unplug request is just a message - there's no guest
>> visible persistent state saying we want the device unplugged, so if
>> conditions change the guest won't then re-attempt the unplug.
>> Meanwhile the user can't re-attempt the device_del, because on the
>> qemu side it's still stuck in a pending unplug waiting on the guest.
> 
> "Can't re-attempt" feels like a bug.  Let me explain.


So, what David mentioned above is related to pseries internals I believe.

Up to 5.2.0 the pseries machine was silently ignoring all 'device_del'
issued for devices that were in the middle of the unplug process, with the
exception of DIMMs where we bothered to throw an error message back to the
user.

In 6.0.0 the code was reworked, and now we're always letting the user know
when the 'device_del' was ignored due to an ongoing hotunplug of the device.
And for CPUs we also report the remaining timeout. We're still not sending
multiple hotunplug IRQ pulses to the guest, but at least the user is now
informed about it.

As for the commit mentioned below:

> 
> Depending on the device, device_del can complete the unplug, or merely
> initiate it.  Documentation:
> 
> # Notes: When this command completes, the device may not be removed from the
> #        guest.  Hot removal is an operation that requires guest cooperation.
> #        This command merely requests that the guest begin the hot removal
> #        process.  Completion of the device removal process is signaled with a
> #        DEVICE_DELETED event. Guest reset will automatically complete removal
> #        for all devices.
> 
> This is not as accurate as it could be.  Behavior depends on the device.
> 
> For some kinds of devices, the command completes the unplug before it
> sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
> event gets sent with a slight delay because device cleanup uses RCU.
> 
> For others, notably PCI devices, the command only pokes the guest to do
> its bit.  QEMU reacts to the guest's actions, which eventually leads to
> the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
> play ball to the end, the event doesn't get sent.
> 
> The "can't retry unplug" behavior is new.  Another device_del used to
> simply poke the guest again.  I think this regressed in commit
> cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
> Feels like a must-fix for 6.0.


This doesn't directly affect pseries code because we're not using
dev->pending_deleted_event to track the pending unplug status. We're
using an internal flag that is related to the DRC (the 'hotplug state'
of the device).

The commit seems a bit odd because it is making a change in the common code
inside qmp_device_del() based on a PCI-e specific behavior. In the end this
doesn't impact any other device but PCI-e (it is the only device that uses
dev->pending_deleted_event to mark the start and finish of the unplug process),
but it's not something that I would expect in that function.

IMO this verification should be removed from qmp_device_del and moved to
pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe devices
without making assumptions about everything else.


> 
>> As we're discussing we think we have a better way, but it relies on
>> guest changes, so we can't assume we already have that on the qemu
>> side.
>>
>>> I'm aware that there's always that special use case, that we haven't seen yet,
>>> where this assumption is no longer valid. The plan is to change the spec and the
>>> guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
>>> lands. For now, timing out the CPU hotunplug process when we're almost certain
>>> (but not 100%) that it failed in the guest is the best we can do.
>>>
>>> For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
>>> guest visible changes when we change how we're detecting the unplug errors.
>>>
>>>>> The second scenario is a memory hotunplug error. I found out that the pseries
>>>>> guest kernel does a reconfiguration step when re-attaching the DIMM right
>>>>> after refusing the hotunplug, and this reconfiguration is visible to QEMU.
>>>>> I proceeded to make the pseries machine detect this error case, rollback the
>>>>> unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
>>>>> by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
>>>>> well instead of the MEM specific one.
>>>>>
>>>>> This investigation and work in the mem hotunplug error path triggered a
>>>>> discussion in qemu-ppc, where we're considering whether we should do the same
>>>>> signalling the kernel does for the DIMM hotunplug error for all other device
>>>>> hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
>>>>> and it's currently a no-op. We would make a LOPAR spec change to make this an
>>>>> official hotunplug error report mechanism, and all pseries hotunplug operations,
>>>>> for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>>>>>
>>>>> Granted, the spec change + Kernel change is not something that we will be able
>>>>> to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
>>>>> already in place would make it easier for the future as well.
>>>>>
>>>>>
>>>>> I have a doc draft of these changes/infos that I forgot to post. I would post
>>>>> it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
>>>>> information there as well. Does that work for you as far as documentation
>>>>> goes?
>>>>
>>>> A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
>>>> and device.
>>
>> Ack.  Fwiw, I don't think this can ever be more than a "best effort"
>> notification.  Even with a machine and OS that should support it, a
>> guest bug or hang could mean that it never acks *or* nacks an unplug
>> request.
> 
> Since the success event is named DEVICE_DELETED, I'd name the
> probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
> statement of fact "still not deleted" instead of an error (that just
> might not be one).


"DEVICE_NOT_DELETED" sounds way better for what we want to express in the
pseries case when we can't be 100% sure of a guest side error. However,
there is at least one case where I'm sure of a guest side error (where I'm
using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that case.


Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can use
both in pseries-6.0.0.
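If both events were added, a client would have to fold them, plus the pre-existing MEM_UNPLUG_ERROR kept for compatibility, into a three-way outcome.  A rough sketch, with the event names taken from this thread's proposal rather than any final schema:

```python
def unplug_outcome(event):
    """Classify a QMP event with respect to a pending hotunplug.

    DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR are the names proposed
    in this thread (not necessarily what QEMU finally merged);
    MEM_UNPLUG_ERROR is the older memory-specific event that clients
    must keep handling for older QEMUs.
    """
    name = event.get("event")
    if name == "DEVICE_DELETED":
        return "completed"
    if name in ("DEVICE_UNPLUG_ERROR", "MEM_UNPLUG_ERROR"):
        return "failed"           # guest positively refused the unplug
    if name == "DEVICE_NOT_DELETED":
        return "probably-failed"  # timeout-style signal; may still complete
    return "unrelated"
```

The distinction matters to a client: "failed" means it may safely retry or report an error, while "probably-failed" only means the deadline passed.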




Thanks,


DHB


> 
>>>> I'm not asking you to implement it for all machines and devices.  But
>>>> I want its design to support growth towards that goal, and its
>>>> documentation to reflect its current state.
>>>>
>>>> In particular, the doc comment in the QAPI schema should list the
>>>> limitation.  If the event is only implemented for LOPAR for now, say so.
>>>> If it's only implemented for certain devices, say so.
>>>>
>>>> Questions?
>>>
>>>
>>> Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
>>> in a way that it can be used by other machines in the future, and documenting
>>> where the event is being used ATM.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> DHB
>>>
>>>
>>>>
>>>
> 



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-11 20:50             ` Daniel Henrique Barboza
@ 2021-03-12  1:19               ` David Gibson
  2021-03-12  8:12                 ` Markus Armbruster
  2021-03-12 13:38                 ` Laine Stump
  0 siblings, 2 replies; 19+ messages in thread
From: David Gibson @ 2021-03-12  1:19 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Michael S. Tsirkin, michael.roth, Julia Suvorova, qemu-devel,
	Markus Armbruster, qemu-ppc, Laine Stump, Paolo Bonzini,
	Igor Mammedov


On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/9/21 3:22 AM, Markus Armbruster wrote:
> > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> > towards DeviceState member pending_deleted_event.
> > 
> > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> > please loop in the right person.
> > 
> > David Gibson <david@gibson.dropbear.id.au> writes:
> > 
> > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
> > > > 
> > > > 
> > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
> > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > > > > 
> > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> > [...]
> > > > > > > We should document the event's reliability.  Are there unplug operations
> > > > > > > where we can't detect failure?  Are there unplug operations where we
> > > > > > > could, but haven't implemented the event?
> > > > > > 
> > > > > > The current version of the PowerPC spec that the pseries machine implements
> > > > > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
> > > > > > error back to the hypervisor. If something fails, the hypervisor is left
> > > > > > in the dark.
> > > > > > 
> > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> > > > > 
> > > > > Do you mean "unreliable way"?
> > > > 
> > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
> > > > way of getting the guest kernel to refuse the CPU hotunplug.
> > > 
> > > Right.  It's worth noting here that in the PAPR model, there are no
> > > "forced" unplugs.  Unplugs always consist of a request to the guest,
> > > which is then responsible for offlining the device and signalling back
> > > to the hypervisor that it's done with it.
> > > 
> > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
> > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
> > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
> > > > > > today where we want to send a QAPI event informing the CPU hotunplug error,
> > > > > > but this event does not exist in QEMU ATM.
> > > > > 
> > > > > When the timeout kicks in, how can you know the operation failed?  You
> > > > > better be sure when you send an error event.  In other words: what
> > > > > prevents the scenario where the operation is much slower than you
> > > > > expect, the timeout expires, the error event is sent, and then the
> > > > > operation completes successfully?
> > > > 
> > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
> > > > if the guest is under heavy load. The timeout is set to 15 seconds.
> > > 
> > > Right.  We're well aware that a timeout is an ugly hack, since it's
> > > not really possible to distinguish it from a guest that's just being
> > > really slow.
> > 
> > As long as unplug failure cannot be detected reliably, we need a timeout
> > *somewhere*.  I vaguely recall libvirt has one.  Laine?
> 
> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
> the timeout makes more sense since it has more information about the
> conditions/mechanics involved.

Right.  In particular, I can't really see how libvirt can fully
implement that timeout.  AFAIK qemu has no way of listing or
cancelling "in flight" unplug requests, so it's entirely possible that
the unplug could still complete after libvirt has "timed out".

> At this moment, the only case I know of hotunplug operations timing out in
> QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
> an unplug timeout is a desirable mechanism for other machines/device types.
> 
> > Putting the timeout in QEMU might be better.  QEMU might be in a better
> > position to pick a suitable timeout.
> > 
> > > It was the best thing we could come up with in the short term though.
> > > Without the changes we're suggesting here, the guest can positively
> > > assert the unplug is complete, but it has no way to signal failure.
> > > So, without a timeout qemu is stuck waiting indefinitely (or at least
> > > until the next machine reset) on the guest for an unplug that might
> > > never complete.
> > > 
> > > It's particularly nasty if the guest does really fail the hotunplug
> > > internally, but then conditions change so that the same unplug could
> > > now succeed.  The unplug request is just a message - there's no guest
> > > visible persistent state saying we want the device unplugged, so if
> > > conditions change the guest won't then re-attempt the unplug.
> > > Meanwhile the user can't re-attempt the device_del, because on the
> > > qemu side it's still stuck in a pending unplug waiting on the guest.
> > 
> > "Can't re-attempt" feels like a bug.  Let me explain.

You may be right.  Perhaps we should just send another unplug message
if we get a device_del on something that's already pending deletion.
AFAICT repeated unplug messages for the same device should be
harmless.

> So, what David mentioned above is related to pseries internals I believe.
> 
> Up to 5.2.0 the pseries machine was silently ignoring all 'device_del'
> issued for devices that were in the middle of the unplug process, with the
> exception of DIMMs where we bothered to throw an error message back to the
> user.
> 
> In 6.0.0 the code was reworked, and now we're always letting the user know
> when the 'device_del' was ignored due to an ongoing hotunplug of the device.
> And for CPUs we also report the remaining timeout. We're still not sending
> multiple hotunplug IRQ pulses to the guest, but at least the user is now
> informed about it.
> 
> As for the commit mentioned below:
> 
> > 
> > Depending on the device, device_del can complete the unplug, or merely
> > initiate it.  Documentation:
> > 
> > # Notes: When this command completes, the device may not be removed from the
> > #        guest.  Hot removal is an operation that requires guest cooperation.
> > #        This command merely requests that the guest begin the hot removal
> > #        process.  Completion of the device removal process is signaled with a
> > #        DEVICE_DELETED event. Guest reset will automatically complete removal
> > #        for all devices.
> > 
> > This is not as accurate as it could be.  Behavior depends on the device.
> > 
> > For some kinds of devices, the command completes the unplug before it
> > sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
> > event gets sent with a slight delay because device cleanup uses RCU.
> > 
> > For others, notably PCI devices, the command only pokes the guest to do
> > its bit.  QEMU reacts to the guest's actions, which eventually leads to
> > the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
> > play ball to the end, the event doesn't get sent.
> > 
> > The "can't retry unplug" behavior is new.  Another device_del used to
> > simply poke the guest again.

Maybe on x86.  I think a repeated device_del was at best a no-op on
PAPR.

> > I think this regressed in commit
> > cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
> > Feels like a must-fix for 6.0.
> 
> 
> This doesn't directly affect pseries code because we're not using
> dev->pending_deleted_event to track the pending unplug status. We're

Huh... I didn't know about pending_deleted_event.  Maybe we *should*
be using that.  Handling the migration will be painful, of course.

> using an internal flag that is related to the DRC (the 'hotplug state'
> of the device).
> 
> The commit seems a bit odd because it is making a change in the common code
> inside qmp_device_del() based on a PCI-e specific behavior. In the end this
> doesn't impact any other device but PCI-e (it is the only device that uses
> dev->pending_deleted_event to mark the start and finish of the unplug process),
> but it's not something that I would expect in that function.
> 
> IMO this verification should be removed from qmp_device_del and moved to
> pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe devices
> without making assumptions about everything else.
> 
> 
> > 
> > > As we're discussing we think we have a better way, but it relies on
> > > guest changes, so we can't assume we already have that on the qemu
> > > side.
> > > 
> > > > I'm aware that there's always that special use case, that we haven't seen yet,
> > > > where this assumption is no longer valid. The plan is to change the spec and the
> > > > guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
> > > > lands. For now, timing out the CPU hotunplug process when we're almost certain
> > > > (but not 100%) that it failed in the guest is the best we can do.
> > > > 
> > > > For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
> > > > guest visible changes when we change how we're detecting the unplug errors.
> > > > 
> > > > > > The second scenario is a memory hotunplug error. I found out that the pseries
> > > > > > guest kernel does a reconfiguration step when re-attaching the DIMM right
> > > > > > after refusing the hotunplug, and this reconfiguration is visible to QEMU.
> > > > > > I proceeded to make the pseries machine detect this error case, rollback the
> > > > > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
> > > > > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
> > > > > > well instead of the MEM specific one.
> > > > > > 
> > > > > > This investigation and work in the mem hotunplug error path triggered a
> > > > > > discussion in qemu-ppc, where we're considering whether we should do the same
> > > > > > signalling the kernel does for the DIMM hotunplug error for all other device
> > > > > > hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
> > > > > > and it's currently a no-op. We would make a LOPAR spec change to make this an
> > > > > > official hotunplug error report mechanism, and all pseries hotunplug operations,
> > > > > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
> > > > > > 
> > > > > > Granted, the spec change + Kernel change is not something that we will be able
> > > > > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
> > > > > > already in place would make it easier for the future as well.
> > > > > > 
> > > > > > 
> > > > > > I have a doc draft of these changes/infos that I forgot to post. I would post
> > > > > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
> > > > > > information there as well. Does that work for you as far as documentation
> > > > > > goes?
> > > > > 
> > > > > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
> > > > > and device.
> > > 
> > > Ack.  Fwiw, I don't think this can ever be more than a "best effort"
> > > notification.  Even with a machine and OS that should support it, a
> > > guest bug or hang could mean that it never acks *or* nacks an unplug
> > > request.
> > 
> > Since the success event is named DEVICE_DELETED, I'd name the
> > probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
> > statement of fact "still not deleted" instead of an error (that just
> > might not be one).
> 
> 
> "DEVICE_NOT_DELETED" sounds way better for what we want to express in the
> pseries case when we can't be 100% sure of a guest side error. However,
> there is at least one case where I'm sure of a guest side error (where I'm
> using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that case.
> 
> 
> Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can use
> both in pseries-6.0.0.
> 
> 
> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> > > > > I'm not asking you to implement it for all machines and devices.  But
> > > > > I want its design to support growth towards that goal, and its
> > > > > documentation to reflect its current state.
> > > > > 
> > > > > In particular, the doc comment in the QAPI schema should list the
> > > > > limitation.  If the event is only implemented for LOPAR for now, say so.
> > > > > If it's only implemented for certain devices, say so.
> > > > > 
> > > > > Questions?
> > > > 
> > > > 
> > > > Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
> > > > in a way that it can be used by other machines in the future, and documenting
> > > > where the event is being used ATM.
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > 
> > > > DHB
> > > > 
> > > > 
> > > > > 
> > > > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-12  1:19               ` David Gibson
@ 2021-03-12  8:12                 ` Markus Armbruster
  2021-03-19  7:55                   ` Markus Armbruster
  2021-03-22  6:39                   ` David Gibson
  2021-03-12 13:38                 ` Laine Stump
  1 sibling, 2 replies; 19+ messages in thread
From: Markus Armbruster @ 2021-03-12  8:12 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Juan Quintela, qemu-ppc, Laine Stump,
	Igor Mammedov, Paolo Bonzini

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
>> 
>> 
>> On 3/9/21 3:22 AM, Markus Armbruster wrote:
>> > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
>> > towards DeviceState member pending_deleted_event.
>> > 
>> > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
>> > please loop in the right person.
>> > 
>> > David Gibson <david@gibson.dropbear.id.au> writes:
>> > 
>> > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>> > > > 
>> > > > 
>> > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
>> > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> > > > > 
>> > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
>> > [...]
>> > > > > > > We should document the event's reliability.  Are there unplug operations
>> > > > > > > where we can't detect failure?  Are there unplug operations where we
>> > > > > > > could, but haven't implemented the event?
>> > > > > > 
>> > > > > > The current version of the PowerPC spec that the pseries machine implements
>> > > > > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>> > > > > > error back to the hypervisor. If something fails, the hypervisor is left
>> > > > > > in the dark.
>> > > > > > 
>> > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>> > > > > 
>> > > > > Do you mean "unreliable way"?
>> > > > 
>> > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
>> > > > way of getting the guest kernel to refuse the CPU hotunplug.
>> > > 
>> > > Right.  It's worth noting here that in the PAPR model, there are no
>> > > "forced" unplugs.  Unplugs always consist of a request to the guest,
>> > > which is then responsible for offlining the device and signalling back
>> > > to the hypervisor that it's done with it.
>> > > 
>> > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
>> > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
>> > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>> > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
>> > > > > > today where we want to send a QAPI event informing the CPU hotunplug error,
>> > > > > > but this event does not exist in QEMU ATM.
>> > > > > 
>> > > > > When the timeout kicks in, how can you know the operation failed?  You
>> > > > > better be sure when you send an error event.  In other words: what
>> > > > > prevents the scenario where the operation is much slower than you
>> > > > > expect, the timeout expires, the error event is sent, and then the
>> > > > > operation completes successfully?
>> > > > 
>> > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
>> > > > if the guest is under heavy load. The timeout is set to 15 seconds.
>> > > 
>> > > Right.  We're well aware that a timeout is an ugly hack, since it's
>> > > not really possible to distinguish it from a guest that's just being
>> > > really slow.
>> > 
>> > As long as unplug failure cannot be detected reliably, we need a timeout
>> > *somewhere*.  I vaguely recall libvirt has one.  Laine?
>> 
>> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
>> the timeout makes more sense since it has more information about the
>> conditions/mechanics involved.
>
> Right.  In particular, I can't really see how libvirt can fully
> implement that timeout.  AFAIK qemu has no way of listing or
> cancelling "in flight" unplug requests, so it's entirely possible that
> the unplug could still complete after libvirt has "timed out".

QEMU doesn't really keep track of "in flight" unplug requests, and as
long as that's the case, its timeout event will have the same issue.

>> At this moment, the only case I know of hotunplug operations timing out in
>> QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
>> unplug timeout is desirable mechanism for other machines/device types.

I think we first need to nail down what the event means.  It could mean
"hot unplug definitely failed" or "hot unplug did not complete in time".

Any operation that is not instantaneous goes through a state where it
neither succeeded nor failed.

When the duration of that state is unbounded, no timeout can get us out
of it.  When exiting the state requires guest cooperation, its duration
is unbounded.  "Definitely failed" semantics is unattainable.

It is attainable if we can somehow cancel the unplug, because that
forces transition to "failed".

I suspect we have at least three kinds of unplugs: effectively
instantaneous (command completes the job, e.g. USB), unbounded and not
cancelable, unbounded and cancelable.

When we only know "did not complete in time", the management application
needs a way to retry the unplug regardless of actual state.  If the
unplug succeeded meanwhile (sending DEVICE_DELETED), the retry should
fail.  If it failed meanwhile, start over.  If it's still in progress,
do nothing.

We can choose to leave cancelling to the management application.  Ditch
"definitely failed", report "did not complete in time".  The management
application can then either cancel or retry.  The former succeeds if it
canceled the unplug, fails if the unplug completed meanwhile (which also
sent DEVICE_DELETED).

I think before we can continue reworking the code, we need consensus on
a hot unplug state machine that works for all known cases, or at least
the cases we have in QEMU.
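
To make that concrete, the states and the retry/cancel rules described
above could be sketched roughly like this.  This is an illustration
only, not existing QEMU code: the type and function names (UnplugState,
unplug_retry(), unplug_cancel()) are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch, not actual QEMU code. */
typedef enum UnplugState {
    UNPLUG_IDLE,        /* no unplug requested */
    UNPLUG_REQUESTED,   /* request sent to the guest, outcome unknown */
    UNPLUG_DONE,        /* device gone, DEVICE_DELETED emitted */
    UNPLUG_FAILED,      /* only reachable where failure is detectable */
} UnplugState;

typedef enum UnplugKind {
    UNPLUG_INSTANT,     /* e.g. USB: the command completes the job */
    UNPLUG_UNBOUNDED,   /* needs guest cooperation, not cancelable */
    UNPLUG_CANCELABLE,  /* needs guest cooperation, cancelable */
} UnplugKind;

/*
 * A retry (another device_del) (re)issues the request after a detected
 * failure, is a no-op while the request is still in flight, and must
 * fail once the device is already gone.
 */
UnplugState unplug_retry(UnplugState s)
{
    switch (s) {
    case UNPLUG_IDLE:
    case UNPLUG_FAILED:
        return UNPLUG_REQUESTED;    /* (re)send the request to the guest */
    case UNPLUG_REQUESTED:
        return UNPLUG_REQUESTED;    /* still in progress: do nothing */
    case UNPLUG_DONE:
        return UNPLUG_DONE;         /* retry fails: already deleted */
    }
    return s;
}

/* Cancel forces "failed", but only for a cancelable, still-pending unplug. */
bool unplug_cancel(UnplugKind k, UnplugState *s)
{
    if (k == UNPLUG_CANCELABLE && *s == UNPLUG_REQUESTED) {
        *s = UNPLUG_FAILED;
        return true;
    }
    return false;   /* completed meanwhile, or not cancelable */
}
```

Whatever shape we settle on, the important property is the one above:
"did not complete in time" never forces a transition by itself; only a
positive guest ack, a detected failure, or a successful cancel moves the
state machine.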

>> > Putting the timeout in QEMU might be better.  QEMU might be in a better
>> > position to pick a suitable timeout.
>> > 
>> > > It was the best thing we could come up with in the short term though.
>> > > Without the changes we're suggesting here, the guest can positively
>> > > assert the unplug is complete, but it has no way to signal failure.
>> > > So, without a timeout qemu is stuck waiting indefinitely (or at least
>> > > until the next machine reset) on the guest for an unplug that might
>> > > never complete.
>> > > 
>> > > It's particularly nasty if the guest does really fail the hotplug
>> > > internally, but then conditions change so that the same unplug could
>> > > now succeed.  The unplug request is just a message - there's no guest
>> > > visible persistent state saying we want the device unplugged, so if
>> > > conditions change the guest won't then re-attempt the unplug.
>> > > Meanwhile the user can't re-attempt the device_del, because on the
>> > > qemu side it's still stuck in a pending unplug waiting on the guest.
>> > 
>> > "Can't re-attempt" feels like a bug.  Let me explain.
>
> You may be right.  Perhaps we should just send another unplug message
> if we get a device_del on something that's already pending deletion.
> AFAICT repeated unplug messages for the same device should be
> harmless.

Consider physical PCI hot unplug.  You press a button next to the slot,
wait for the (figuratively) green light, then pull out the card.  You
can press the button as many times as you want.

>> So, what David mentioned above is related to pseries internals I believe.
>> 
>> Up to 5.2.0 the pseries machine was silently ignoring all 'device_del'
>> issued for devices that were in the middle of the unplug process, with the
>> exception of DIMMs where we bothered to throw an error message back to the
>> user.
>> 
>> In 6.0.0 the code was reworked, and now we're always letting the user know
>> when the 'device_del' was ignored due to an ongoing hotunplug of the device.
>> And for CPUs we also tell the timeout remaining. We're still not sending
>> multiple hotunplug IRQ pulses to the guest, but at least the user is now
>> informed about it.

This is possible because you know unplug is in progress.  I'm not sure
we can know for all devices.

>> As for the commit mentioned below:
>> 
>> > 
>> > Depending on the device, device_del can complete the unplug, or merely
>> > initiate it.  Documentation:
>> > 
>> > # Notes: When this command completes, the device may not be removed from the
>> > #        guest.  Hot removal is an operation that requires guest cooperation.
>> > #        This command merely requests that the guest begin the hot removal
>> > #        process.  Completion of the device removal process is signaled with a
>> > #        DEVICE_DELETED event. Guest reset will automatically complete removal
>> > #        for all devices.
>> > 
>> > This is not as accurate as it could be.  Behavior depends on the device.
>> > 
>> > For some kinds of devices, the command completes the unplug before it
>> > sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
>> > event gets sent with a slight delay because device cleanup uses RCU.
>> > 
>> > For others, notably PCI devices, the command only pokes the guest to do
>> > its bit.  QEMU reacts to the guest's actions, which eventually leads to
>> > the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
>> > play ball to the end, the event doesn't get sent.
>> > 
>> > The "can't retry unplug" behavior is new.  Another device_del used to
>> > simply poke the guest again.
>
> Maybe on x86.  I think a repeated device_del was at best a no-op on
> PAPR.

There's certainly more than one way to skin this cat.  When QEMU knows
the guest is still working on the unplug, poking the guest again is
useless at best.

>> > I think this regressed in commit
>> > cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
>> > Feels like a must-fix for 6.0.
>> 
>> 
>> This doesn't directly affect pseries code because we're not using
>> dev->pending_deleted_event to track the pending unplug status. We're
>
> Huh... I didn't know about pending_deleted_event.  Maybe we *should*
> be using that.  Handling the migration will be painful, of course.

Paolo and I believe the code around @pending_deleted_event has become
wrong.

@pending_deleted_event was created to get DEVICE_DELETED right for
recursively deleted devices (commit 352e8da74).  It did *not* "track the
pending unplug status".

It was later appropriated for other purposes (commit c000a9bd0,
9711cd0df, cce8944cc).  Paolo and I believe these are wrong.

>> using an internal flag that is related to the DRC (the 'hotplug state'
>> of the device).
>> 
>> The commit seems a bit odd because it is making a change in the common code
>> inside qmp_device_del() based on a PCI-e specific behavior. In the end this
>> doesn't impact any other device but PCI-e (it is the only device that uses
>> dev->pending_deleted_event to mark the start and finish of the unplug process),
>> but it's not something that I would expect in that function.

Concur.

>> IMO this verification should be removed from qmp_device_del and moved to
>> pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe devices
>> without making assumptions about everything else.

A subset of {Igor, Julia, Juan} intends to look into this in time for 6.0.

>> > > As we're discussing we think we have a better way, but it relies on
>> > > guest changes, so we can't assume we already have that on the qemu
>> > > side.
>> > > 
>> > > > I'm aware that there's always that special use case, that we haven't seen yet,
>> > > > where this assumption is no longer valid. The plan is to change the spec and the
>> > > > guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
>> > > > lands. For now, timing out the CPU hotunplug process when we're almost certain
>> > > > (but not 100%) that it failed in the guest is the best we can do.
>> > > > 
>> > > > For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
>> > > > guest visible changes when we change how we're detecting the unplug errors.
>> > > > 
>> > > > > > The second scenario is a memory hotunplug error. I found out that the pseries
>> > > > > > guest kernel does a reconfiguration step when re-attaching the DIMM right
>> > > > > > after refusing the hotunplug, and this reconfiguration is visible to QEMU.
>> > > > > > I proceeded to make the pseries machine detect this error case, rollback the
>> > > > > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
>> > > > > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
>> > > > > > well instead of the MEM specific one.
>> > > > > > 
>> > > > > > This investigation and work in the mem hotunplug error path triggered a
>> > > > > > discussion in qemu-ppc, where we're considering whether we should do the same
>> > > > > > signalling the kernel does for the DIMM hotunplug error for all other device
>> > > > > > hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
>> > > > > > and it's currently a no-op. We would make a LOPAR spec change to make this an
>> > > > > > official hotunplug error report mechanism, and all pseries hotunplug operations,
>> > > > > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>> > > > > > 
>> > > > > > Granted, the spec change + Kernel change is not something that we will be able
>> > > > > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
>> > > > > > already in place would make it easier for the future as well.
>> > > > > > 
>> > > > > > 
>> > > > > > I have a doc draft of these changes/infos that I forgot to post. I would post
>> > > > > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
>> > > > > > information there as well. Does that work for you as far as documentation
>> > > > > > goes?
>> > > > > 
>> > > > > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
>> > > > > and device.
>> > > 
>> > > Ack.  Fwiw, I don't think this can ever be more than a "best effort"
>> > > notification.  Even with a machine and OS that should support it, a
>> > > guest bug or hang could mean that it never acks *or* nacks an unplug
>> > > request.
>> > 
>> > Since the success event is named DEVICE_DELETED, I'd name the
>> > probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
>> > statement of fact "still not deleted" instead of an error (that just
>> > might not be one).
>> 
>> 
>> "DEVICE_NOT_DELETED" sounds way better for what we want to express in the
>> pseries case when we can't be 100% sure of a guest side error. However,
>> there is at least one case where I'm sure of a guest side error (where I'm
>> using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that case.
>> 
>> 
>> Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can use
>> both in pseries-6.0.0.

Separate QMP events are advisable when their meaning is different enough
for a management application to want to tell them apart.
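
For illustration, the two events could be declared along these lines in
the QAPI schema.  This is a sketch of the proposal only, not existing
schema code; the members simply mirror DEVICE_DELETED's @device and
@path, and the exact names and wording are up for discussion.

```
##
# @DEVICE_NOT_DELETED:
#
# Emitted when a requested device unplug did not complete within the
# expected time frame.  This is not necessarily an error: the device
# may still be removed later, in which case DEVICE_DELETED is emitted
# as usual.
#
# @device: the device's ID if it has one
# @path: the device's QOM path
##
{ 'event': 'DEVICE_NOT_DELETED',
  'data': { '*device': 'str', 'path': 'str' } }

##
# @DEVICE_UNPLUG_ERROR:
#
# Emitted when a device unplug positively failed, e.g. because the
# guest rejected it.  Only emitted where failure can be detected.
#
# @device: the device's ID if it has one
# @path: the device's QOM path
##
{ 'event': 'DEVICE_UNPLUG_ERROR',
  'data': { '*device': 'str', 'path': 'str' } }
```

A management application could then treat DEVICE_NOT_DELETED as "retry
or cancel" and DEVICE_UNPLUG_ERROR as a definitive failure.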

[...]



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-12  1:19               ` David Gibson
  2021-03-12  8:12                 ` Markus Armbruster
@ 2021-03-12 13:38                 ` Laine Stump
  2021-03-22  6:43                   ` David Gibson
  1 sibling, 1 reply; 19+ messages in thread
From: Laine Stump @ 2021-03-12 13:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, Markus Armbruster, qemu-ppc, Paolo Bonzini,
	Igor Mammedov, David Gibson

On 3/11/21 8:19 PM, David Gibson wrote:
> On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/9/21 3:22 AM, Markus Armbruster wrote:
>>> Cc: Paolo and Julia in addition to Igor, because the thread is wandering
>>> towards DeviceState member pending_deleted_event.
>>>
>>> Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
>>> please loop in the right person.
>>>
>>> David Gibson <david@gibson.dropbear.id.au> writes:
>>>
>>>> On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/8/21 2:04 PM, Markus Armbruster wrote:
>>>>>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>>>>>
>>>>>>> On 3/6/21 3:57 AM, Markus Armbruster wrote:
>>> [...]
>>>>>>>> We should document the event's reliability.  Are there unplug operations
>>>>>>>> where we can't detect failure?  Are there unplug operations where we
>>>>>>>> could, but haven't implemented the event?
>>>>>>>
>>>>>>> The current version of the PowerPC spec that the pseries machine implements
>>>>>>> (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>>>>>>> error back to the hypervisor. If something fails, the hypervisor is left
>>>>>>> in the dark.
>>>>>>>
>>>>>>> What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>>>>>>
>>>>>> Do you mean "unreliable way"?
>>>>>
>>>>> I guess a better word would be 'reproducible', as in we discovered a reproducible
>>>>> way of getting the guest kernel to refuse the CPU hotunplug.
>>>>
>>>> Right.  It's worth noting here that in the PAPR model, there are no
>>>> "forced" unplugs.  Unplugs always consist of a request to the guest,
>>>> which is then responsible for offlining the device and signalling back
>>>> to the hypervisor that it's done with it.
>>>>
>>>>>>> making CPU hotunplug fail in the guest (trying to hotunplug the last online
>>>>>>> CPU) and the pseries machine was unprepared for dealing with it. We ended up
>>>>>>> implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>>>>>>> that the CPU hotunplug failed in the guest. This is the first scenario we have
>>>>>>> today where we want to send a QAPI event informing the CPU hotunplug error,
>>>>>>> but this event does not exist in QEMU ATM.
>>>>>>
>>>>>> When the timeout kicks in, how can you know the operation failed?  You
>>>>>> better be sure when you send an error event.  In other words: what
>>>>>> prevents the scenario where the operation is much slower than you
>>>>>> expect, the timeout expires, the error event is sent, and then the
>>>>>> operation completes successfully?
>>>>>
>>>>> A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
>>>>> if the guest is under heavy load. The timeout is set to 15 seconds.
>>>>
>>>> Right.  We're well aware that a timeout is an ugly hack, since it's
>>>> not really possible to distinguish it from a guest that's just being
>>>> really slow.
>>>
>>> As long as unplug failure cannot be detected reliably, we need a timeout
>>> *somewhere*.  I vaguely recall libvirt has one.  Laine?
>>
>> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
>> the timeout makes more sense since it has more information about the
>> conditions/mechanics involved.
> 
> Right.  In particular, I can't really see how libvirt can fully
> implement that timeout.  AFAIK qemu has no way of listing or
> cancelling "in flight" unplug requests, so it's entirely possible that
> the unplug could still complete after libvirt has "timed out".

(I'm purposefully not trying to understand the full context of all of 
this, as I'm mostly unconnected right now, and only popped in at the 
mention of my name and CC. So forgive me if I'm missing some of the 
details of the conversation - I'm only here to give context about 
libvirt's timeout during unplug)

I didn't remember there being any sort of timeout for unplugs in 
libvirt, so I went back into the code and found that there *is* a 
timeout (I'd just forgotten that I'd ever seen it), but (for PCI 
devices, which is the only hotplug/unplug code I have any real 
familiarity with) it is just a short wait (10 seconds for PPC, 5 seconds for 
other platforms) to see if the unplug can complete before the public API 
returns; if there is a "timeout" then we still return success (after 
logging a warning message) but the device stays in the domain's device 
list, and nothing else is changed unless/until we receive a 
DEVICE_DELETED event from qemu (completely asynchronous - libvirt's API 
has long ago returned success) - only then do we remove the device from 
libvirt's domain status. libvirt won't/can't ever go back and 
retroactively fail the API that's already completed successfully though :-)

For VCPUs (which I guess is what you're exclusively talking about here) 
it looks like libvirt waits for the same 5-10 seconds (during the unplug 
API call) and if it hasn't received DEVICE_DELETED, then it returns an 
error:

     if ((rc = qemuDomainWaitForDeviceRemoval(vm)) <= 0) {
         if (rc == 0)
             virReportError(VIR_ERR_OPERATION_TIMEOUT, "%s",
                            _("vcpu unplug request timed out. Unplug result "
                              "must be manually inspected in the domain"));

         goto cleanup;
     }

Here's what Peter says in the commit log when he replaced old-useless 
VCPU unplug code with new-working code in 2016:

     As the new code is using device_del all the implications of using it
     are present. Contrary to the device deletion code, the vcpu deletion
     code fails if the unplug request is not executed in time.

I have no clue why in the case of PCI devices libvirt is logging a 
warning and returning success, while in the case of VCPUs it is logging 
an error and returning failure. I think there may have originally been a 
stated/unstated assumption that the libvirt unplug APIs were 
synchronous, and both situations were just the result of later trying to 
cope with the reality that the operation is actually asynchronous.

>> At this moment, the only case I know of hotunplug operations timing out in
>> QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
>> unplug timeout is desirable mechanism for other machines/device types.
>>
>>> Putting the timeout in QEMU might be better.  QEMU might be in a better
>>> position to pick a suitable timeout.
>>>
>>>> It was the best thing we could come up with in the short term though.
>>>> Without the changes we're suggesting here, the guest can positively
>>>> assert the unplug is complete, but it has no way to signal failure.
>>>> So, without a timeout qemu is stuck waiting indefinitely (or at least
>>>> until the next machine reset) on the guest for an unplug that might
>>>> never complete.
>>>>
>>>> It's particularly nasty if the guest does really fail the hotplug
>>>> internally, but then conditions change so that the same unplug could
>>>> now succeed.  The unplug request is just a message - there's no guest
>>>> visible persistent state saying we want the device unplugged, so if
>>>> conditions change the guest won't then re-attempt the unplug.
>>>> Meanwhile the user can't re-attempt the device_del, because on the
>>>> qemu side it's still stuck in a pending unplug waiting on the guest.
>>>
>>> "Can't re-attempt" feels like a bug.  Let me explain.
> 
> You may be right.  Perhaps we should just send another unplug message
> if we get a device_del on something that's already pending deletion.
> AFAICT repeated unplug messages for the same device should be
> harmless.
> 
>> So, what David mentioned above is related to pseries internals I believe.
>>
>> Up to 5.2.0 the pseries machine was silently ignoring all 'device_del'
>> issued for devices that were in the middle of the unplug process, with the
>> exception of DIMMs where we bothered to throw an error message back to the
>> user.
>>
>> In 6.0.0 the code was reworked, and now we're always letting the user know
>> when the 'device_del' was ignored due to an ongoing hotunplug of the device.
>> And for CPUs we also tell the timeout remaining. We're still not sending
>> multiple hotunplug IRQ pulses to the guest, but at least the user is now
>> informed about it.
>>
>> As for the commit mentioned below:
>>
>>>
>>> Depending on the device, device_del can complete the unplug, or merely
>>> initiate it.  Documentation:
>>>
>>> # Notes: When this command completes, the device may not be removed from the
>>> #        guest.  Hot removal is an operation that requires guest cooperation.
>>> #        This command merely requests that the guest begin the hot removal
>>> #        process.  Completion of the device removal process is signaled with a
>>> #        DEVICE_DELETED event. Guest reset will automatically complete removal
>>> #        for all devices.
>>>
>>> This is not as accurate as it could be.  Behavior depends on the device.
>>>
>>> For some kinds of devices, the command completes the unplug before it
>>> sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
>>> event gets sent with a slight delay because device cleanup uses RCU.
>>>
>>> For others, notably PCI devices, the command only pokes the guest to do
>>> its bit.  QEMU reacts to the guest's actions, which eventually leads to
>>> the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
>>> play ball to the end, the event doesn't get sent.
>>>
>>> The "can't retry unplug" behavior is new.  Another device_del used to
>>> simply poke the guest again.
> 
> Maybe on x86.  I think a repeated device_del was at best a no-op on
> PAPR.
> 
>>> I think this regressed in commit
>>> cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
>>> Feels like a must-fix for 6.0.
>>
>>
>> This doesn't directly affect pseries code because we're not using
>> dev->pending_deleted_event to track the pending unplug status. We're
> 
> Huh... I didn't know about pending_deleted_event.  Maybe we *should*
> be using that.  Handling the migration will be painful, of course.
> 
>> using an internal flag that is related to the DRC (the 'hotplug state'
>> of the device).
>>
>> The commit seems a bit odd because it is making a change in the common code
>> inside qmp_device_del() based on a PCI-e specific behavior. In the end this
>> doesn't impact any other device but PCI-e (it is the only device that uses
>> dev->pending_deleted_event to mark the start and finish of the unplug process),
>> but it's not something that I would expect in that function.
>>
>> IMO this verification should be removed from qmp_device_del and moved to
>> pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe devices
>> without making assumptions about everything else.
>>
>>
>>>
>>>> As we're discussing we think we have a better way, but it relies on
>>>> guest changes, so we can't assume we already have that on the qemu
>>>> side.
>>>>
>>>>> I'm aware that there's always that special use case, that we haven't seen yet,
>>>>> where this assumption is no longer valid. The plan is to change the spec and the
>>>>> guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
>>>>> lands. For now, timing out the CPU hotunplug process when we're almost certain
>>>>> (but not 100%) that it failed in the guest is the best we can do.
>>>>>
>>>>> For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
>>>>> guest visible changes when we change how we're detecting the unplug errors.
>>>>>
>>>>>>> The second scenario is a memory hotunplug error. I found out that the pseries
>>>>>>> guest kernel does a reconfiguration step when re-attaching the DIMM right
>>>>>>> after refusing the hotunplug, and this reconfiguration is visible to QEMU.
>>>>>>> I proceeded to make the pseries machine detect this error case, rollback the
>>>>>>> unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
>>>>>>> by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
>>>>>>> well instead of the MEM specific one.
>>>>>>>
>>>>>>> This investigation and work in the mem hotunplug error path triggered a
>>>>>>> discussion in qemu-ppc, where we're considering whether we should do the same
>>>>>>> signalling the kernel does for the DIMM hotunplug error for all other device
>>>>>>> hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
>>>>>>> and it's currently a no-op. We would make a LOPAR spec change to make this an
>>>>>>> official hotunplug error report mechanism, and all pseries hotunplug operations,
>>>>>>> for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
>>>>>>>
>>>>>>> Granted, the spec change + Kernel change is not something that we will be able
>>>>>>> to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
>>>>>>> already in place would make it easier for the future as well.
>>>>>>>
>>>>>>>
>>>>>>> I have a doc draft of these changes/infos that I forgot to post. I would post
>>>>>>> it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
>>>>>>> information there as well. Does that work for you as far as documentation
>>>>>>> goes?
>>>>>>
>>>>>> A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
>>>>>> and device.
>>>>
>>>> Ack.  Fwiw, I don't think this can ever be more than a "best effort"
>>>> notification.  Even with a machine and OS that should support it, a
>>>> guest bug or hang could mean that it never acks *or* nacks an unplug
>>>> request.
>>>
>>> Since the success event is named DEVICE_DELETED, I'd name the
>>> probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
>>> statement of fact "still not deleted" instead of an error (that just
>>> might not be one).
>>
>>
>> "DEVICE_NOT_DELETED" sounds way better for what we want to express in the
>> pseries case when we can't be 100% sure of a guest side error. However,
>> there is at least one case where I'm sure of a guest side error (where I'm
>> using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that case.
>>
>>
>> Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can use
>> both in pseries-6.0.0.
>>
>>
>>
>>
>> Thanks,
>>
>>
>> DHB
>>
>>
>>>
>>>>>> I'm not asking you to implement it for all machines and devices.  But
>>>>>> I want its design to support growth towards that goal, and its
>>>>>> documentation reflect its current state.
>>>>>>
>>>>>> In particular, the doc comment in the QAPI schema should list the
>>>>>> limitation.  If the event is only implemented for LOPAR for now, say so.
>>>>>> If it's only implemented for certain devices, say so.
>>>>>>
>>>>>> Questions?
>>>>>
>>>>>
>>>>> Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
>>>>> in a way that it can be used by other machines in the future, and document
>>>>> where the event is being used ATM.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> DHB
>>>>>
>>>>>
>>>>>>
>>>>>
>>>
>>
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-12  8:12                 ` Markus Armbruster
@ 2021-03-19  7:55                   ` Markus Armbruster
  2021-03-22  6:39                   ` David Gibson
  1 sibling, 0 replies; 19+ messages in thread
From: Markus Armbruster @ 2021-03-19  7:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, Juan Quintela, qemu-ppc, Laine Stump,
	Igor Mammedov, Paolo Bonzini, David Gibson

Markus Armbruster <armbru@redhat.com> writes:

> David Gibson <david@gibson.dropbear.id.au> writes:
>
>> On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
>>> 
>>> 
>>> On 3/9/21 3:22 AM, Markus Armbruster wrote:
>>> > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
>>> > towards DeviceState member pending_deleted_event.
>>> > 
>>> > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
>>> > please loop in the right person.
>>> > 
>>> > David Gibson <david@gibson.dropbear.id.au> writes:
>>> > 
>>> > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>>> > > > 
>>> > > > 
>>> > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
>>> > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>> > > > > 
>>> > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
>>> > [...]
>>> > > > > > > We should document the event's reliability.  Are there unplug operations
>>> > > > > > > where we can't detect failure?  Are there unplug operations where we
>>> > > > > > > could, but haven't implemented the event?
>>> > > > > > 
>>> > > > > > The current version of the PowerPC spec that the pseries machine implements
>>> > > > > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
>>> > > > > > error back to the hypervisor. If something fails, the hypervisor is left
>>> > > > > > in the dark.
>>> > > > > > 
>>> > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>>> > > > > 
>>> > > > > Do you mean "unreliable way"?
>>> > > > 
>>> > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
>>> > > > way of getting the guest kernel to refuse the CPU hotunplug.
>>> > > 
>>> > > Right.  It's worth noting here that in the PAPR model, there are no
>>> > > "forced" unplugs.  Unplugs always consist of a request to the guest,
>>> > > which is then responsible for offlining the device and signalling back
>>> > > to the hypervisor that it's done with it.
>>> > > 
>>> > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
>>> > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
>>> > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
>>> > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
>>> > > > > > today where we want to send a QAPI event reporting the CPU hotunplug error,
>>> > > > > > but this event does not exist in QEMU ATM.
>>> > > > > 
>>> > > > > When the timeout kicks in, how can you know the operation failed?  You
>>> > > > > better be sure when you send an error event.  In other words: what
>>> > > > > prevents the scenario where the operation is much slower than you
>>> > > > > expect, the timeout expires, the error event is sent, and then the
>>> > > > > operation completes successfully?
>>> > > > 
>>> > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
>>> > > > if the guest is under heavy load. The timeout is set to 15 seconds.
>>> > > 
>>> > > Right.  We're well aware that a timeout is an ugly hack, since it's
>>> > > not really possible to distinguish it from a guest that's just being
>>> > > really slow.
>>> > 
>>> > As long as unplug failure cannot be detected reliably, we need a timeout
>>> > *somewhere*.  I vaguely recall libvirt has one.  Laine?
>>> 
>>> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
>>> the timeout makes more sense since it has more information about the
>>> conditions/mechanics involved.
>>
>> Right.  In particular, I can't really see how libvirt can fully
>> implement that timeout.  AFAIK qemu has no way of listing or
>> cancelling "in flight" unplug requests, so it's entirely possible that
> > the unplug could still complete after libvirt has "timed out".
>
> QEMU doesn't really keep track of "in flight" unplug requests, and as
> long as that's the case, even its timeout will have the same issue.

If we change QEMU to keep track of "in flight" unplug requests, then we
likely want QMP commands to query and cancel them.

Instead of inventing ad hoc commands, we should look into using the job
framework: query-jobs, job-cancel, ...  See qapi/job.json.

Bonus: we don't need new events, existing JOB_STATUS_CHANGE can do the
job (pardon the pun).
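The job-based flow suggested above can be pictured with a toy model. The class and state names below only loosely echo QEMU's qapi/job.json; this is illustrative Python, not QEMU code:

```python
import enum


class JobStatus(enum.Enum):
    # Loosely mirrors a subset of the job lifecycle in qapi/job.json.
    CREATED = "created"
    RUNNING = "running"
    CONCLUDED = "concluded"


class UnplugJob:
    """Hypothetical device unplug tracked as a cancelable job."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.status = JobStatus.CREATED
        self.error = None

    def start(self):
        self.status = JobStatus.RUNNING

    def guest_acks(self):
        # Guest signalled completion: the job concludes successfully.
        if self.status is JobStatus.RUNNING:
            self.status = JobStatus.CONCLUDED

    def cancel(self):
        # Management application issued job-cancel after its timeout.
        if self.status is JobStatus.RUNNING:
            self.error = "no response from guest"
            self.status = JobStatus.CONCLUDED


# Timed-out request: management cancels instead of guessing "failed".
stuck = UnplugJob("cpu-core1")
stuck.start()
stuck.cancel()

# Normal request: guest acks before any timeout fires.
ok = UnplugJob("cpu-core2")
ok.start()
ok.guest_acks()
```

With a shape like this, "timed out" stops being a guess: the management side either observes the job conclude or explicitly cancels it, and JOB_STATUS_CHANGE carries the transitions.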

[...]




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-12  8:12                 ` Markus Armbruster
  2021-03-19  7:55                   ` Markus Armbruster
@ 2021-03-22  6:39                   ` David Gibson
  2021-03-22 12:06                     ` Paolo Bonzini
  1 sibling, 1 reply; 19+ messages in thread
From: David Gibson @ 2021-03-22  6:39 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Juan Quintela, qemu-ppc, Laine Stump,
	Igor Mammedov, Paolo Bonzini


On Fri, Mar 12, 2021 at 09:12:53AM +0100, Markus Armbruster wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
> >> 
> >> 
> >> On 3/9/21 3:22 AM, Markus Armbruster wrote:
> >> > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> >> > towards DeviceState member pending_deleted_event.
> >> > 
> >> > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> >> > please loop in the right person.
> >> > 
> >> > David Gibson <david@gibson.dropbear.id.au> writes:
> >> > 
> >> > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
> >> > > > 
> >> > > > 
> >> > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
> >> > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> >> > > > > 
> >> > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> >> > [...]
> >> > > > > > > We should document the event's reliability.  Are there unplug operations
> >> > > > > > > where we can't detect failure?  Are there unplug operations where we
> >> > > > > > > could, but haven't implemented the event?
> >> > > > > > 
> >> > > > > > The current version of the PowerPC spec that the pseries machine implements
> >> > > > > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
> >> > > > > > error back to the hypervisor. If something fails, the hypervisor is left
> >> > > > > > in the dark.
> >> > > > > > 
> >> > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> >> > > > > 
> >> > > > > Do you mean "unreliable way"?
> >> > > > 
> >> > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
> >> > > > way of getting the guest kernel to refuse the CPU hotunplug.
> >> > > 
> >> > > Right.  It's worth noting here that in the PAPR model, there are no
> >> > > "forced" unplugs.  Unplugs always consist of a request to the guest,
> >> > > which is then responsible for offlining the device and signalling back
> >> > > to the hypervisor that it's done with it.
> >> > > 
> >> > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
> >> > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
> >> > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> >> > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
> >> > > > > > today where we want to send a QAPI event reporting the CPU hotunplug error,
> >> > > > > > but this event does not exist in QEMU ATM.
> >> > > > > 
> >> > > > > When the timeout kicks in, how can you know the operation failed?  You
> >> > > > > better be sure when you send an error event.  In other words: what
> >> > > > > prevents the scenario where the operation is much slower than you
> >> > > > > expect, the timeout expires, the error event is sent, and then the
> >> > > > > operation completes successfully?
> >> > > > 
> >> > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
> >> > > > if the guest is under heavy load. The timeout is set to 15 seconds.
> >> > > 
> >> > > Right.  We're well aware that a timeout is an ugly hack, since it's
> >> > > not really possible to distinguish it from a guest that's just being
> >> > > really slow.
> >> > 
> >> > As long as unplug failure cannot be detected reliably, we need a timeout
> >> > *somewhere*.  I vaguely recall libvirt has one.  Laine?
> >> 
> >> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
> >> the timeout makes more sense since it has more information about the
> >> conditions/mechanics involved.
> >
> > Right.  In particular, I can't really see how libvirt can fully
> > implement that timeout.  AFAIK qemu has no way of listing or
> > cancelling "in flight" unplug requests, so it's entirely possible that
> > > the unplug could still complete after libvirt has "timed out".
> 
> QEMU doesn't really keep track of "in flight" unplug requests, and as
> > long as that's the case, even its timeout will have the same issue.

Not generically, maybe.  In the PAPR code we effectively do, by means
of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
a mistake, but at the time I couldn't see how else to handle things.

Currently we will resolve all "in flight" requests at machine reset
time, effectively completing those requests.  Does that differ from
x86 behaviour?

Now that the timeout has been added, we do clear the unplug_requested
flag if we hit it, which effectively "cancels" the unplug request - I
believe if the guest attempts to complete the unplug later, we'll fail
it because that flag is no longer set.
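The unplug_requested handling described here can be illustrated with a small model (Python; the names echo the spapr DRC, but none of this is the actual QEMU implementation):

```python
class Drc:
    """Toy model of the 'unplug_requested' flag semantics: a timeout
    clears the flag, so a late guest completion is refused."""

    def __init__(self):
        self.attached = True
        self.unplug_requested = False

    def device_del(self):
        # device_del: remember the request and (notionally) notify the guest.
        self.unplug_requested = True

    def unplug_timeout(self):
        # Timeout fired: clear the flag, effectively cancelling the request.
        self.unplug_requested = False

    def guest_detach(self):
        # Guest tries to complete the unplug.  A stale completion,
        # arriving after the timeout cancelled the request, is refused.
        if not self.unplug_requested:
            return False
        self.attached = False
        self.unplug_requested = False
        return True


# Guest answers only after the timeout: the detach is rejected.
slow = Drc()
slow.device_del()
slow.unplug_timeout()
late = slow.guest_detach()

# Guest answers in time: the detach completes.
fast = Drc()
fast.device_del()
done = fast.guest_detach()
```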

> >> At this moment, the only case I know of hotunplug operations timing out in
> >> QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
> >> unplug timeout is a desirable mechanism for other machines/device types.
> 
> I think we first need to nail down what the even means.  It could mean
> "hot unplug definitely failed" or "hot unplug did not complete in time".
> 
> Any operation that is not instantaneous goes through a state where it
> neither succeeded nor failed.
> 
> When the duration of that state is unbounded, no timeout can get us out
> of it.  When exiting the state requires guest cooperation, its duration
> is unbounded.  "Definitely failed" semantics is unattainable.
> 
> It is attainable if we can somehow cancel the unplug, because that
> forces transition to "failed".

Right, we effectively can and do do this in the PAPR model.  When we
clear the unplug_requested flag on timeout, operations that the guest
would perform to implement the unplug will now fail.

> I suspect we have at least three kinds of unplugs: effectively
> instantaneous (command completes the job, e.g. USB), unbounded and not
> cancelable, unbounded and cancelable.
> 
> When we only know "did not complete in time", the management application
> needs a way to retry the unplug regardless of actual state.  If the
> unplug succeeded meanwhile (sending DEVICE_DELETED), the retry should
> fail.  If it failed meanwhile, start over.  If it's still in progress,
> do nothing.
> 
> We can choose to leave cancelling to the management application.  Ditch
> "definitely failed", report "did not complete in time".  The management
> application can then either cancel or retry.  The former succeeds if it
> canceled the unplug, fails if the unplug completed meanwhile (which also
> sent DEVICE_DELETED).
> 
> I think before we can continue reworking the code, we need consensus on
> a hot unplug state machine that works for all known cases, or at least
> the cases we have in QEMU.

Right..
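One way to picture the retry rules proposed in the quoted text, as a minimal sketch (the state names and the choice of exception are assumptions, not an agreed design):

```python
import enum


class UnplugState(enum.Enum):
    # One possible state machine for the "unbounded, cancelable"
    # kind of unplug; purely illustrative.
    REQUESTED = "requested"   # guest poked, outcome still unknown
    DELETED = "deleted"       # DEVICE_DELETED was sent
    FAILED = "failed"         # definite failure, or cancelled


def retry_unplug(state):
    """Retry semantics from the discussion: retrying on a deleted
    device is an error, a failed unplug starts over, and a retry
    while still in progress is a no-op."""
    if state is UnplugState.DELETED:
        raise LookupError("device no longer exists")
    # FAILED -> start over; REQUESTED -> effectively a no-op.
    return UnplugState.REQUESTED
```

The point of the sketch is that retry is safe regardless of the actual state, which is exactly what a management application needs when all it knows is "did not complete in time".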

> >> > Putting the timeout in QEMU might be better.  QEMU might be in a better
> >> > position to pick a suitable timeout.
> >> > 
> >> > > It was the best thing we could come up with in the short term though.
> >> > > Without the changes we're suggesting here, the guest can positively
> >> > > assert the unplug is complete, but it has no way to signal failure.
> >> > > So, without a timeout qemu is stuck waiting indefinitely (or at least
> >> > > until the next machine reset) on the guest for an unplug that might
> >> > > never complete.
> >> > > 
> >> > > It's particularly nasty if the guest does really fail the hotunplug
> >> > > internally, but then conditions change so that the same unplug could
> >> > > now succeed.  The unplug request is just a message - there's no guest
> >> > > visible persistent state saying we want the device unplugged, so if
> >> > > conditions change the guest won't then re-attempt the unplug.
> >> > > Meanwhile the user can't re-attempt the device_del, because on the
> >> > > qemu side it's still stuck in a pending unplug waiting on the guest.
> >> > 
> >> > "Can't re-attempt" feels like a bug.  Let me explain.
> >
> > You may be right.  Perhaps we should just send another unplug message
> > if we get a device_del on something that's already pending deletion.
> > AFAICT repeated unplug messages for the same device should be
> > harmless.
> 
> Consider physical PCI hot unplug.  You press a button next to the slot,
> wait for the (figuratively) green light, then pull out the card.  You
> can press the button as many times as you want.

Urgh... actually there's a whole other can of worms there.  With my
Kata hat on, I'm hitting horrible problems with PCI hotplug where
virtually pressing that button can race with the kernel's normal
device probing, with the end result that the kernel misinterprets the
button push as an unplug request instead of a hotplug request.

But.. actually, I don't think you can push the button as many times as
you want, at least with SHPC.  From looking at this code, I believe
pressing the button when in the transitional state *cancels* the
unplug request (and there's a 5s pause built into the spec, AFAICT
precisely to make such a cancel humanly possible).

> >> So, what David mentioned above is related to pseries internals I believe.
> >> 
> >> Up to 5.2.0 the pseries machine was silently ignoring all 'device_del'
> >> issued for devices that were in the middle of the unplug process, with the
> >> exception of DIMMs where we bothered to throw an error message back to the
> >> user.
> >> 
> >> In 6.0.0 the code was reworked, and now we're always letting the user know
> >> when the 'device_del' was ignored due to an ongoing hotunplug of the device.
> >> And for CPUs we also tell the timeout remaining. We're still not sending
> >> multiple hotunplug IRQ pulses to the guest, but at least the user is now
> >> informed about it.
> 
> This is possible because you know unplug is in progress.  I'm not sure
> we can know for all devices.
> 
> >> As for the commit mentioned below:
> >> 
> >> > 
> >> > Depending on the device, device_del can complete the unplug, or merely
> >> > initiate it.  Documentation:
> >> > 
> >> > # Notes: When this command completes, the device may not be removed from the
> >> > #        guest.  Hot removal is an operation that requires guest cooperation.
> >> > #        This command merely requests that the guest begin the hot removal
> >> > #        process.  Completion of the device removal process is signaled with a
> >> > #        DEVICE_DELETED event. Guest reset will automatically complete removal
> >> > #        for all devices.
> >> > 
> >> > This is not as accurate as it could be.  Behavior depends on the device.
> >> > 
> >> > For some kinds of devices, the command completes the unplug before it
> >> > sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
> >> > event gets sent with a slight delay because device cleanup uses RCU.
> >> > 
> >> > For others, notably PCI devices, the command only pokes the guest to do
> >> > its bit.  QEMU reacts to the guest's actions, which eventually leads to
> >> > the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
> >> > play ball to the end, the event doesn't get sent.
> >> > 
> >> > The "can't retry unplug" behavior is new.  Another device_del used to
> >> > simply poke the guest again.
> >
> > Maybe on x86.  I think a repeated device_del was at best a no-op on
> > PAPR.
> 
> There's certainly more than one way to skin this cat.  When QEMU knows
> the guest is still working on the unplug, poking the guest again is
> useless at best.

Right, but we don't know that the guest is still working on the
unplug.  We just know we notified it and it hasn't responded.  It
might still be working on it, it might have tried and failed (because
it has no way to report that), or it might have just lost the request
completely.

> >> > I think this regressed in commit
> >> > cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
> >> > Feels like a must-fix for 6.0.
> >> 
> >> 
> >> This doesn't directly affect pseries code because we're not using
> >> dev->pending_deleted_event to track the pending unplug status. We're
> >
> > Huh... I didn't know about pending_deleted_event.  Maybe we *should*
> > be using that.  Handling the migration will be painful, of course.
> 
> Paolo and I believe the code around @pending_deleted_event has become
> wrong.
> 
> @pending_deleted_event was created to get DEVICE_DELETED right for
> recursively deleted devices (commit 352e8da74).  It did *not* "track the
> pending unplug status".
> 
> It was later appropriated for other purposes (commit c000a9bd0,
> 9711cd0df, cce8944cc).  Paolo and I believe these are wrong.
> 
> >> using an internal flag that is related to the DRC (the 'hotplug state'
> >> of the device).
> >> 
> >> The commit seems a bit odd because it is making a change in the common code
> >> inside qmp_device_del() based on a PCI-e specific behavior. In the end this
> >> doesn't impact any other device but PCI-e (it is the only device that uses
> >> dev->pending_deleted_event to mark the start and finish of the unplug process),
> >> but it's not something that I might expect in that function.
> 
> Concur.
> 
> >> IMO this verification should be removed from qmp_device_del and moved to
> >> pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe devices
> >> without making assumptions about everything else.
> 
> A subset of {Igor, Julia, Juan} intends to look into this in time for 6.0.
> 
> >> > > As we're discussing we think we have a better way, but it relies on
> >> > > guest changes, so we can't assume we already have that on the qemu
> >> > > side.
> >> > > 
> >> > > > I'm aware that there's always that special use case, that we haven't seen yet,
> >> > > > where this assumption is no longer valid. The plan is to change the spec and the
> >> > > > guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
> >> > > > lands. For now, timing out the CPU hotunplug process when we're almost certain
> >> > > > (but not 100%) that it failed in the guest is the best we can do.
> >> > > > 
> >> > > > For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, avoiding
> >> > > > guest visible changes when we change how we're detecting the unplug errors.
> >> > > > 
> >> > > > > > The second scenario is a memory hotunplug error. I found out that the pseries
> >> > > > > > guest kernel does a reconfiguration step when re-attaching the DIMM right
> >> > > > > > after refusing the hotunplug, and this reconfiguration is visible to QEMU.
> >> > > > > > I proceeded to make the pseries machine detect this error case, roll back the
> >> > > > > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
> >> > > > > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
> >> > > > > > well instead of the MEM specific one.
> >> > > > > > 
> >> > > > > > This investigation and work in the mem hotunplug error path triggered a
> >> > > > > > discussion in qemu-ppc, where we're considering whether we should do the same
> >> > > > > > signalling the kernel does for the DIMM hotunplug error for all other device
> >> > > > > > hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
> >> > > > > > and it's currently a no-op. We would make a LOPAR spec change to make this an
> >> > > > > > official hotunplug error report mechanism, and all pseries hotunplug operations,
> >> > > > > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.
> >> > > > > > 
> >> > > > > > Granted, the spec change + Kernel change is not something that we will be able
> >> > > > > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
> >> > > > > > already in place would make it easier for the future as well.
> >> > > > > > 
> >> > > > > > 
> >> > > > > > I have a doc draft of these changes/infos that I forgot to post. I would post
> >> > > > > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
> >> > > > > > information there as well. Does that work for you as far as documentation
> >> > > > > > goes?
> >> > > > > 
> >> > > > > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
> >> > > > > and device.
> >> > > 
> >> > > Ack.  Fwiw, I don't think this can ever be more than a "best effort"
> >> > > notification.  Even with a machine and OS that should support it, a
> >> > > guest bug or hang could mean that it never acks *or* nacks an unplug
> >> > > request.
> >> > 
> >> > Since the success event is named DEVICE_DELETED, I'd name the
> >> > probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
> >> > statement of fact "still not deleted" instead of an error (that just
> >> > might not be one).
> >> 
> >> 
> >> "DEVICE_NOT_DELETED" sounds way better for what we want to express in the
> >> pseries case when we can't be 100% sure of a guest side error. However,
> >> there is at least one case where I'm sure of a guest side error (where I'm
> >> using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that case.
> >> 
> >> 
> >> Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can use
> >> both in pseries-6.0.0.
> 
> Separate QMP events are advisable when their meaning is different enough
> for a management application to want to tell them apart.
> 
> [...]
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-12 13:38                 ` Laine Stump
@ 2021-03-22  6:43                   ` David Gibson
  0 siblings, 0 replies; 19+ messages in thread
From: David Gibson @ 2021-03-22  6:43 UTC (permalink / raw)
  To: Laine Stump
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Markus Armbruster, qemu-ppc,
	Paolo Bonzini, Igor Mammedov


On Fri, Mar 12, 2021 at 08:38:47AM -0500, Laine Stump wrote:
> On 3/11/21 8:19 PM, David Gibson wrote:
> > On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
> > > 
> > > 
> > > On 3/9/21 3:22 AM, Markus Armbruster wrote:
> > > > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> > > > towards DeviceState member pending_deleted_event.
> > > > 
> > > > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> > > > please loop in the right person.
> > > > 
> > > > David Gibson <david@gibson.dropbear.id.au> writes:
> > > > 
> > > > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
> > > > > > 
> > > > > > 
> > > > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
> > > > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > > > > > > 
> > > > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> > > > [...]
> > > > > > > > > We should document the event's reliability.  Are there unplug operations
> > > > > > > > > where we can't detect failure?  Are there unplug operations where we
> > > > > > > > > could, but haven't implemented the event?
> > > > > > > > 
> > > > > > > > The current version of the PowerPC spec that the pseries machine implements
> > > > > > > > (LOPAR) does not provide a way for the virtual machine to report a hotunplug
> > > > > > > > error back to the hypervisor. If something fails, the hypervisor is left
> > > > > > > > in the dark.
> > > > > > > > 
> > > > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> > > > > > > 
> > > > > > > Do you mean "unreliable way"?
> > > > > > 
> > > > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
> > > > > > way of getting the guest kernel to refuse the CPU hotunplug.
> > > > > 
> > > > > Right.  It's worth noting here that in the PAPR model, there are no
> > > > > "forced" unplugs.  Unplugs always consist of a request to the guest,
> > > > > which is then responsible for offlining the device and signalling back
> > > > > to the hypervisor that it's done with it.
> > > > > 
> > > > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
> > > > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
> > > > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> > > > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
> > > > > > > > today where we want to send a QAPI event reporting the CPU hotunplug error,
> > > > > > > > but this event does not exist in QEMU ATM.
> > > > > > > 
> > > > > > > When the timeout kicks in, how can you know the operation failed?  You
> > > > > > > better be sure when you send an error event.  In other words: what
> > > > > > > prevents the scenario where the operation is much slower than you
> > > > > > > expect, the timeout expires, the error event is sent, and then the
> > > > > > > operation completes successfully?
> > > > > > 
> > > > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
> > > > > > if the guest is under heavy load. The timeout is set to 15 seconds.
> > > > > 
> > > > > Right.  We're well aware that a timeout is an ugly hack, since it's
> > > > > not really possible to distinguish it from a guest that's just being
> > > > > really slow.
> > > > 
> > > > As long as unplug failure cannot be detected reliably, we need a timeout
> > > > *somewhere*.  I vaguely recall libvirt has one.  Laine?
> > > 
> > > Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
> > > the timeout makes more sense since it has more information about the
> > > conditions/mechanics involved.
> > 
> > Right.  In particular, I can't really see how libvirt can fully
> > implement that timeout.  AFAIK qemu has no way of listing or
> > cancelling "in flight" unplug requests, so it's entirely possible that
> > the unplug could still complete after libvirt has "timed out".
> 
> (I'm purposefully not trying to understand the full context of all of this,
> as I'm mostly unconnected right now, and only popped in at the mention of my
> name and CC. So forgive me if I'm missing some of the details of the
> conversation - I'm only here to give context about libvirt's timeout during
> unplug)
> 
> I didn't remember there being any sort of timeout for unplugs in libvirt, so
> I went back into the code and found that there *is* a timeout (I'd just
> forgotten that I'd ever seen it), but (for PCI devices, which is the only
> hotplug/unplug code I have any real familiarity with) it is just a short (10
> seconds for PPC, 5 seconds for other platforms) to see if the unplug can
> complete before the public API returns; if there is a "timeout" then we
> still return success (after logging a warning message) but the device stays
> in the domain's device list, and nothing else is changed unless/until we
> receive a DEVICE_DELETED event from qemu (completely asynchronous -
> libvirt's API has long ago returned success) - only then do we remove the
> device from libvirt's domain status. libvirt won't/can't ever go back and
> retroactively fail the API that's already completed successfully though :-)

Huh.  That seems... kinda bogus.  It really seems to me you need to
provide either a synchronous interface: wait indefinitely for
completion - or an asynchronous one: always return quickly and the user
has to wait for the completion event.  Having this hybrid just seems
confusing.
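The purely asynchronous shape described here could look like the following sketch (a hypothetical API, not libvirt's real interface):

```python
class AsyncUnplugAPI:
    """Sketch of a fully asynchronous unplug interface: the request
    call returns immediately, and the outcome arrives only through a
    completion callback (the DEVICE_DELETED event)."""

    def __init__(self):
        self._pending = {}

    def request_unplug(self, dev, on_deleted):
        # Always returns at once; "success" here only means "requested".
        self._pending[dev] = on_deleted

    def on_device_deleted(self, dev):
        # Event handler: complete the matching pending request, if any.
        cb = self._pending.pop(dev, None)
        if cb is not None:
            cb(dev)


deleted = []
api = AsyncUnplugAPI()
api.request_unplug("net0", deleted.append)
# ... arbitrarily later, the DEVICE_DELETED event arrives for net0:
api.on_device_deleted("net0")
```

Either this shape or a blocking wait-for-completion avoids the hybrid: the caller never has to guess whether a "successful" return meant the device is actually gone.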

> For VCPUs (which I guess is what you're exclusively talking about here) it
> looks like libvirt waits for the same 5-10 seconds (during the unplug API
> call) and if it hasn't received DEVICE_DELETED, then it returns an error:
> 
>     if ((rc = qemuDomainWaitForDeviceRemoval(vm)) <= 0) {
>         if (rc == 0)
>             virReportError(VIR_ERR_OPERATION_TIMEOUT, "%s",
>                            _("vcpu unplug request timed out. Unplug result "
>                              "must be manually inspected in the domain"));
> 
>         goto cleanup;
>     }
> 
> Here's what Peter says in the commit log when he replaced old-useless VCPU
> unplug code with new-working code in 2016:
> 
>     As the new code is using device_del all the implications of using it
>     are present. Contrary to the device deletion code, the vcpu deletion
>     code fails if the unplug request is not executed in time.
> 
> I have no clue why in the case of PCI devices libvirt is logging a warning
> and returning success, while in the case of VCPUs it is logging an error and
> returning failure. I think there may have originally been a stated/unstated
> assumption that the libvirt unplug APIs were synchronous, and both
> situations were just the result of later trying to cope with the reality
> that the operation is actually asynchronous.

That seems pretty plausible.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-22  6:39                   ` David Gibson
@ 2021-03-22 12:06                     ` Paolo Bonzini
  2021-03-23  3:33                       ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-03-22 12:06 UTC (permalink / raw)
  To: David Gibson, Markus Armbruster
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Juan Quintela, qemu-ppc, Laine Stump,
	Igor Mammedov

On 22/03/21 07:39, David Gibson wrote:
>> QEMU doesn't really keep track of "in flight" unplug requests, and as
>> long as that's the case, even its timeout will have the same issue.
> Not generically, maybe.  In the PAPR code we effectively do, by means
> of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
> a mistake, but at the time I couldn't see how else to handle things.

No, that's good.  x86 also tracks it in some registers that are 
accessible from the ACPI firmware.  See "PCI slot removal notification" 
in docs/specs/acpi_pci_hotplug.txt.

> Currently we will resolve all "in flight" requests at machine reset
> time, effectively completing those requests.  Does that differ from
> x86 behaviour?

IIRC on x86 the requests are instead cancelled, but I'm not 100% sure.

Paolo




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-22 12:06                     ` Paolo Bonzini
@ 2021-03-23  3:33                       ` David Gibson
  2021-03-23 13:06                         ` Igor Mammedov
  0 siblings, 1 reply; 19+ messages in thread
From: David Gibson @ 2021-03-23  3:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Markus Armbruster, Juan Quintela,
	qemu-ppc, Laine Stump, Igor Mammedov

On Mon, Mar 22, 2021 at 01:06:53PM +0100, Paolo Bonzini wrote:
> On 22/03/21 07:39, David Gibson wrote:
> > > QEMU doesn't really keep track of "in flight" unplug requests, and as
> > > long as that's the case, its timeout even will have the same issue.
> > Not generically, maybe.  In the PAPR code we effectively do, by means
> > of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
> > a mistake, but at the time I couldn't see how else to handle things.
> 
> No, that's good.  x86 also tracks it in some registers that are accessible
> from the ACPI firmware.  See "PCI slot removal notification" in
> docs/specs/acpi_pci_hotplug.txt.
> 
> > Currently we will resolve all "in flight" requests at machine reset
> > time, effectively completing those requests.  Does that differ from
> > x86 behaviour?
> 
> IIRC on x86 the requests are instead cancelled, but I'm not 100%
> sure.

Ah... we'd better check that and try to make ppc consistent with
whatever it does.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-23  3:33                       ` David Gibson
@ 2021-03-23 13:06                         ` Igor Mammedov
  2021-03-29  5:35                           ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Igor Mammedov @ 2021-03-23 13:06 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Markus Armbruster, Juan Quintela,
	qemu-ppc, Laine Stump, Paolo Bonzini, jfreimann

On Tue, 23 Mar 2021 14:33:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Mar 22, 2021 at 01:06:53PM +0100, Paolo Bonzini wrote:
> > On 22/03/21 07:39, David Gibson wrote:  
> > > > QEMU doesn't really keep track of "in flight" unplug requests, and as
> > > > long as that's the case, its timeout even will have the same issue.  
> > > Not generically, maybe.  In the PAPR code we effectively do, by means
> > > of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
> > > a mistake, but at the time I couldn't see how else to handle things.  
> > 
> > No, that's good.  x86 also tracks it in some registers that are accessible
> > from the ACPI firmware.  See "PCI slot removal notification" in
> > docs/specs/acpi_pci_hotplug.txt.
> >   
> > > Currently we will resolve all "in flight" requests at machine reset
> > > time, effectively completing those requests.  Does that differ from
> > > x86 behaviour?  
> > 
> > IIRC on x86 the requests are instead cancelled, but I'm not 100%
> > sure.  
> 
> Ah... we'd better check that and try to make ppc consistent with
> whatever it does.
> 

Sorry for being late to the discussion. I can't say much about all the possible ways to unplug
a PCI device (aside from it being a complicated mess), but hopefully I can shed some light on
the state/behavior of the ACPI-based methods.

* x86 - ACPI based PCI hotplug
 Its sole reason for existing is that Windows does not support SHPC (conventional PCI),
 and it looks like, 'thanks' to buggy Windows drivers, we will have to use it for
 PCI-E as well (Julia is working on that).
 The HW registers described in docs/specs/acpi_pci_hotplug.txt are our own invention;
 they help raise the standard ACPI 'device check' and 'eject request' events when the
 guest executes AML bytecode. In principle the guest could report plug/unplug progress
 via the ACPI _OST method (including failure/completion), but given my experience with
 how the Windows PCI core has worked so far, it is probably not used by it (honestly,
 I haven't explored the possibility, due to lack of interest in it).
 
 Regarding unplug - on device_del QEMU raises an SCI interrupt; after that the process is
 asynchronous. When the ACPI interpreter gets the SCI, it raises the respective 'eject
 request' (_EJ0) event for the devices flagged in the PCI_DOWN_BASE register. After getting
 the event, the guest OS may decide to eject the PCI device (which clears the device's bit
 in PCI_DOWN_BASE) or refuse to do so. There is no progress tracking in QEMU for the
 failure case, and the device's bit in PCI_DOWN_BASE is kept set. On the next
 device_(add|del) (for any PCI device) the guest will see it again and will retry the
 removal.
 Also, if the guest reboots with any bits in PCI_DOWN_BASE set, the respective devices
 will be deleted on the QEMU side.
 There is no other way to cancel a removal request in PCI_DOWN_BASE, aside from the guest
 explicitly ejecting the device on request, or implicitly on reboot.
 IMHO:
     The sticky nature of PCI_(UP|DOWN)_BASE is more trouble than help, but it has been
     there since SeaBIOS times, so it's ABI we are stuck with. If I were re-implementing
     it now, I would use a one-shot event that's cleared once the guest reads it, and if
     possible implement _OST status reporting (if it works on Windows).
 As it stands now, once device_del is issued the user can't know when the PCI device will
 be removed. No timeout will help with that.
 
* ACPI CPU/Memory hotplug
 Events triggered by device_del are one-shot; afterwards the guest may report progress to
 QEMU using the _OST method (qapi_event_send_acpi_device_ost) (I know libvirt was aware
 of it, but I don't recall what it does with it). So QEMU could send an '_UNPLUG_ERROR'
 event to the user if the guest decides to report one. But instead of duplicating all the
 possible events from the spec, QEMU passes the _OST arguments [1] on as-is, for the user
 to interpret as described by the standard.
 That said, _OST is not 100% reliable: depending on the Windows or Linux kernel version
 used, they might skip reporting some events. I don't recall the exact state at the time
 I was testing it, so I'd call status reporting support 'best effort'.
 Also, it doesn't feature the pending-removal-on-reboot behavior that our ACPI PCI
 hotplug code has.
 So with a well-behaving guest the user will get notified about failure or device removal
 (when the guest is able to run its code); for broken guests I'm more inclined to say
 'use a fixed guest' to get sane behavior.
 The policy for the user is to retry on failure (there are no bad side effects on retry).
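
 A management application consuming this would see it as an ACPI_DEVICE_OST QMP event
 carrying the raw _OST arguments. The sketch below shows one way to interpret such an
 event; the field names follow QEMU's ACPIOSTInfo QAPI type and the status-code meanings
 come from the ACPI spec's _OST table, but treat the specific values as illustrative:

```python
# Sketch: classify the _OST report that qapi_event_send_acpi_device_ost()
# forwards to the user as an ACPI_DEVICE_OST QMP event.

EJECT_REQUEST = 0x03          # _OST source event: "Ejection Request"
EJECT_FAILURES = {            # non-success _OST status codes for ejection
    0x80: "ejection not supported by OSPM",
    0x81: "device in use by application",
    0x82: "device busy",
    0x83: "ejection dependency is busy or not supported",
    0x84: "ejection in progress (pending)",
}

def classify_ost(event):
    """Return a short human-readable verdict for an ACPI_DEVICE_OST event."""
    info = event["data"]["info"]
    if info["source"] != EJECT_REQUEST:
        return "not an eject-request report"
    status = info["status"]
    if status == 0:
        return "guest acknowledged the eject request"
    return "unplug failed: " + EJECT_FAILURES.get(
        status, "unknown status %#x" % status)

# A guest refusing a DIMM unplug might produce something like:
evt = {"event": "ACPI_DEVICE_OST",
       "data": {"info": {"slot": "2", "slot-type": "DIMM",
                         "source": EJECT_REQUEST, "status": 0x81}}}
print(classify_ost(evt))   # unplug failed: device in use by application
```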

I think any kind of timeout here is inherently racy: in the async hot[un]plug use case,
all the user has to do is sufficiently over-commit the host (or run it nested).
So it's just a question of how long it takes for a user to come back with a bug report.

* As far as I'm aware, the mentioned 'pending_deleted_event' is there to make the
  transparent failover magic happen (CCing Jens; Michael might also know how it works).

* SHPC & PCI-E have their own sets of unplug quirks, which I know little about, but Julia
  worked with Michael on fixing PCI-E bugs (mostly related to how Windows drivers handle
  unplug; some are not possible to fix, hence the decision to add ACPI-based hotplug to
  Q35 as a workaround). So they might know the specifics.

1) ACPI spec: _OST (OSPM Status Indication)




* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-23 13:06                         ` Igor Mammedov
@ 2021-03-29  5:35                           ` David Gibson
  2021-03-29  9:38                             ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: David Gibson @ 2021-03-29  5:35 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Markus Armbruster, Juan Quintela,
	qemu-ppc, Laine Stump, Paolo Bonzini, jfreimann

On Tue, Mar 23, 2021 at 02:06:36PM +0100, Igor Mammedov wrote:
> On Tue, 23 Mar 2021 14:33:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Mar 22, 2021 at 01:06:53PM +0100, Paolo Bonzini wrote:
> > > On 22/03/21 07:39, David Gibson wrote:  
> > > > > QEMU doesn't really keep track of "in flight" unplug requests, and as
> > > > > long as that's the case, its timeout even will have the same issue.  
> > > > Not generically, maybe.  In the PAPR code we effectively do, by means
> > > > of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
> > > > a mistake, but at the time I couldn't see how else to handle things.  
> > > 
> > > No, that's good.  x86 also tracks it in some registers that are accessible
> > > from the ACPI firmware.  See "PCI slot removal notification" in
> > > docs/specs/acpi_pci_hotplug.txt.
> > >   
> > > > Currently we will resolve all "in flight" requests at machine reset
> > > > time, effectively completing those requests.  Does that differ from
> > > > x86 behaviour?  
> > > 
> > > IIRC on x86 the requests are instead cancelled, but I'm not 100%
> > > sure.  
> > 
> > Ah... we'd better check that and try to make ppc consistent with
> > whatever it does.
> > 
> 
> Sorry for being late to discussion, I can't say much for all possible ways to unplug
> PCI device (aside that it's a complicated mess), but hopefully I can shed some light on
> state/behavior of ACPI based methods.

Thanks, this is quite useful.

> * x86 - ACPI based PCI hotplug
>  Its sole existence was dictated by Windows not supporting SHPC (conventional PCI),
>  and it looks like 'thanks' to Windows buggy drivers we would have to use it for
>  PCI-E  as well (Julia works on it).

AIUI, Julia et al. are working on ACPI PCI hotplug, but it's not yet
merged.  Is that correct?

>  HW registers described in docs/specs/acpi_pci_hotplug.txt are our own invention,
>  they help to raise standard ACPI 'device check' and 'eject request' events when
>  guest executes AML bytecode. Potentially there is possibility for guest to report
>  plug/unplug progress via ACPI _OST method (including failure/completion) but given
>  my experience with how Windows PCI core worked so far that may be not used by it
>  (honestly I haven't tried to explore possibility, due to lack of interest in it).
>  
>  regarding unplug - on device_del QEMU raises SCI interrupt, after this the process is
>  asynchronous. When ACPI interpreter gets SCI it sends a respective _EJ0 event to
>  devices mentioned in PCI_DOWN_BASE register. After getting the event, guest OS may

Ok.  Is PCI_DOWN_BASE an actual emulated hardware register, or one of
the invented ones you mention above?

Either way, I'm assuming there must be a PCI_DOWN_BASE register for
each PCI bus, yes?  How is that implemented for PCI to PCI bridges?

>  decide to eject PCI device (which causes clearing of device's bit in PCI_DOWN_BASE)
>  or refuse to do it. There is no any progress tracking in QEMU for failure and device's
>  bit in PCI_DOWN_BASE is kept set. On the next device_(add|del) (for any PCI device)
>  guest will see it again and will retry removal.

Ok.  Sounds like the bits in PCI_DOWN_BASE are roughly equivalent to
our 'unplug_requested' flag.  In our case it's not guest visible on a
continuing basis, though.  For PAPR hotplug, we have an explicit event
queue, and each event just pertains to a specific device (or cpu, or
memory block).

>  Also if guest reboots with any bits in PCI_DOWN_BASE set, respective devices will
>  be deleted on QEMU side.

Ok, that's the same as how we behave: on reset anything with
'unplug_requested' will be deleted.

>  There is no other way to cancel removal request in PCI_DOWN_BASE, aside of explicitly
>  ejecting device on guest request or implicitly on reboot.
>  IMHO:
>      Sticky nature of PCI_(UP|DOWN)_BASE is more trouble than help but its there since
>      SeaBios times so it's ABI we are stuck with. If I were re-implementing it now,
>      I would use one shot event that's cleared once guest read it and if possible
>      implement _OST status reporting (if it works on Windows).
>  As it stands now, once device_del is issued one user can't know when PCI device will be
>  removed. No timeout will help with it.

Ok.  What happens if you retry a device_del on the same device,
because the guest hasn't responded to the first one?

> * ACPI CPU/Memory hotplug
>  Events triggered by device_del are one shot, then guest may report progress to QEMU using
>  _OST method (qapi_event_send_acpi_device_ost) (I know that libvirt were aware of it,
>  but I don't recall what it does with it). So QEMU might send '_UNPLUG_ERROR' event to
>  user if guest decides so. But instead of duplicating all possible events from spec
>  QEMU will pass _OST arguments [1] as is for user to interpret as described by standard.
>  Though I'd say _OST is not 100% reliable, depending used Windows or linux kernel version
>  they might skip on reporting some events. But I don't recall exact state at the time I've
>  been testing it. So I'd call status reporting support as 'best effort'.

Right.  That more or less has to be the case for anything guest
reported.  Even if our guest kernel is supposed to have support for
reporting back, it might fail to do so due to a bug or an unrelated hard freeze.

>  Also it doesn't feature pending removal on reboot, that our ACPI PCI hotplug code has.

Oh, that's interesting.  On PAPR we do have pending removal on reboot
for all hotplug types.  I guess if the behaviour isn't even consistent
within x86 variants, it doesn't matter very much if PAPR is consistent
with x86.

>  So with well behaving guest user will get notified about failure or device removal (when
>  guest is able to run its code), for broken guests I'm more inclined to say 'use fixed guest'
>  to get sane behavior.
>  Policy for user is to retry on failure (there is no bad side effects on retry).

Ok.  That's different for us - we will fail attempts to retry in qemu,
which I now strongly suspect is a mistake.  I think that shouldn't be
too complicated to change, though.
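
The change being suggested here could look something like the sketch below. It is
purely illustrative pseudologic, not QEMU code: the class and function names are
invented, and only the idea matters - a repeated device_del on a device whose
'unplug_requested' flag is already set would re-notify the guest instead of failing:

```python
# Sketch of retry-friendly unplug-request handling for a PAPR-style DRC.

class Drc:
    """Minimal stand-in for a DRC with an 'unplug_requested' flag."""
    def __init__(self):
        self.unplug_requested = False

def device_del(drc, notify_guest, retry_ok=True):
    if drc.unplug_requested:
        if not retry_ok:
            # current behavior: a second request is rejected
            raise RuntimeError("unplug already in progress")
        notify_guest(drc)        # proposed: just re-send the event
        return "retried"
    drc.unplug_requested = True
    notify_guest(drc)
    return "requested"

sent = []
drc = Drc()
assert device_del(drc, sent.append) == "requested"
assert device_del(drc, sent.append) == "retried"   # no error on retry
assert len(sent) == 2
```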

> I think that any kind of timeout here is inherently racy, in async hot[un]plug usecase,
> all user has to do is just sufficiently over-commit host (or run it
> nested).

Ah, I see your point.  No matter how highly the guest prioritizes
handling the unplug event, a slow-running host could mean that it
doesn't complete the handling in time.

Yeah, that's a pretty strong argument against any sort of timeout.

> So it's just a question of how long it will take for user to come back with a bug report. 
> 
> * As far as I'm aware mentioned 'pending_deleted_event' is there to make transparent
>   failover magic happen (CCing Jens, also Michael might know how it works)
> 
> * SHPC & PCI-E has its own set of unplug quirks, which I know little about but Julia worked
>   with Michael on fixing PCI-E bugs (mostly related how Windows drivers handle unplug,
>   some are not possible to fix, hence decision to add ACPI based hotplug to Q35 as workaround).
>   So they might know specifics.

Ok.  I know I've run into a bunch of complications with SHPC and
pciehp for unrelated work (Kata & SR-IOV).  Julia and/or Michael, if
you're able to answer these two questions for SHPC and pciehp, that
would be pretty useful:

   a) If you device_del, but the guest doesn't respond to the request
      at all, what happens?  If you reboot while in that state, does
      the unplug get completed, or cancelled?

   b) Can you safely retry a device_del which is "pending"
      (i.e. issued by qemu, but the guest hasn't responded yet)

> 1) ACPI spec: _OST (OSPM Status Indication)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC] adding a generic QAPI event for failed device hotunplug
  2021-03-29  5:35                           ` David Gibson
@ 2021-03-29  9:38                             ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-03-29  9:38 UTC (permalink / raw)
  To: David Gibson, Igor Mammedov
  Cc: Michael S. Tsirkin, michael.roth, Daniel Henrique Barboza,
	Julia Suvorova, qemu-devel, Markus Armbruster, Juan Quintela,
	qemu-ppc, Laine Stump, jfreimann

On 29/03/21 07:35, David Gibson wrote:
>>   regarding unplug - on device_del QEMU raises SCI interrupt, after this the process is
>>   asynchronous. When ACPI interpreter gets SCI it sends a respective _EJ0 event to
>>   devices mentioned in PCI_DOWN_BASE register. After getting the event, guest OS may
>
> Ok.  Is PCI_DOWN_BASE an actual emulated hardware register, or one of
> the invented ones you mention above?

It's invented.  Which is perhaps not the best word, because the point of
ACPI is exactly to let vendors provide a standardized interface that
abstracts whatever registers they invent.  Even when QEMU emulates actual
chipset registers, operating systems usually access those registers
through the ACPI interpreter (i.e. through bytecode provided by QEMU).

> Either way, I'm assuming there must be a PCI_DOWN_BASE register for
> each PCI bus, yes?  How is that implemented for PCI to PCI bridges?

Yes (though they are multiplexed on the same I/O port using a 
bus-selection register).
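
Mechanically, the multiplexing works roughly like the sketch below: each bus keeps its
own pending-eject bitmap, but the guest reaches them all through one shared I/O window
by first writing a bus number to a select register. Names here are invented for
illustration and do not match QEMU's actual implementation details:

```python
# Sketch of bus-selection multiplexing: one I/O window, many per-bus bitmaps.

class HotplugRegs:
    def __init__(self):
        self.down_per_bus = {}   # bus number -> pending-eject bitmap
        self.selected_bus = 0

    def write_bsel(self, bus):
        # Guest selects which bus the shared window refers to.
        self.selected_bus = bus

    def read_down(self):
        # Reads return the bitmap of whichever bus was last selected.
        return self.down_per_bus.get(self.selected_bus, 0)

regs = HotplugRegs()
regs.down_per_bus = {0: 0b0100, 2: 0b0001}  # pending ejects on two buses
regs.write_bsel(2)
assert regs.read_down() == 0b0001
regs.write_bsel(0)
assert regs.read_down() == 0b0100
```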

>>  Also it doesn't feature pending removal on reboot, that our ACPI PCI hotplug code has.

Ok, so that is what I was remembering.

Paolo




end of thread, other threads:[~2021-03-29  9:39 UTC | newest]

Thread overview: 19+ messages
2021-03-05 18:16 [RFC] adding a generic QAPI event for failed device hotunplug Daniel Henrique Barboza
2021-03-06  6:57 ` Markus Armbruster
2021-03-08 14:22   ` Daniel Henrique Barboza
2021-03-08 17:04     ` Markus Armbruster
2021-03-08 18:01       ` Daniel Henrique Barboza
2021-03-09  3:22         ` David Gibson
2021-03-09  6:22           ` Markus Armbruster
2021-03-11 20:50             ` Daniel Henrique Barboza
2021-03-12  1:19               ` David Gibson
2021-03-12  8:12                 ` Markus Armbruster
2021-03-19  7:55                   ` Markus Armbruster
2021-03-22  6:39                   ` David Gibson
2021-03-22 12:06                     ` Paolo Bonzini
2021-03-23  3:33                       ` David Gibson
2021-03-23 13:06                         ` Igor Mammedov
2021-03-29  5:35                           ` David Gibson
2021-03-29  9:38                             ` Paolo Bonzini
2021-03-12 13:38                 ` Laine Stump
2021-03-22  6:43                   ` David Gibson
