qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Henrique Barboza <danielhb413@gmail.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	michael.roth@amd.com,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"qemu-ppc@nongnu.org" <qemu-ppc@nongnu.org>,
	Igor Mammedov <imammedo@redhat.com>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [RFC] adding a generic QAPI event for failed device hotunplug
Date: Mon, 8 Mar 2021 11:22:54 -0300	[thread overview]
Message-ID: <d9567bf3-8740-e8fe-b29b-a3b0ebdb5809@gmail.com> (raw)
In-Reply-To: <877dmkrcpl.fsf@dusky.pond.sub.org>



On 3/6/21 3:57 AM, Markus Armbruster wrote:
> Cc: ACPI maintainers for additional expertise.
> 
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> Hi,
>>
>> Recent changes in pseries code (not yet pushed, available at David's
>> ppc-for-6.0) are using the QAPI event MEM_UNPLUG_ERROR to report memory
>> hotunplug errors in the pseries machine.
>>
>> The pseries machine is also using a timeout to cancel CPU hotunplugs that
>> takes too long to finish (in which we're assuming a guest side error) and
>> it would be desirable to also send a QAPI event for this case as well.
>>
>> At this moment, there is no "CPU_UNPLUG_ERROR" in QAPI (guess ACPI doesn't
>> need it).
> 
> I see two interpretations of "ACPI doesn't need":
> 
> 1. Unplug can't fail, or QEMU can't detect failure.  Michael, Igor?
> 
> 2. Management applications haven't needed to know badly enough to
> implement an event.
> 
>>            Before sending patches to implement this new event I had a talk
>> with David Gibson and he suggested that, instead of adding a new CPU_UNPLUG_ERROR
>> event, we could add a generic event (e.g. DEVICE_UNPLUG_ERROR) that can be
>> used by the pseries machine in both error scenarios (MEM and CPU).
> 
> Good point.  One general event is better than two special ones that
> could easily grow siblings.
> 
>> This could also be used by x86 as well, although I believe the use of
>> MEM_UNPLUG_ERROR would need to be kept for awhile to avoid breaking ABI.
> 
> Yes.  Our rules for interface deprecation apply.
> 
>> Any suggestions/comments?
> 
> We should document the event's reliability.  Are there unplug operations
> where we can't detect failure?  Are there unplug operations where we
> could, but haven't implemented the event?

The current version of the PowerPC spec that the pseries machine implements
(LOPAR) does not predict a way for the virtual machine to report a hotunplug
error back to the hypervisor. If something fails, the hypervisor is left
in the dark.

What happened in the 6.0.0 dev cycle is that we faced a reliable way of
making CPU hotunplug fail in the guest (trying to hotunplug the last online
CPU) and the pseries machine was unprepared for dealing with it. We ended up
implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
that the CPU hotunplug failed in the guest. This is the first scenario we have
today where we want to send a QAPI event informing the CPU hotunplug error,
but this event does not exist in QEMU ATM.

The second scenario is a memory hotunplug error. I found out that the pseries
guest kernel does a reconfiguration step when re-attaching the DIMM right
after refusing the hotunplug, and this reconfiguration is visible to QEMU.
I proceeded to make the pseries machine detect this error case, rollback the
unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already covered
by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in this case as
well instead of the MEM specific one.

This investigation and work in the mem hotunplug error path triggered a
discussion in qemu-ppc, where we're considering whether we should do the same
signalling the kernel does for the DIMM hotunplug error for all other device
hotunplug errors, given that the reconfiguration per se is not forbidden by LOPAR
and it's currently a no-op. We would make a LOPAR spec change to make this an
official hotunplug error report mechanism, and all pseries hotunplug operations,
for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the error path.

Granted, the spec change + Kernel change is not something that we will be able
to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI event
already in place would make it easier for the future as well.


I have a doc draft of these changes/infos that I forgot to post. I would post
it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI events
information there as well. Does that work for you as far as documentation
goes?



DHB


> 
> The fewer exceptions, the better, of course.
> 


  reply	other threads:[~2021-03-08 14:42 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-05 18:16 [RFC] adding a generic QAPI event for failed device hotunplug Daniel Henrique Barboza
2021-03-06  6:57 ` Markus Armbruster
2021-03-08 14:22   ` Daniel Henrique Barboza [this message]
2021-03-08 17:04     ` Markus Armbruster
2021-03-08 18:01       ` Daniel Henrique Barboza
2021-03-09  3:22         ` David Gibson
2021-03-09  6:22           ` Markus Armbruster
2021-03-11 20:50             ` Daniel Henrique Barboza
2021-03-12  1:19               ` David Gibson
2021-03-12  8:12                 ` Markus Armbruster
2021-03-19  7:55                   ` Markus Armbruster
2021-03-22  6:39                   ` David Gibson
2021-03-22 12:06                     ` Paolo Bonzini
2021-03-23  3:33                       ` David Gibson
2021-03-23 13:06                         ` Igor Mammedov
2021-03-29  5:35                           ` David Gibson
2021-03-29  9:38                             ` Paolo Bonzini
2021-03-12 13:38                 ` Laine Stump
2021-03-22  6:43                   ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d9567bf3-8740-e8fe-b29b-a3b0ebdb5809@gmail.com \
    --to=danielhb413@gmail.com \
    --cc=armbru@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=imammedo@redhat.com \
    --cc=michael.roth@amd.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).