Re: [RFC] adding a generic QAPI event for failed device hotunplug

From: David Gibson <david@gibson.dropbear.id.au>
To: Laine Stump <laine@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	michael.roth@amd.com,
	Daniel Henrique Barboza <danielhb413@gmail.com>,
	Julia Suvorova <jusual@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Markus Armbruster <armbru@redhat.com>,
	"qemu-ppc@nongnu.org" <qemu-ppc@nongnu.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [RFC] adding a generic QAPI event for failed device hotunplug
Date: Mon, 22 Mar 2021 17:43:24 +1100	[thread overview]
Message-ID: <YFg8jCvFBcRUOrWS@yekko.fritz.box> (raw)
In-Reply-To: <70a596e0-102c-60ce-ccf7-6c961fa5eec3@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 7262 bytes --]

On Fri, Mar 12, 2021 at 08:38:47AM -0500, Laine Stump wrote:
> On 3/11/21 8:19 PM, David Gibson wrote:
> > On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
> > > 
> > > 
> > > On 3/9/21 3:22 AM, Markus Armbruster wrote:
> > > > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> > > > towards DeviceState member pending_deleted_event.
> > > > 
> > > > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> > > > please loop in the right person.
> > > > 
> > > > David Gibson <david@gibson.dropbear.id.au> writes:
> > > > 
> > > > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
> > > > > > 
> > > > > > 
> > > > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
> > > > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > > > > > > 
> > > > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> > > > [...]
> > > > > > > > > We should document the event's reliability.  Are there unplug operations
> > > > > > > > > where we can't detect failure?  Are there unplug operations where we
> > > > > > > > > could, but haven't implemented the event?
> > > > > > > > 
> > > > > > > > The current version of the PowerPC spec that the pseries machine implements
> > > > > > > > (LOPAR) does not predict a way for the virtual machine to report a hotunplug
> > > > > > > > error back to the hypervisor. If something fails, the hypervisor is left
> > > > > > > > in the dark.
> > > > > > > > 
> > > > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
> > > > > > > 
> > > > > > > Do you mean "unreliable way"?
> > > > > > 
> > > > > > I guess a better word would be 'reproducible', as in we discovered a reproducible
> > > > > > way of getting the guest kernel to refuse the CPU hotunplug.
> > > > > 
> > > > > Right.  It's worth noting here that in the PAPR model, there are no
> > > > > "forced" unplugs.  Unplugs always consist of a request to the guest,
> > > > > which is then resposible for offlining the device and signalling back
> > > > > to the hypervisor that it's done with it.
> > > > > 
> > > > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the last online
> > > > > > > > CPU) and the pseries machine was unprepared for dealing with it. We ended up
> > > > > > > > implementing a hotunplug timeout and, if the timeout kicks in, we're assuming
> > > > > > > > that the CPU hotunplug failed in the guest. This is the first scenario we have
> > > > > > > > today where we want to send a QAPI event informing the CPU hotunplug error,
> > > > > > > > but this event does not exist in QEMU ATM.
> > > > > > > 
> > > > > > > When the timeout kicks in, how can you know the operation failed?  You
> > > > > > > better be sure when you send an error event.  In other words: what
> > > > > > > prevents the scenario where the operation is much slower than you
> > > > > > > expect, the timeout expires, the error event is sent, and then the
> > > > > > > operation completes successfully?
> > > > > > 
> > > > > > A CPU hotunplug in a pseries guest takes no more than a couple of seconds, even
> > > > > > if the guest is under heavy load. The timeout is set to 15 seconds.
> > > > > 
> > > > > Right.  We're well aware that a timeout is an ugly hack, since it's
> > > > > not really possible to distinguish it from a guest that's just being
> > > > > really slow.
> > > > 
> > > > As long as unplug failure cannot be detected reliably, we need a timeout
> > > > *somewhere*.  I vaguely recall libvirt has one.  Laine?
> > > 
> > > Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU doing
> > > the timeout makes more sense since it has more information about the
> > > conditions/mechanics involved.
> > 
> > Right.  In particular, I can't really see how libvirt can fully
> > implement that timeout.  AFAIK qemu has no way of listing or
> > cancelling "in flight" unplug requests, so it's entirely possible that
> > the unplug could still complete after libvirt's has "timed out".
> 
> (I'm purposefully not trying to understand the full context of all of this,
> as I'm mostly unconnected right now, and only popped in at the mention of my
> name and CC. So forgive me if I'm missing some of the details of the
> conversation - I'm only here to give context about libvirt's timeout during
> unplug)
> 
> I didn't remember there being any sort of timeout for unplugs in libvirt, so
> I went back into the code and found that there *is* a timeout (I'd just
> forgotten that I'd ever seen it), but (for PCI devices, which is the only
> hotplug/unplug code I have any real familiarity with) it is just a short (10
> seconds for PPC, 5 seconds for other platforms) to see if the unplug can
> complete before the public API returns; if there is a "timeout" then we
> still return success (after logging a warning message) but the device stays
> in the domain's device list, and nothing else is changed unless/until we
> receive a DEVICE_DELETED event from qemu (completely asynchronous -
> libvirt's API has long ago returned success) - only then do we remove the
> device from libvirt's domain status. libvirt won't/can't ever go back and
> retroactively fail the API that's already completed successfully though :-)

Huh.  That seems... kinda bogus.  It really seems to me you need to
provide either a synchronous interface: wait indefinitely for
completion - or an asynchronous one: always return quickly and user
has to wait for the completion event.  Having this hybrid just seems
confusing.

> For VCPUs (which I guess is what you're exclusively talking about here) it
> looks like libvirt waits for the same 5-10 seconds (during the unplug API
> call) and if it hasn't received DEVICE_DELETED, then it returns an error:
> 
>     if ((rc = qemuDomainWaitForDeviceRemoval(vm)) <= 0) {
>         if (rc == 0)
>             virReportError(VIR_ERR_OPERATION_TIMEOUT, "%s",
>                            _("vcpu unplug request timed out. Unplug result "
>                              "must be manually inspected in the domain"));
> 
>         goto cleanup;
>     }
> 
> Here's what Peter says in the commit log when he replaced old-useless VCPU
> unplug code with new-working code in 2016:
> 
>     As the new code is using device_del all the implications of using it
>     are present. Contrary to the device deletion code, the vcpu deletion
>     code fails if the unplug request is not executed in time.
> 
> I have no clue why in the case of PCI devices libvirt is logging a warning
> and returning success, while in the case of VCPUs it is logging an error and
> returning failure. I think there may have originally been a stated/unstated
> assumption that the libvirt unplug APIs were synchronous, and both
> situations were just the result of later trying to cope with the reality
> that the operation is actually asynchronous.

That seems pretty plausible.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]