From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:37572) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1amxfP-0001iJ-UQ for qemu-devel@nongnu.org; Mon, 04 Apr 2016 02:07:17 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1amxfN-0000Zy-Sn for qemu-devel@nongnu.org; Mon, 04 Apr 2016 02:07:15 -0400 Date: Mon, 4 Apr 2016 14:33:37 +1000 From: David Gibson Message-ID: <20160404043337.GL16485@voom.fritz.box> References: <1459463257-3137-1-git-send-email-mdroth@linux.vnet.ibm.com> <20160401023325.GK416@voom.redhat.com> <20160401034724.3884.29404@loki> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="dfmC41YZQlborXoK" Content-Disposition: inline In-Reply-To: <20160401034724.3884.29404@loki> Subject: Re: [Qemu-devel] [PATCH for-2.6] spapr_drc: enable immediate detach for unsignalled devices List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Michael Roth Cc: qemu-ppc@nongnu.org, sbhat@linux.vnet.ibm.com, qemu-devel@nongnu.org, bharata@linux.vnet.ibm.com --dfmC41YZQlborXoK Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 31, 2016 at 10:47:24PM -0500, Michael Roth wrote: > Quoting David Gibson (2016-03-31 21:33:25) > > On Thu, Mar 31, 2016 at 05:27:37PM -0500, Michael Roth wrote: > > > Currently spapr doesn't support "aborting" hotplug of PCI > > > devices by allowing device_del to immediately remove the > > > device if we haven't signalled the presence of the device > > > to the guest. > > >=20 > > > In the past this wasn't an issue, since we always immediately > > > signalled device attach and simply relied on full guest-aware > > > add->remove path for device removal. However, as of 788d259, > > > we now defer signalling for PCI functions until function 0 > > > is attached, so now we need to deal with these "abort" operations > > > for cases where a user hotplugs a non-0 function, then opts to > > > remove it prior hotplugging function 0. Currently they'd have to > > > reboot before the unplug completed. PCIe multifunction hotplug > > > does not have this requirement however, so from a management > > > implementation perspective it would be good to address this within > > > the same release as 788d259. > > >=20 > > > We accomplish this by simply adding a 'signalled' flag to track > > > whether a device hotplug event has been sent to the guest. If it > > > hasn't, we allow immediate removal under the assumption that the > > > guest will not be using the device. Devices present at boot/reset > > > time are also assumed to be 'signalled'. > > >=20 > > > For CPU/memory/etc, signalling will still happen immediately > > > as part of device_add, so only PCI functions should be affected. > > >=20 > > > Cc: bharata@linux.vnet.ibm.com > > > Cc: david@gibson.dropbear.id.au > > > Cc: sbhat@linux.vnet.ibm.com > > > Cc: qemu-ppc@nongnu.org > > > Signed-off-by: Michael Roth > >=20 > > So, I'm disinclined to take this during the hard freeze, without some > > evidence that it fixes a problem case that's really being hit in > > practice. >=20 > The basic situation is that previously we had: >=20 > device_add virtio-net-pci,id=3Dhp5.2,addr=3D0x5.2 > a1) plug device into slot > a2) signal guest of attach > a3) guest adds device > device_del hp5.2=20 > d1) signal guest of detach > d2) wait for guest to clean up device and signal completion > d3) unplug device from slot >=20 > After 788d259 we have: >=20 > device_add virtio-net-pci,id=3Dhp5.2,addr=3D0x5.2 > a1) plug device into slot > a2) defer signalling until we see 0x5.0 added > device_del hp5.2=20 > d1) defer signalling removal until we see 0x5.0 deleted > d2) wait for guest to clean up device and signal completion >=20 > But d2) never happens because the guest was never signalled that > the device was added in the first place. >=20 > A real world situation where this might occur is admittedly a bit > contrived, but in practice I can certainly see it happening: >=20 > a) user hotplugs multifunction virtio-net-pci to slot 5, starting > with function 3 at 0x5.3 > b) user notices they made a mistake, for instance, they forget enable > vhost in the accompanying netdev > c) user attempts to unplug function and "abort" the operation, but > device does not go away >=20 > Shiva (on cc:) is working on the libvirt implementation for this > and hit this in testing. Since it differs in behavior from the PCIe > workflow it was based on (where aborts are explicitly allowed), and > causes a minor regression from 2.5, I thought it might be worth > considering for 2.6, but I can certainly understand if you opt to > hold off till 2.7. Ah.. yes, I hadn't registered the fact that this was a regression. Given that, I think it's probably reasonable for 2.6. >=20 > >=20 > > > --- > > > hw/ppc/spapr_drc.c | 34 ++++++++++++++++++++++++++++++++++ > > > hw/ppc/spapr_events.c | 11 +++++++++++ > > > include/hw/ppc/spapr_drc.h | 2 ++ > > > 3 files changed, 47 insertions(+) > > >=20 > > > diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c > > > index ef063c0..5568e44 100644 > > > --- a/hw/ppc/spapr_drc.c > > > +++ b/hw/ppc/spapr_drc.c > > > @@ -173,6 +173,12 @@ static void set_configured(sPAPRDRConnector *drc) > > > drc->configured =3D true; > > > } > > > =20 > > > +/* has the guest been notified of device attachment? */ > > > +static void set_signalled(sPAPRDRConnector *drc) > > > +{ > > > + drc->signalled =3D true; > > > +} > > > + > > > /* > > > * dr-entity-sense sensor value > > > * returned via get-sensor-state RTAS calls > > > @@ -355,6 +361,7 @@ static void attach(sPAPRDRConnector *drc, DeviceS= tate *d, void *fdt, > > > drc->fdt =3D fdt; > > > drc->fdt_start_offset =3D fdt_start_offset; > > > drc->configured =3D coldplug; > > > + drc->signalled =3D coldplug; > > > =20 > > > object_property_add_link(OBJECT(drc), "device", > > > object_get_typename(OBJECT(drc->dev)), > > > @@ -371,6 +378,26 @@ static void detach(sPAPRDRConnector *drc, Device= State *d, > > > drc->detach_cb =3D detach_cb; > > > drc->detach_cb_opaque =3D detach_cb_opaque; > > > =20 > > > + /* if we've signalled device presence to the guest, or if the gu= est > > > + * has gone ahead and configured the device (via manually-execut= ed > > > + * device add via drmgr in guest, namely), we need to wait > > > + * for the guest to quiesce the device before completing detach. > > > + * Otherwise, we can assume the guest hasn't seen it and complet= e the > > > + * detach immediately. Note that there is a small race window > > > + * just before, or during, configuration, which is this context > > > + * refers mainly to fetching the device tree via RTAS. > > > + * During this window the device access will be arbitrated by > > > + * associated DRC, which will simply fail the RTAS calls as inva= lid. > > > + * This is recoverable within guest and current implementations = of > > > + * drmgr should be able to cope. > >=20 > > Sorry, I'm really not following this description of the race, or > > seeing how it relates to allowing removal of "half plugged" devices. > >=20 >=20 > There's 2 stages we enter before the device is fully accessible within > the guest: >=20 > 1) The 'signalled' state introduced by this patch, which we enter as soon > as we send a hotplug event to the guest. >=20 > 2) The 'configuration' phase, which consists of RTAS calls against the > DRC and power domain for the slot to modify things like hotplug LED > indicator, allocation/isolation state, and fetching the device's device > tree. Once the device is done fetching the device tree it's considered > to be in 'configured' state according to state diagram in PAPR v2.6 13.4. > (note there's no accesses to the device over PCI bus during this time, > just RTAS calls). >=20 > We only allow immediate unplug/abort if we haven't signalled the device > to the guest, *and* if the device hasn't reached the 'configured' state. >=20 > Normally we wouldn't reach the configuration phase until after we enter > the signalled state, and since being in the signalled state prevents > immediate unplug, there's no race there. >=20 > The only race is if we haven't signalled the guest yet (for instance, > because the above example where we've only hotplugged function 0x5.3 > and are waiting for function 0x5.0 before sending the hotplug event), > but someone manually ran 'drmgr -c pci -a -s ...' in the > guest (perhaps because they were confused why the device didn't show > up). Then, while that command was running through the process of > bringing the device into the 'configured' state, they or someone else > executed device_del on the device. Since it hasn't been signalled, > and since it hasn't reached the configured state, it would be > immediately unplugged by QEMU. But since drmgr might not have finished > adding it yet, we have a case where the guest is attempting to access > the device while it's being removed. This is the race window. It's > pretty small and unlikely, but I figured it was worth mentioning. >=20 > I reason in the comment that since the 'configuration' phase consists > only of RTAS calls, and no actual accesses over the PCI bus, that this > would still result in the guest failing gracefully via an RTAS error > for executing an RTAS call against a DRC that no longer has a device > associated with in. Current implementations of drmgr would simply > abort the add operation and resume normal function without ever > adding the device to PCI subsystem. So during the race window an > 'abort' should be fairly safe. >=20 > > > + */ > > > + if (!drc->signalled && !drc->configured) { > > > + /* if the guest hasn't seen the device we can't rely on it to > > > + * set it back to an isolated state via RTAS, so do it here = manually > > > + */ > > > + drc->isolation_state =3D SPAPR_DR_ISOLATION_STATE_ISOLATED; > > > + } > > > + > > > if (drc->isolation_state !=3D SPAPR_DR_ISOLATION_STATE_ISOLATED)= { > > > DPRINTFN("awaiting transition to isolated state before remov= al"); > > > drc->awaiting_release =3D true; > > > @@ -409,6 +436,7 @@ static void reset(DeviceState *d) > > > { > > > sPAPRDRConnector *drc =3D SPAPR_DR_CONNECTOR(d); > > > sPAPRDRConnectorClass *drck =3D SPAPR_DR_CONNECTOR_GET_CLASS(drc= ); > > > + sPAPRDREntitySense state; > > > =20 > > > DPRINTFN("drc reset: %x", drck->get_index(drc)); > > > /* immediately upon reset we can safely assume DRCs whose devices > > > @@ -436,6 +464,11 @@ static void reset(DeviceState *d) > > > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STAT= E_UNUSABLE); > > > } > > > } > > > + > > > + drck->entity_sense(drc, &state); > > > + if (state =3D=3D SPAPR_DR_ENTITY_SENSE_PRESENT) { > > > + drck->set_signalled(drc); > > > + } > > > } > > > =20 > > > static void realize(DeviceState *d, Error **errp) > > > @@ -594,6 +627,7 @@ static void spapr_dr_connector_class_init(ObjectC= lass *k, void *data) > > > drck->attach =3D attach; > > > drck->detach =3D detach; > > > drck->release_pending =3D release_pending; > > > + drck->set_signalled =3D set_signalled; > > > /* > > > * Reason: it crashes FIXME find and document the real reason > > > */ > > > diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c > > > index 39f4682..be8de63 100644 > > > --- a/hw/ppc/spapr_events.c > > > +++ b/hw/ppc/spapr_events.c > > > @@ -387,6 +387,13 @@ static void spapr_powerdown_req(Notifier *n, voi= d *opaque) > > > qemu_irq_pulse(xics_get_qirq(spapr->icp, spapr->check_exception_= irq)); > > > } > > > =20 > > > +static void spapr_hotplug_set_signalled(uint32_t drc_index) > > > +{ > > > + sPAPRDRConnector *drc =3D spapr_dr_connector_by_index(drc_index); > > > + sPAPRDRConnectorClass *drck =3D SPAPR_DR_CONNECTOR_GET_CLASS(drc= ); > > > + drck->set_signalled(drc); > > > +} > > > + > > > static void spapr_hotplug_req_event(uint8_t hp_id, uint8_t hp_action, > > > sPAPRDRConnectorType drc_type, > > > uint32_t drc) > > > @@ -453,6 +460,10 @@ static void spapr_hotplug_req_event(uint8_t hp_i= d, uint8_t hp_action, > > > =20 > > > rtas_event_log_queue(RTAS_LOG_TYPE_HOTPLUG, new_hp, true); > > > =20 > > > + if (hp->hotplug_action =3D=3D RTAS_LOG_V6_HP_ACTION_ADD) { > > > + spapr_hotplug_set_signalled(drc); > > > + } > > > + > > > qemu_irq_pulse(xics_get_qirq(spapr->icp, spapr->check_exception_= irq)); > > > } > > > =20 > > > diff --git a/include/hw/ppc/spapr_drc.h b/include/hw/ppc/spapr_drc.h > > > index 7e56347..fa21ba0 100644 > > > --- a/include/hw/ppc/spapr_drc.h > > > +++ b/include/hw/ppc/spapr_drc.h > > > @@ -151,6 +151,7 @@ typedef struct sPAPRDRConnector { > > > bool configured; > > > =20 > > > bool awaiting_release; > > > + bool signalled; > > > =20 > > > /* device pointer, via link property */ > > > DeviceState *dev; > > > @@ -188,6 +189,7 @@ typedef struct sPAPRDRConnectorClass { > > > spapr_drc_detach_cb *detach_cb, > > > void *detach_cb_opaque, Error **errp); > > > bool (*release_pending)(sPAPRDRConnector *drc); > > > + void (*set_signalled)(sPAPRDRConnector *drc); > > > } sPAPRDRConnectorClass; > > > =20 > > > sPAPRDRConnector *spapr_dr_connector_new(Object *owner, > >=20 >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --dfmC41YZQlborXoK Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBAgAGBQJXAe6hAAoJEGw4ysog2bOSBzcP/3h0LCJBANctK6EQUM+Dkcjf PSjwT/rrUsvLpeb33g+Y9tOcqbKNp0/D9IoO7M3RAXxgKb/3qjSrJilY3w8spk8c E1xxIJzWpUZNbCLYlvEdTFDB+HWftsNXti2RmZR3Tp60o49vIODgBcx6HIdMTMA+ j97OjpyZcDtWXv5Bd0ByzrS2XrRoTU/HYq3f7qof3keI6Mscsr3bDh8911V4DcLN cqSd1K1JGzzMKD+AfB9tH1TZODx739hDSamrSqZhkZaYHfBZD/cz9pIGBPwvQZTf BpUALYPpaq2yLERqjBsVw/2AWMdSk+vlYrvUQtKo3qeBAW6PiGthdZNdswONscgv fW83J6Nn+Tzqj48t7hHfxU5qF2O2cvix1yKtd6qXP1PstjIQoVoLi3i5Zr63c825 xFlP6noiItcSNEXqoDbhML0HLKxizbMJDHXNm09ehmaRy4n5u955/HDhfD3FEAm6 /p+tMl+OiQPNXbdFqOjXiddZUoURMvlvkD4T8YV5dwbLTiyJv6Qn5wUieykAsfHU c41ks45X3V/RUkPX5sd5jzUJ2q9A8b5ggZ2ZfPMgaZFejOWz9B5KwbOk8b4aL3bK JtTTir/DLXRLIyO55FUww9SzkfliwibA11CENsWG3W9v5q5dHnvXq+unVM8+ZPsa +AnEEQg4l1WcDhtqONO2 =ayPl -----END PGP SIGNATURE----- --dfmC41YZQlborXoK--