From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 25 May 2016 16:34:37 +1000
From: David Gibson
Message-ID: <20160525063437.GN17226@voom.fritz.box>
References: <1462344751-28281-1-git-send-email-aik@ozlabs.ru>
 <1462344751-28281-2-git-send-email-aik@ozlabs.ru>
 <20160505163941.7628431c@t450s.home>
 <0f2ca845-0c9a-5446-ea69-371ea866319e@ozlabs.ru>
 <20160513162453.0d367e95@t450s.home>
In-Reply-To: <20160513162453.0d367e95@t450s.home>
Subject: Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
To: Alex Williamson
Cc: Alexey Kardashevskiy, qemu-devel@nongnu.org, qemu-ppc@nongnu.org, Alexander Graf, Paolo Bonzini

On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> On Fri, 13 May 2016 17:16:48 +1000
> Alexey Kardashevskiy wrote:
>
> > On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > > On Wed, 4 May 2016 16:52:13 +1000
> > > Alexey Kardashevskiy wrote:
> > >
> > >> This postpones VFIO container deinitialization to let region_del()
> > >> callbacks (called via vfio_listener_release) do proper clean up
> > >> while the group is still attached to the container.
> > >
> > > Any mappings within the container should clean themselves up when the
> > > container is deprivileged by removing the last group in the kernel.  Is
> > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > bug, or that our QEMU side structures get all out of whack if we let
> > > that happen?
> >
> > My mailbase got corrupted, missed that.
> >
> > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory:
> > Notify IOMMU about starting/stopping being used by VFIO"; I should have put
> > 01/19 and 02/19 right before 17/19, sorry about that.
>
> Which I object to, it's just ridiculous to have vfio start/stop
> callbacks in a set of generic iommu region ops.

It's ugly, but I don't actually see a better way to do this (the
general concept of having vfio start/stop callbacks, that is, not the
specifics of the patches).

The fact is that how we implement the guest side IOMMU *does* need to
change depending on whether VFIO devices are present or not.  That's
due essentially to incompatibilities between a couple of kernel
mechanisms.  Which in itself is ugly, but nonetheless real.

A (usually blank) vfio on/off callback in the guest side IOMMU ops
seems like the least-bad way to handle this.

> > Every reboot the spapr machine removes all (i.e. one or two) windows and
> > creates the default one.
> >
> > I do this by memory_region_del_subregion(iommu_mr) +
> > memory_region_add_subregion(iommu_mr), which gets translated to
> > VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via
> > vfio_memory_listener if there is VFIO; no direct calls from spapr to vfio
> > => cool.
> > During the machine reset, the VFIO device is there with the
> > container and groups attached, at some point with no windows.
> >
> > Now to VFIO plug/unplug.
> >
> > When VFIO plug happens, vfio_memory_listener is created, region_add() is
> > called, and the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
> > Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow.  If
> > region_del() is not called when the container is being destroyed (as before
> > this patchset), then the kernel cleans and destroys windows when
> > close(container->fd) is called or when qemu is killed (and this fd is
> > naturally closed).  I hope this answers the comment from 02/19.
> >
> > So far so good (right?)
> >
> > However, I also have a guest view of the TCE table; this is what the guest
> > sees and what emulated PCI devices use.  This guest view is either
> > allocated in KVM (so H_PUT_TCE can be handled quickly right in the host
> > kernel, even in real mode) or in userspace (the VFIO case).
> >
> > I generally want the guest view to be in KVM.  However, when I plug VFIO,
> > I have to move the table to userspace.  When I unplug VFIO, I want to do
> > the opposite, so I need a way to tell spapr that it can move the table.
> > region_del() seemed a natural way of doing this, as region_add() is already
> > doing the opposite part.
> >
> > With this patchset, each IOMMU MR gets a usage counter: region_add() does
> > +1, region_del() does -1 (yeah, not extremely optimal during reset).  When
> > the counter goes from 0 to 1, the vfio_start() hook is called; when the
> > counter becomes 0, vfio_stop() is called.  Note that we may have multiple
> > VFIO containers on the same PHB.
> >
> > Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
> > decrement steps in vfio_disconnect_container().  And I still cannot move
> > counting from region_add() to vfio_connect_container(), so there will be
> > asymmetry, which I am fine with.  I am just checking here - what would be
> > the best approach?
>
> You're imposing on other iommu models (type1) that in order to release
> a container we first deregister the listener, which un-plays all of
> the mappings within that region.  That's inefficient when we can simply
> unset the container and move on.  So you're imposing an inefficiency on
> a separate vfio iommu model for the bookkeeping of your own.  I don't
> think that's a reasonable approach.  Has it even been tested how that
> affects type1 users?  When a container is closed, clearly it shouldn't
> be contributing to reference counts, so it seems like there must be
> other ways to handle this.

My first guess is to agree, but I'll look at that more carefully when
I actually get to the patch doing that.

What I really don't understand about this one is what the
group<->container connection - an entirely host side matter - has to
do with the reference counting here, which is per *guest* side IOMMU.
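For context, the usage-counter scheme described above boils down to
something like the minimal sketch below.  The type and function names
(GuestIOMMUState, giommu_vfio_get/giommu_vfio_put) are illustrative
only, not the actual patch code; the real series hangs the start/stop
hooks off the guest IOMMU region ops.

/* Minimal sketch of the +1/-1 bookkeeping, with the on/off hooks firing
 * only on the 0->1 and 1->0 transitions.  Illustrative names, not QEMU code. */
typedef struct GuestIOMMUState GuestIOMMUState;

struct GuestIOMMUState {
    int vfio_users;                         /* VFIO containers currently using this IOMMU MR */
    void (*vfio_start)(GuestIOMMUState *s); /* e.g. move the guest TCE table to userspace */
    void (*vfio_stop)(GuestIOMMUState *s);  /* e.g. move the guest TCE table back into KVM */
};

/* Would be called from the VFIO listener's region_add(). */
static void giommu_vfio_get(GuestIOMMUState *s)
{
    if (s->vfio_users++ == 0 && s->vfio_start) {
        s->vfio_start(s);                   /* first VFIO user: switch table to userspace */
    }
}

/* Would be called from region_del(), or from whatever symmetric teardown
 * path runs while the group is still attached to the container. */
static void giommu_vfio_put(GuestIOMMUState *s)
{
    if (--s->vfio_users == 0 && s->vfio_stop) {
        s->vfio_stop(s);                    /* last VFIO user gone: switch table back to KVM */
    }
}

On that model, the question in this subthread is just where the put side
runs when a container goes away: via region_del() before the group is
detached (patches 01/19 and 02/19), or repeated by hand in
vfio_disconnect_container().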
>
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy
> > >> ---
> > >>  hw/vfio/common.c | 22 +++++++++++++++-------
> > >>  1 file changed, 15 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index fe5ec6a..0b40262 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
> > >>  {
> > >>      VFIOContainer *container = group->container;
> > >>
> > >> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> -        error_report("vfio: error disconnecting group %d from container",
> > >> -                     group->groupid);
> > >> -    }
> > >> -
> > >>      QLIST_REMOVE(group, container_next);
> > >> +
> > >> +    if (QLIST_EMPTY(&container->group_list)) {
> > >> +        VFIOGuestIOMMU *giommu;
> > >> +
> > >> +        vfio_listener_release(container);
> > >> +
> > >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> > >> +            memory_region_unregister_iommu_notifier(&giommu->n);
> > >> +        }
> > >> +    }
> > >> +
> > >>      group->container = NULL;
> > >> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> +        error_report("vfio: error disconnecting group %d from container",
> > >> +                     group->groupid);
> > >> +    }
> > >>
> > >>      if (QLIST_EMPTY(&container->group_list)) {
> > >>          VFIOAddressSpace *space = container->space;
> > >>          VFIOGuestIOMMU *giommu, *tmp;
> > >>
> > >> -        vfio_listener_release(container);
> > >>          QLIST_REMOVE(container, next);
> > >>
> > >>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> > >> -            memory_region_unregister_iommu_notifier(&giommu->n);
> > >>              QLIST_REMOVE(giommu, giommu_next);
> > >>              g_free(giommu);
> > >>          }
> > >
> > > I'm not spotting why this is a 2-pass process vs simply moving the
> > > existing QLIST_EMPTY cleanup above the ioctl.  Thanks,
> >
>

-- 
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson