From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-x244.google.com (mail-pf0-x244.google.com [IPv6:2607:f8b0:400e:c00::244]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3rTFY30ZZ4zDq5d for ; Tue, 14 Jun 2016 13:31:02 +1000 (AEST) Received: by mail-pf0-x244.google.com with SMTP id 66so10002193pfy.1 for ; Mon, 13 Jun 2016 20:31:02 -0700 (PDT) Subject: Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE To: David Gibson References: <20160309054544.GM22546@voom.fritz.box> <56DFEACC.7030508@ozlabs.ru> <20160310052108.GV22546@voom.fritz.box> <56E1FEBE.5090602@ozlabs.ru> <20160315060400.GK15272@voom.fritz.box> <15389a41428.27cb.1ca38dd7e845b990cd13d431eb58563d@ozlabs.ru> <20160321051932.GJ23586@voom.redhat.com> <56F0932F.2050708@ozlabs.ru> <20160323030341.GT23586@voom.redhat.com> <367698b4-40eb-ec26-5154-b22d672322b7@ozlabs.ru> <20160610065025.GR9226@voom.fritz.box> Cc: "linuxppc-dev@lists.ozlabs.org" , Paul Mackerras , Alex Williamson , kvm-ppc , kvm@vger.kernel.org From: Alexey Kardashevskiy Message-ID: Date: Tue, 14 Jun 2016 13:30:53 +1000 MIME-Version: 1.0 In-Reply-To: <20160610065025.GR9226@voom.fritz.box> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="vmiw9l1rmuLxE0GJfkvC8BJg6bRCNMxfI" List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --vmiw9l1rmuLxE0GJfkvC8BJg6bRCNMxfI Content-Type: multipart/mixed; boundary="7HcMEPEgg7Gn8qv0b8NCUKPhwkhlANu2v" From: Alexey Kardashevskiy To: David Gibson Cc: "linuxppc-dev@lists.ozlabs.org" , Paul Mackerras , Alex Williamson , kvm-ppc , kvm@vger.kernel.org Message-ID: Subject: Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE References: <20160309054544.GM22546@voom.fritz.box> <56DFEACC.7030508@ozlabs.ru> <20160310052108.GV22546@voom.fritz.box> <56E1FEBE.5090602@ozlabs.ru> <20160315060400.GK15272@voom.fritz.box> <15389a41428.27cb.1ca38dd7e845b990cd13d431eb58563d@ozlabs.ru> <20160321051932.GJ23586@voom.redhat.com> <56F0932F.2050708@ozlabs.ru> <20160323030341.GT23586@voom.redhat.com> <367698b4-40eb-ec26-5154-b22d672322b7@ozlabs.ru> <20160610065025.GR9226@voom.fritz.box> In-Reply-To: <20160610065025.GR9226@voom.fritz.box> --7HcMEPEgg7Gn8qv0b8NCUKPhwkhlANu2v Content-Type: text/plain; charset=koi8-r Content-Transfer-Encoding: quoted-printable On 10/06/16 16:50, David Gibson wrote: > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote: >> On 23/03/16 14:03, David Gibson wrote: >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:= >>>> Uff, lost cc: list. Added back. Some comments below. >>>> >>>> >>>> On 03/21/2016 04:19 PM, David Gibson wrote: >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrot= e: >>>>>> On March 15, 2016 17:29:26 David Gibson wrote: >>>>>> >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wr= ote: >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote: >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy = wrote: >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevski= y wrote: >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/u= nmap >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a targ= et IOMMU >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems = and >>>>>>>>>>>> linked to IOMMU groups by the user space. >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM,= we need >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping= =2E >>>>>>>>>>>> >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN param= eter >>>>>>>>>>>> is added which accepts: >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual har= dware >>>>>>>>>>>> TCE table; >>>>>>>>>>>> - a LIOBN to assign to the found table. >>>>>>>>>>>> >>>>>>>>>>>> Before notifying KVM about new link, this check the group fo= r being >>>>>>>>>>>> registered with KVM device in order to release them at unexp= ected KVM >>>>>>>>>>>> finish. >>>>>>>>>>>> >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to= the user >>>>>>>>>>>> space. >>>>>>>>>>>> >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling= to let it >>>>>>>>>>>> link to a KVM module. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>>>>>>>> --- >>>>>>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- >>>>>>>>>>>> arch/powerpc/kvm/Kconfig | 1 + >>>>>>>>>>>> arch/powerpc/kvm/Makefile | 5 +- >>>>>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + >>>>>>>>>>>> include/uapi/linux/kvm.h | 9 +++ >>>>>>>>>>>> virt/kvm/vfio.c | 106 >>>>>>>> +++++++++++++++++++++++++++++ >>>>>>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> index ef51740..c0d3eb7 100644 >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups: >>>>>>>>>>>> >>>>>>>>>>>> KVM_DEV_VFIO_GROUP attributes: >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM devi= ce tracking >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>>>> + for the VFIO group. >>>>>>>>>>> >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, = in which >>>>>>>>>>> case it might be clearer to put them in a separate patch. >>>>>>>>>>> >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM= device >>>>>>>> tracking >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>>>> + for the VFIO group. >>>>>>>>>>>> >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file de= scriptor >>>>>>>>>>>> -for the VFIO group. >>>>>>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for = a VFIO group >>>>>>>>>>>> + kvm_device_attr.addr points to a struct: >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>>>> + __u32 argsz; >>>>>>>>>>>> + __s32 fd; >>>>>>>>>>>> + __u32 liobn; >>>>>>>>>>>> + __u8 pad[4]; >>>>>>>>>>>> + __u64 start_addr; >>>>>>>>>>>> + }; >>>>>>>>>>>> + where >>>>>>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; >>>>>>>>>>>> + @fd is a file descriptor for a VFIO group; >>>>>>>>>>>> + @liobn is a logical bus id to be associated with the grou= p; >>>>>>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus >>>>>>>>>>> >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you c= an call >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU = group? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated)= - for 32bit >>>>>>>>>> and 64bit windows, this is why @start_addr is there. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kco= nfig >>>>>>>>>>>> index 1059846..dfa3488 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 >>>>>>>>>>>> select KVM >>>>>>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE >>>>>>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT >>>>>>>>>>>> + select KVM_VFIO if VFIO >>>>>>>>>>>> ---help--- >>>>>>>>>>>> Support running unmodified book3s_64 and book3s_32 guest= kernels >>>>>>>>>>>> in virtual machines on book3s_64 host processors. >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Ma= kefile >>>>>>>>>>>> index 7f7b6d8..71f577c 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y :=3D -Ivirt/kvm -Iarch/powerpc/kvm= >>>>>>>>>>>> KVM :=3D ../../../virt/kvm >>>>>>>>>>>> >>>>>>>>>>>> common-objs-y =3D $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o= \ >>>>>>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o >>>>>>>>>>>> + $(KVM)/eventfd.o >>>>>>>>>>> >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.= I added >>>>>>>>>>> it (even though it didn't do anything until now) so that libv= irt >>>>>>>>>>> wouldn't choke when it finds it's not available. Obviously t= he new >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the dev= ice >>>>>>>>>>> itself should be available always. >>>>>>>>>> >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm= as a module. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> CFLAGS_e500_mmu.o :=3D -I. >>>>>>>>>>>> CFLAGS_e500_mmu_host.o :=3D -I. >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif >>>>>>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) +=3D \ >>>>>>>>>>>> book3s_xics.o >>>>>>>>>>>> >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) +=3D \ >>>>>>>>>>>> + $(KVM)/vfio.o \ >>>>>>>>>>>> + >>>>>>>>>>>> kvm-book3s_64-module-objs +=3D \ >>>>>>>>>>>> $(KVM)/kvm_main.o \ >>>>>>>>>>>> $(KVM)/eventfd.o \ >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/p= owerpc.c >>>>>>>>>>>> index 19aa59b..63f188d 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct = kvm >>>>>>>> *kvm, long ext) >>>>>>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE: >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE_64: >>>>>>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: >>>>>>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: >>>>>>>>>>>> case KVM_CAP_PPC_RTAS: >>>>>>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/k= vm.h >>>>>>>>>>>> index 080ffbf..f1abbea 100644 >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP 1 >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 >>>>>>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 >>>>>>>>>>>> >>>>>>>>>>>> enum kvm_device_type { >>>>>>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 =3D 1, >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { >>>>>>>>>>>> KVM_DEV_TYPE_MAX, >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>>>> + __u32 argsz; >>>>>>>>>>>> + __s32 fd; >>>>>>>>>>>> + __u32 liobn; >>>>>>>>>>>> + __u8 pad[4]; >>>>>>>>>>>> + __u64 start_addr; >>>>>>>>>>>> +}; >>>>>>>>>>>> + >>>>>>>>>>>> /* >>>>>>>>>>>> * ioctls for VM fds >>>>>>>>>>>> */ >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c >>>>>>>>>>>> index 1dd087d..87c771e 100644 >>>>>>>>>>>> --- a/virt/kvm/vfio.c >>>>>>>>>>>> +++ b/virt/kvm/vfio.c >>>>>>>>>>>> @@ -20,6 +20,10 @@ >>>>>>>>>>>> #include >>>>>>>>>>>> #include "vfio.h" >>>>>>>>>>>> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> +#include >>>>>>>>>>>> +#endif >>>>>>>>>>>> + >>>>>>>>>>>> struct kvm_vfio_group { >>>>>>>>>>>> struct list_head node; >>>>>>>>>>>> struct vfio_group *vfio_group; >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void >>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) >>>>>>>>>>>> symbol_put(vfio_group_put_external_user); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_grou= p *vfio_group) >>>>>>>>>>>> +{ >>>>>>>>>>>> + int (*fn)(struct vfio_group *); >>>>>>>>>>>> + int ret =3D -1; >>>>>>>>>>> >>>>>>>>>>> Should this be -ESOMETHING? >>>>>>>>>>> >>>>>>>>>>>> + fn =3D symbol_get(vfio_external_user_iommu_id); >>>>>>>>>>>> + if (!fn) >>>>>>>>>>>> + return ret; >>>>>>>>>>>> + >>>>>>>>>>>> + ret =3D fn(vfio_group); >>>>>>>>>>>> + >>>>>>>>>>>> + symbol_put(vfio_external_user_iommu_id); >>>>>>>>>>>> + >>>>>>>>>>>> + return ret; >>>>>>>>>>>> +} >>>>>>>>>>>> + >>>>>>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *v= fio_group) >>>>>>>>>>>> { >>>>>>>>>>>> long (*fn)(struct vfio_group *, unsigned long); >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(s= truct >>>>>>>> kvm_device *dev) >>>>>>>>>>>> mutex_unlock(&kv->lock); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *k= vm, >>>>>>>>>>>> + struct vfio_group *vfio_group) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attac= h >>>>>>>>>>> function? >>>>>>>>>> >>>>>>>>>> Having less patches which touch different maintainers areas is= better. I >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but = I can in >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest= view of TCE >>>>>>>>>> table". >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> +{ >>>>>>>>>>>> + int group_id; >>>>>>>>>>>> + struct iommu_group *grp; >>>>>>>>>>>> + >>>>>>>>>>>> + group_id =3D kvm_vfio_external_user_iommu_id(vfio_group); >>>>>>>>>>>> + grp =3D iommu_group_get_by_id(group_id); >>>>>>>>>>>> + if (grp) { >>>>>>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); >>>>>>>>>>>> + iommu_group_put(grp); >>>>>>>>>>>> + } >>>>>>>>>>>> +} >>>>>>>>>>>> +#endif >>>>>>>>>>>> + >>>>>>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long = attr, u64 arg) >>>>>>>>>>>> { >>>>>>>>>>>> struct kvm_vfio *kv =3D dev->private; >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kv= m_device >>>>>>>> *dev, long attr, u64 arg) >>>>>>>>>>>> continue; >>>>>>>>>>>> >>>>>>>>>>>> list_del(&kvg->node); >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>> >>>>>>>>>>> Better to make a no-op version of the call than have to #ifde= f at the >>>>>>>>>>> callsite. >>>>>>>>>> >>>>>>>>>> It is questionable. A x86 reader may decide that >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 = and get >>>>>>>>>> confused. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, >>>>>>>>>>>> + kvg->vfio_group); >>>>>>>>>>>> +#endif >>>>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); >>>>>>>>>>>> kfree(kvg); >>>>>>>>>>>> ret =3D 0; >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kv= m_device >>>>>>>> *dev, long attr, u64 arg) >>>>>>>>>>>> kvm_vfio_update_coherency(dev); >>>>>>>>>>>> >>>>>>>>>>>> return ret; >>>>>>>>>>>> + >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; >>>>>>>>>>>> + unsigned long minsz; >>>>>>>>>>>> + struct kvm_vfio *kv =3D dev->private; >>>>>>>>>>>> + struct vfio_group *vfio_group; >>>>>>>>>>>> + struct kvm_vfio_group *kvg; >>>>>>>>>>>> + struct fd f; >>>>>>>>>>>> + >>>>>>>>>>>> + minsz =3D offsetofend(struct kvm_vfio_spapr_tce_liobn, >>>>>>>>>>>> + start_addr); >>>>>>>>>>>> + >>>>>>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) >>>>>>>>>>>> + return -EFAULT; >>>>>>>>>>>> + >>>>>>>>>>>> + if (param.argsz < minsz) >>>>>>>>>>>> + return -EINVAL; >>>>>>>>>>>> + >>>>>>>>>>>> + f =3D fdget(param.fd); >>>>>>>>>>>> + if (!f.file) >>>>>>>>>>>> + return -EBADF; >>>>>>>>>>>> + >>>>>>>>>>>> + vfio_group =3D kvm_vfio_group_get_external_user(f.file); >>>>>>>>>>>> + fdput(f); >>>>>>>>>>>> + >>>>>>>>>>>> + if (IS_ERR(vfio_group)) >>>>>>>>>>>> + return PTR_ERR(vfio_group); >>>>>>>>>>>> + >>>>>>>>>>>> + ret =3D -ENOENT; >>>>>>>>>>> >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMM= U? It's >>>>>>>>>>> possible a kernel could be built for a platform supporting mu= ltiple >>>>>>>>>>> IOMMU types. >>>>>>>>>> >>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU= type is a >>>>>>>>>> VFIO container property, not a group property and here (KVM) w= e only have >>>>>>>>>> groups. >>>>>>>>> >>>>>>>>> Which, as mentioned previously, is broken. >>>>>>>> >>>>>>>> Which I am failing to follow you on this. >>>>>>>> >>>>>>>> What I am trying to achieve here is pretty much referencing a gr= oup so it >>>>>>>> cannot be reused. Plus LIOBNs. >>>>>>> >>>>>>> "Plus LIOBNs" is not a trivial change. You are establishing a li= nkage >>>>>> >from LIOBNs to groups. But that doesn't make sense; if mapping i= n one >>>>>>> (guest) LIOBN affects a group it must affect all groups in the >>>>>>> container. i.e. LIOBN->container is the natural mapping, *not* L= IOBN >>>>>>> to group. >>>>>> >>>>>> I can see your point but i don't see how to proceed now, I'm total= ly stuck. >>>>>> Pass container fd and then implement new api to lock containers so= mehow and >>>>> >>>>> I'm not really understanding what the question is about locking con= tainers. >>>>> >>>>>> enumerate groups when updating TCE table (including real mode)? >>>>> >>>>> Why do you need to enumerate groups? The groups within the contain= er >>>>> share a TCE table, so can't you just update that once? >>>> >>>> Well, they share a TCE table but they do not share TCE Kill (TCE cac= he >>>> invalidate) register address, it is still per PE but this does not m= atter >>>> here (pnv_pci_link_table_and_group() does that), just mentioned to c= omplete >>>> the picture. >>> >>> True, you'll need to enumerate the groups for invalidates. But you >>> need that already, right. >>> >>>>>> Plus new API when we remove a group from a container as the result= of guest >>>>>> PCI hot unplug? >>>>> >>>>> I assume you mean a kernel internal API, since it shouldn't need >>>>> anything else visible to userspace. Won't this happen naturally? >>>>> When the group is removed from the container, it will get its own T= CE >>>>> table instead of the previously shared one. >>>>> >>>>>>>> Passing a container fd does not make much >>>>>>>> sense here as the VFIO device would walk through groups, referen= ce them and >>>>>>>> that is it, there is no locking on VFIO containters and so far t= here was no >>>>>>>> need to teach KVM about containers. >>>>>>>> >>>>>>>> What do I miss now? >>>>>>> >>>>>>> Referencing the groups is essentially just a useful side effect. = The >>>>>>> important functionality is informing VFIO of the guest LIOBNs; an= d >>>>>>> LIOBNs map to containers, not groups. >>>>>> >>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container= =2E There >>>>>> can be one or many or none containers per liobn. >>>>> >>>>> Ah, true. >>>> >>>> So I need to add new kernel API for KVM to get table(s) from VFIO >>>> container(s). And invent some locking mechanism to prevent table(s) = (or >>>> associated container(s)) from going away, like >>>> vfio_group_get_external_user/vfio_group_put_external_user but for >>>> containers. Correct? >>> >>> Well, the container is attached to an fd, so if you get a reference o= n >>> the file* that should do it. >> >> I am still trying to think of how to implement this suggestion. >> >> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not ri= ght >> interface as it knows nothing about KVM. There is VFIO-KVM device but = it >> does not have idea about containers. >> >> So I have to: >> >> Wenever a container is created or removed, notify the VFIO-KVM device = by >> passing there a container fd. ok. >=20 > Actually, I don't think the vfio-kvm device is really useful here. It > was designed as a hack for a particular problem on x86. It certainly > could be extended to cover the information we need here, but I don't > think it's a particularly natural way of doing so. >=20 > The problem is that conveying the information from the vfio-kvm device > to the real mode H_PUT_TCE handler, which is what really needs it, > isn't particularly simpler than conveying that information from > anywhere else. >=20 >> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs = to >> what LIOBN so the realmode handlers could do the job. The real mode TC= E >> handlers get LIOBN, find a guest view table and update it. Now I want = to >> update the hardware table which is iommu_table attached to LIOBN. >> >> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a conta= iner fd. >> >> Now VFIO-KVM device needs to extract iommu_table's from the container.= >> These iommu_table pointers are stored in "struct tce_container" which = is >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. = So I >> cannot export and use that. >> >> The other way to go would be adding API to VFIO to enumerate IOMMU gro= ups >> in a container and use iommu_table pointers stored in iommu_table_grou= p of >> each group (in fact the very first group will be enough as multiple gr= oups >> in a container share the table). Adding vfio_container_get_groups() wh= en >> only first one is needed is quite tricky in terms of maintainers appro= vals. >> >> So what would be the right course of action? Thanks. >=20 > So, from the user side, you need to be able to bind a VFIO backend to > a particular guest IOMMU. This suggests a new ioctl() used in > conjunction with KVM_CREATE_SPAPR_TCE. Let's call it > KVM_SPAPR_TCE_BIND_VFIO. You'd use KVM_CREATE_SPAPR_TCE to make the > kernel aware of a LIOBN in the first place, then use > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN. > So it would be a VM ioctl, taking a LIOBN and a container fd. You'd > need a capability to go with it, and some way to unbind as well. This is what I had in the first place some years ago. And after 5-6 revie= ws I was told that there is a VFIO KVM and I should use it. > To implement that, the ioctl() would need to use a new vfio (kernel > internal) interface - which can be specific to only the spapr TCE > type. That would take the container fd, and return the list of > iommu_tables in some form or other (or various error conditions, > obviously). >=20 > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform > the kernel of the LIOBN. When the VFIO device is attached to the PHB, > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the > LIOBN. The ioctl() implementation uses the new special interface into > the spapr_tce vfio backend to get the list of iommu tables, which it > stores in some private format. Getting just a list of IOMMU groups would do too. Pushing such API is a problem, this is how I ended up with the current design. > The H_PUT_TCE implementation uses that > stored list of iommu tables to translate H_PUT_TCEs from the guest > into actions on the host IOMMU tables. >=20 > And, yes, the special interface to the spapr TCE vfio back end is kind > of a hack. That's what you get when you need to link to separate > kernel subsystems for performance reasons. One can argue if it is a hack, how is this hack better that the existing approach? :) Alex, could you please comment on David's suggestion? Thanks! --=20 Alexey --7HcMEPEgg7Gn8qv0b8NCUKPhwkhlANu2v-- --vmiw9l1rmuLxE0GJfkvC8BJg6bRCNMxfI Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJXX3ptAAoJEIYTPdgrwSC5fQ8P/0kczUGEO7CdzlXjnTxNw4TI GJzS8efLXVu22APTVFPzs5T71RufvUZW+ywQXv12mVdQepiUXd0M2N0oTjdy8ePs aY5EJ4eOR94V7uChA3crgciFVAwNbesfraeumQBGmIk99v0/DdWXROxODOdpuoDE 3a7zY0JDNkhBpmkJk6uXfBwg5Be5PufYuuybJ5Dg0IUAHFw3x1vW7L1prW82Q560 otleFMLkulY67AQK9tlKeO7RzIbtwVxvVS7MdwjnoLbn90MbxKSNF9732eAKwHkD rY09Bd15jzxG2eqvY/ZNr7+XXEzx4dQXui38sO60xyY50HCdw8sxNxnEC7CWBxMR M0qPHBhYg4AB6LxPBP9coHbP9/NN1TmXdmBIVeN6CG9usccjkgQPXJddiK8Ppou0 +d6zlsVU5dscI0vWy/cDk4dW/hq1AG1oqc7nFeXKnSzWD7N85qNNRWybitdyHGLU A4bn/h7l/YLxqdglASWRh6YRPedYA9fMGYJXZ/MYeGpPyRW29HOqXiS0EW/ZDfHl 8ZQ/xjE2H/e3ZL+4+wgrHjb9qP7+poHqd+wgOkUp2bvzBOc0rERIy+R1n6inaME9 u0Tz4zOTvhYwJ3iftwAqBQxXQcr+foYR0s2z6B/KPmyNBMZAGdQBuiq7QfXbcHoi ro57VrkFFMf1wwvomJoe =k6Z4 -----END PGP SIGNATURE----- --vmiw9l1rmuLxE0GJfkvC8BJg6bRCNMxfI--