On Tue, Jun 14, 2016 at 01:30:53PM +1000, Alexey Kardashevskiy wrote: > On 10/06/16 16:50, David Gibson wrote: > > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote: > >> On 23/03/16 14:03, David Gibson wrote: > >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote: > >>>> Uff, lost cc: list. Added back. Some comments below. > >>>> > >>>> > >>>> On 03/21/2016 04:19 PM, David Gibson wrote: > >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote: > >>>>>> On March 15, 2016 17:29:26 David Gibson wrote: > >>>>>> > >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote: > >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote: > >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: > >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: > >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: > >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap > >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU > >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and > >>>>>>>>>>>> linked to IOMMU groups by the user space. > >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need > >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping. > >>>>>>>>>>>> > >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter > >>>>>>>>>>>> is added which accepts: > >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware > >>>>>>>>>>>> TCE table; > >>>>>>>>>>>> - a LIOBN to assign to the found table. > >>>>>>>>>>>> > >>>>>>>>>>>> Before notifying KVM about new link, this check the group for being > >>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM > >>>>>>>>>>>> finish. > >>>>>>>>>>>> > >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > >>>>>>>>>>>> space. > >>>>>>>>>>>> > >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it > >>>>>>>>>>>> link to a KVM module. > >>>>>>>>>>>> > >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy > >>>>>>>>>>>> --- > >>>>>>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- > >>>>>>>>>>>> arch/powerpc/kvm/Kconfig | 1 + > >>>>>>>>>>>> arch/powerpc/kvm/Makefile | 5 +- > >>>>>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + > >>>>>>>>>>>> include/uapi/linux/kvm.h | 9 +++ > >>>>>>>>>>>> virt/kvm/vfio.c | 106 > >>>>>>>> +++++++++++++++++++++++++++++ > >>>>>>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) > >>>>>>>>>>>> > >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>>>> index ef51740..c0d3eb7 100644 > >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups: > >>>>>>>>>>>> > >>>>>>>>>>>> KVM_DEV_VFIO_GROUP attributes: > >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking > >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>>>> + for the VFIO group. > >>>>>>>>>>> > >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which > >>>>>>>>>>> case it might be clearer to put them in a separate patch. > >>>>>>>>>>> > >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device > >>>>>>>> tracking > >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>>>> + for the VFIO group. > >>>>>>>>>>>> > >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>>>> -for the VFIO group. > >>>>>>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group > >>>>>>>>>>>> + kvm_device_attr.addr points to a struct: > >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn { > >>>>>>>>>>>> + __u32 argsz; > >>>>>>>>>>>> + __s32 fd; > >>>>>>>>>>>> + __u32 liobn; > >>>>>>>>>>>> + __u8 pad[4]; > >>>>>>>>>>>> + __u64 start_addr; > >>>>>>>>>>>> + }; > >>>>>>>>>>>> + where > >>>>>>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; > >>>>>>>>>>>> + @fd is a file descriptor for a VFIO group; > >>>>>>>>>>>> + @liobn is a logical bus id to be associated with the group; > >>>>>>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus > >>>>>>>>>>> > >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call > >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group? > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit > >>>>>>>>>> and 64bit windows, this is why @start_addr is there. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig > >>>>>>>>>>>> index 1059846..dfa3488 100644 > >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig > >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig > >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 > >>>>>>>>>>>> select KVM > >>>>>>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE > >>>>>>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT > >>>>>>>>>>>> + select KVM_VFIO if VFIO > >>>>>>>>>>>> ---help--- > >>>>>>>>>>>> Support running unmodified book3s_64 and book3s_32 guest kernels > >>>>>>>>>>>> in virtual machines on book3s_64 host processors. > >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile > >>>>>>>>>>>> index 7f7b6d8..71f577c 100644 > >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile > >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile > >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm > >>>>>>>>>>>> KVM := ../../../virt/kvm > >>>>>>>>>>>> > >>>>>>>>>>>> common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ > >>>>>>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o > >>>>>>>>>>>> + $(KVM)/eventfd.o > >>>>>>>>>>> > >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case. I added > >>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt > >>>>>>>>>>> wouldn't choke when it finds it's not available. Obviously the new > >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device > >>>>>>>>>>> itself should be available always. > >>>>>>>>>> > >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>>> CFLAGS_e500_mmu.o := -I. > >>>>>>>>>>>> CFLAGS_e500_mmu_host.o := -I. > >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif > >>>>>>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ > >>>>>>>>>>>> book3s_xics.o > >>>>>>>>>>>> > >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ > >>>>>>>>>>>> + $(KVM)/vfio.o \ > >>>>>>>>>>>> + > >>>>>>>>>>>> kvm-book3s_64-module-objs += \ > >>>>>>>>>>>> $(KVM)/kvm_main.o \ > >>>>>>>>>>>> $(KVM)/eventfd.o \ > >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > >>>>>>>>>>>> index 19aa59b..63f188d 100644 > >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c > >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c > >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm > >>>>>>>> *kvm, long ext) > >>>>>>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 > >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE: > >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE_64: > >>>>>>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: > >>>>>>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: > >>>>>>>>>>>> case KVM_CAP_PPC_RTAS: > >>>>>>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: > >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > >>>>>>>>>>>> index 080ffbf..f1abbea 100644 > >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h > >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h > >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { > >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP 1 > >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 > >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 > >>>>>>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 > >>>>>>>>>>>> > >>>>>>>>>>>> enum kvm_device_type { > >>>>>>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, > >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { > >>>>>>>>>>>> KVM_DEV_TYPE_MAX, > >>>>>>>>>>>> }; > >>>>>>>>>>>> > >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn { > >>>>>>>>>>>> + __u32 argsz; > >>>>>>>>>>>> + __s32 fd; > >>>>>>>>>>>> + __u32 liobn; > >>>>>>>>>>>> + __u8 pad[4]; > >>>>>>>>>>>> + __u64 start_addr; > >>>>>>>>>>>> +}; > >>>>>>>>>>>> + > >>>>>>>>>>>> /* > >>>>>>>>>>>> * ioctls for VM fds > >>>>>>>>>>>> */ > >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c > >>>>>>>>>>>> index 1dd087d..87c771e 100644 > >>>>>>>>>>>> --- a/virt/kvm/vfio.c > >>>>>>>>>>>> +++ b/virt/kvm/vfio.c > >>>>>>>>>>>> @@ -20,6 +20,10 @@ > >>>>>>>>>>>> #include > >>>>>>>>>>>> #include "vfio.h" > >>>>>>>>>>>> > >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>>>> +#include > >>>>>>>>>>>> +#endif > >>>>>>>>>>>> + > >>>>>>>>>>>> struct kvm_vfio_group { > >>>>>>>>>>>> struct list_head node; > >>>>>>>>>>>> struct vfio_group *vfio_group; > >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void > >>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) > >>>>>>>>>>>> symbol_put(vfio_group_put_external_user); > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) > >>>>>>>>>>>> +{ > >>>>>>>>>>>> + int (*fn)(struct vfio_group *); > >>>>>>>>>>>> + int ret = -1; > >>>>>>>>>>> > >>>>>>>>>>> Should this be -ESOMETHING? > >>>>>>>>>>> > >>>>>>>>>>>> + fn = symbol_get(vfio_external_user_iommu_id); > >>>>>>>>>>>> + if (!fn) > >>>>>>>>>>>> + return ret; > >>>>>>>>>>>> + > >>>>>>>>>>>> + ret = fn(vfio_group); > >>>>>>>>>>>> + > >>>>>>>>>>>> + symbol_put(vfio_external_user_iommu_id); > >>>>>>>>>>>> + > >>>>>>>>>>>> + return ret; > >>>>>>>>>>>> +} > >>>>>>>>>>>> + > >>>>>>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) > >>>>>>>>>>>> { > >>>>>>>>>>>> long (*fn)(struct vfio_group *, unsigned long); > >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct > >>>>>>>> kvm_device *dev) > >>>>>>>>>>>> mutex_unlock(&kv->lock); > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, > >>>>>>>>>>>> + struct vfio_group *vfio_group) > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach > >>>>>>>>>>> function? > >>>>>>>>>> > >>>>>>>>>> Having less patches which touch different maintainers areas is better. I > >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in > >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE > >>>>>>>>>> table". > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> +{ > >>>>>>>>>>>> + int group_id; > >>>>>>>>>>>> + struct iommu_group *grp; > >>>>>>>>>>>> + > >>>>>>>>>>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); > >>>>>>>>>>>> + grp = iommu_group_get_by_id(group_id); > >>>>>>>>>>>> + if (grp) { > >>>>>>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); > >>>>>>>>>>>> + iommu_group_put(grp); > >>>>>>>>>>>> + } > >>>>>>>>>>>> +} > >>>>>>>>>>>> +#endif > >>>>>>>>>>>> + > >>>>>>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg) > >>>>>>>>>>>> { > >>>>>>>>>>>> struct kvm_vfio *kv = dev->private; > >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device > >>>>>>>> *dev, long attr, u64 arg) > >>>>>>>>>>>> continue; > >>>>>>>>>>>> > >>>>>>>>>>>> list_del(&kvg->node); > >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>>> > >>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the > >>>>>>>>>>> callsite. > >>>>>>>>>> > >>>>>>>>>> It is questionable. A x86 reader may decide that > >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get > >>>>>>>>>> confused. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, > >>>>>>>>>>>> + kvg->vfio_group); > >>>>>>>>>>>> +#endif > >>>>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); > >>>>>>>>>>>> kfree(kvg); > >>>>>>>>>>>> ret = 0; > >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device > >>>>>>>> *dev, long attr, u64 arg) > >>>>>>>>>>>> kvm_vfio_update_coherency(dev); > >>>>>>>>>>>> > >>>>>>>>>>>> return ret; > >>>>>>>>>>>> + > >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { > >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; > >>>>>>>>>>>> + unsigned long minsz; > >>>>>>>>>>>> + struct kvm_vfio *kv = dev->private; > >>>>>>>>>>>> + struct vfio_group *vfio_group; > >>>>>>>>>>>> + struct kvm_vfio_group *kvg; > >>>>>>>>>>>> + struct fd f; > >>>>>>>>>>>> + > >>>>>>>>>>>> + minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn, > >>>>>>>>>>>> + start_addr); > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) > >>>>>>>>>>>> + return -EFAULT; > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (param.argsz < minsz) > >>>>>>>>>>>> + return -EINVAL; > >>>>>>>>>>>> + > >>>>>>>>>>>> + f = fdget(param.fd); > >>>>>>>>>>>> + if (!f.file) > >>>>>>>>>>>> + return -EBADF; > >>>>>>>>>>>> + > >>>>>>>>>>>> + vfio_group = kvm_vfio_group_get_external_user(f.file); > >>>>>>>>>>>> + fdput(f); > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (IS_ERR(vfio_group)) > >>>>>>>>>>>> + return PTR_ERR(vfio_group); > >>>>>>>>>>>> + > >>>>>>>>>>>> + ret = -ENOENT; > >>>>>>>>>>> > >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU? It's > >>>>>>>>>>> possible a kernel could be built for a platform supporting multiple > >>>>>>>>>>> IOMMU types. > >>>>>>>>>> > >>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a > >>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have > >>>>>>>>>> groups. > >>>>>>>>> > >>>>>>>>> Which, as mentioned previously, is broken. > >>>>>>>> > >>>>>>>> Which I am failing to follow you on this. > >>>>>>>> > >>>>>>>> What I am trying to achieve here is pretty much referencing a group so it > >>>>>>>> cannot be reused. Plus LIOBNs. > >>>>>>> > >>>>>>> "Plus LIOBNs" is not a trivial change. You are establishing a linkage > >>>>>> >from LIOBNs to groups. But that doesn't make sense; if mapping in one > >>>>>>> (guest) LIOBN affects a group it must affect all groups in the > >>>>>>> container. i.e. LIOBN->container is the natural mapping, *not* LIOBN > >>>>>>> to group. > >>>>>> > >>>>>> I can see your point but i don't see how to proceed now, I'm totally stuck. > >>>>>> Pass container fd and then implement new api to lock containers somehow and > >>>>> > >>>>> I'm not really understanding what the question is about locking containers. > >>>>> > >>>>>> enumerate groups when updating TCE table (including real mode)? > >>>>> > >>>>> Why do you need to enumerate groups? The groups within the container > >>>>> share a TCE table, so can't you just update that once? > >>>> > >>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache > >>>> invalidate) register address, it is still per PE but this does not matter > >>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete > >>>> the picture. > >>> > >>> True, you'll need to enumerate the groups for invalidates. But you > >>> need that already, right. > >>> > >>>>>> Plus new API when we remove a group from a container as the result of guest > >>>>>> PCI hot unplug? > >>>>> > >>>>> I assume you mean a kernel internal API, since it shouldn't need > >>>>> anything else visible to userspace. Won't this happen naturally? > >>>>> When the group is removed from the container, it will get its own TCE > >>>>> table instead of the previously shared one. > >>>>> > >>>>>>>> Passing a container fd does not make much > >>>>>>>> sense here as the VFIO device would walk through groups, reference them and > >>>>>>>> that is it, there is no locking on VFIO containters and so far there was no > >>>>>>>> need to teach KVM about containers. > >>>>>>>> > >>>>>>>> What do I miss now? > >>>>>>> > >>>>>>> Referencing the groups is essentially just a useful side effect. The > >>>>>>> important functionality is informing VFIO of the guest LIOBNs; and > >>>>>>> LIOBNs map to containers, not groups. > >>>>>> > >>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There > >>>>>> can be one or many or none containers per liobn. > >>>>> > >>>>> Ah, true. > >>>> > >>>> So I need to add new kernel API for KVM to get table(s) from VFIO > >>>> container(s). And invent some locking mechanism to prevent table(s) (or > >>>> associated container(s)) from going away, like > >>>> vfio_group_get_external_user/vfio_group_put_external_user but for > >>>> containers. Correct? > >>> > >>> Well, the container is attached to an fd, so if you get a reference on > >>> the file* that should do it. > >> > >> I am still trying to think of how to implement this suggestion. > >> > >> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right > >> interface as it knows nothing about KVM. There is VFIO-KVM device but it > >> does not have idea about containers. > >> > >> So I have to: > >> > >> Wenever a container is created or removed, notify the VFIO-KVM device by > >> passing there a container fd. ok. > > > > Actually, I don't think the vfio-kvm device is really useful here. It > > was designed as a hack for a particular problem on x86. It certainly > > could be extended to cover the information we need here, but I don't > > think it's a particularly natural way of doing so. > > > > The problem is that conveying the information from the vfio-kvm device > > to the real mode H_PUT_TCE handler, which is what really needs it, > > isn't particularly simpler than conveying that information from > > anywhere else. > > > >> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to > >> what LIOBN so the realmode handlers could do the job. The real mode TCE > >> handlers get LIOBN, find a guest view table and update it. Now I want to > >> update the hardware table which is iommu_table attached to LIOBN. > >> > >> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd. > >> > >> Now VFIO-KVM device needs to extract iommu_table's from the container. > >> These iommu_table pointers are stored in "struct tce_container" which is > >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I > >> cannot export and use that. > >> > >> The other way to go would be adding API to VFIO to enumerate IOMMU groups > >> in a container and use iommu_table pointers stored in iommu_table_group of > >> each group (in fact the very first group will be enough as multiple groups > >> in a container share the table). Adding vfio_container_get_groups() when > >> only first one is needed is quite tricky in terms of maintainers approvals. > >> > >> So what would be the right course of action? Thanks. > > > > So, from the user side, you need to be able to bind a VFIO backend to > > a particular guest IOMMU. This suggests a new ioctl() used in > > conjunction with KVM_CREATE_SPAPR_TCE. Let's call it > > KVM_SPAPR_TCE_BIND_VFIO. You'd use KVM_CREATE_SPAPR_TCE to make the > > kernel aware of a LIOBN in the first place, then use > > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN. > > So it would be a VM ioctl, taking a LIOBN and a container fd. You'd > > need a capability to go with it, and some way to unbind as well. > > This is what I had in the first place some years ago. And after 5-6 reviews > I was told that there is a VFIO KVM and I should use it. I suspect that's because Alex didn't fully understand what we required here. The primary thing here is that we need to link guest visible LIOBNs to host-visible VFIO containers. Your comments tended to emphasise the fact of giving KVM a list of VFIO groups, which is a side effect of the above, but not really the main point - it does, however, sound misleadingly like what the kvm-vfio device already does. > > To implement that, the ioctl() would need to use a new vfio (kernel > > internal) interface - which can be specific to only the spapr TCE > > type. That would take the container fd, and return the list of > > iommu_tables in some form or other (or various error conditions, > > obviously). > > > > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform > > the kernel of the LIOBN. When the VFIO device is attached to the PHB, > > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the > > LIOBN. The ioctl() implementation uses the new special interface into > > the spapr_tce vfio backend to get the list of iommu tables, which it > > stores in some private format. > > Getting just a list of IOMMU groups would do too. Pushing such API is a > problem, this is how I ended up with the current design. > > > > The H_PUT_TCE implementation uses that > > stored list of iommu tables to translate H_PUT_TCEs from the guest > > into actions on the host IOMMU tables. > > > > And, yes, the special interface to the spapr TCE vfio back end is kind > > of a hack. That's what you get when you need to link to separate > > kernel subsystems for performance reasons. > > One can argue if it is a hack, how is this hack better that the existing > approach? :) Because this hack is only on a kernel internal interface, rather than impactin the user visible interface. > Alex, could you please comment on David's suggestion? Thanks! -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson