Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on Intel platforms, allows address space sharing between device DMA and applications. SVA can reduce programming complexity and enhance security.

This VFIO series is intended to expose SVA usage to VMs, i.e. sharing guest application address spaces with passthru devices. This is called vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU changes. The IOMMU and QEMU changes are in separate series (listed in the "Related series").

The high-level architecture for SVA virtualization is as below; the key design of vSVA support is to utilize the dual-stage IOMMU translation (also known as IOMMU nesting translation) capability of the host IOMMU.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                     |
    |             |                     V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'

Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

Patch Overview:
 1. reports IOMMU nesting info to userspace (patches 0001, 0002, 0003, 0015, 0016)
 2. vfio support for PASID allocation and free for VMs (patches 0004, 0005, 0007)
 3. a fix to a revisit in the intel iommu driver (patch 0006)
 4. vfio support for binding guest page table to host (patches 0008, 0009, 0010)
 5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
 6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
 7. expose PASID capability to VM (patch 0013)
 8.
    add doc for VFIO dual stage control (patch 0014)

The complete vSVA kernel upstream patches are divided into three phases:
 1. Common APIs and PCI device direct assignment
 2. IOMMU-backed Mediated Device assignment
 3. Page Request Services (PRS) support

This patchset is aiming for phase 1 and phase 2, and is based on the below series from Jacob:
 *) [PATCH v8 0/7] IOMMU user API enhancement - wip
    https://lore.kernel.org/linux-iommu/1598898300-65475-1-git-send-email-jacob.jun.pan@linux.intel.com/
 *) [PATCH v2 0/9] IOASID extensions for guest SVA - wip
    https://lore.kernel.org/linux-iommu/1598070918-21321-1-git-send-email-jacob.jun.pan@linux.intel.com/

The complete set for current vSVA can be found in the below branch:
 https://github.com/luxis1999/linux-vsva.git  vsva-linux-5.9-rc2-v7

The corresponding QEMU patch series is included in the below branch:
 https://github.com/luxis1999/qemu.git  vsva_5.9_rc2_qemu_rfcv10

Regards,
Yi Liu

Changelog:
 - Patch v6 -> Patch v7:
   a) drop [PATCH v6 01/15] of v6 as it's merged by Alex.
   b) rebase on Jacob's v8 IOMMU uapi enhancement and v2 IOASID extension patchset.
   c) Address comments against v6 from Alex and Eric.
   Patch v6: https://lore.kernel.org/kvm/1595917664-33276-1-git-send-email-yi.l.liu@intel.com/

 - Patch v5 -> Patch v6:
   a) Address comments against v5 from Eric.
   b) rebase on Jacob's v6 IOMMU uapi enhancement.
   Patch v5: https://lore.kernel.org/kvm/1594552870-55687-1-git-send-email-yi.l.liu@intel.com/

 - Patch v4 -> Patch v5:
   a) Address comments against v4.
   Patch v4: https://lore.kernel.org/kvm/1593861989-35920-1-git-send-email-yi.l.liu@intel.com/

 - Patch v3 -> Patch v4:
   a) Address comments against v3.
   b) Add rb from Stefan on patch 14/15.
   Patch v3: https://lore.kernel.org/kvm/1592988927-48009-1-git-send-email-yi.l.liu@intel.com/

 - Patch v2 -> Patch v3:
   a) Rebase on top of Jacob's v3 iommu uapi patchset.
   b) Address comments from Kevin and Stefan Hajnoczi.
   c) Reuse DOMAIN_ATTR_NESTING to get iommu nesting info.
   d) Drop [PATCH v2 07/15] iommu/uapi: Add iommu_gpasid_unbind_data.
   Patch v2: https://lore.kernel.org/kvm/1591877734-66527-1-git-send-email-yi.l.liu@intel.com/

 - Patch v1 -> Patch v2:
   a) Refactor vfio_iommu_type1_ioctl() per suggestion from Christoph Hellwig.
   b) Re-sequence the patch series for better bisect support.
   c) Report IOMMU nesting cap info in detail instead of a format as in v1.
   d) Enforce one group per nesting type container for the vfio iommu type1 driver.
   e) Move the vfio_mm related code out of vfio.c into a separate vfio_pasid.ko.
   f) Add PASID ownership check in the IOMMU driver.
   g) Adapted to the latest IOMMU UAPI design. Removed the IOMMU UAPI version check. Added iommu_gpasid_unbind_data for unbind requests from userspace.
   h) Define a single ioctl VFIO_IOMMU_NESTING_OP for bind/unbind_gtbl and cache_invld.
   i) Document dual stage control in vfio.rst.
   Patch v1: https://lore.kernel.org/kvm/1584880325-10561-1-git-send-email-yi.l.liu@intel.com/

 - RFC v3 -> Patch v1:
   a) Address comments on the PASID request (alloc/free) path.
   b) Report PASID alloc/free availability to user-space.
   c) Add a vfio_iommu_type1 parameter to support pasid quota tuning.
   d) Adjusted to the latest ioasid code implementation, e.g. removed the code for tracking the allocated PASIDs as the latest ioasid code will track them; VFIO could use ioasid_free_set() to free all PASIDs.
   RFC v3: https://lore.kernel.org/kvm/1580299912-86084-1-git-send-email-yi.l.liu@intel.com/

 - RFC v2 -> v3:
   a) Refine the whole patchset to fit the parts roughly laid out in this series.
   b) Adds a complete vfio PASID management framework, e.g. pasid alloc, free, reclaim on VM crash/down and a per-VM PASID quota to prevent PASID abuse.
   c) Adds IOMMU uAPI version check and page table format check to ensure version compatibility and hardware compatibility.
   d) Adds vSVA vfio support for IOMMU-backed mdevs.
   RFC v2: https://lore.kernel.org/kvm/1571919983-3231-1-git-send-email-yi.l.liu@intel.com/

 - RFC v1 -> v2:
   Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
   RFC v1: https://lore.kernel.org/kvm/1562324772-3084-1-git-send-email-yi.l.liu@intel.com/

---

Eric Auger (1):
  vfio: Document dual stage control

Liu Yi L (14):
  iommu: Report domain nesting info
  iommu/smmu: Report empty domain nesting info
  vfio/type1: Report iommu nesting info to userspace
  vfio: Add PASID allocation/free support
  iommu/vt-d: Support setting ioasid set to domain
  iommu/vt-d: Remove get_task_mm() in bind_gpasid()
  vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
  iommu/vt-d: Check ownership for PASIDs from user-space
  vfio/type1: Support binding guest page tables to PASID
  vfio/type1: Allow invalidating first-level/stage IOMMU cache
  vfio/type1: Add vSVA support for IOMMU-backed mdevs
  vfio/pci: Expose PCIe PASID capability to guest
  iommu/vt-d: Only support nesting when nesting caps are consistent
    across iommu units
  iommu/vt-d: Support reporting nesting capability info

Yi Sun (1):
  iommu: Pass domain to sva_unbind_gpasid()

 Documentation/driver-api/vfio.rst           |  76 ++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  29 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  29 +-
 drivers/iommu/intel/iommu.c                 | 137 +++++++++-
 drivers/iommu/intel/svm.c                   |  43 +--
 drivers/iommu/iommu.c                       |   2 +-
 drivers/vfio/Kconfig                        |   6 +
 drivers/vfio/Makefile                       |   1 +
 drivers/vfio/pci/vfio_pci_config.c          |   2 +-
 drivers/vfio/vfio_iommu_type1.c             | 395 +++++++++++++++++++++++++++-
 drivers/vfio/vfio_pasid.c                   | 283 ++++++++++++++++++++
 include/linux/intel-iommu.h                 |  25 +-
 include/linux/iommu.h                       |   4 +-
 include/linux/vfio.h                        |  54 ++++
 include/uapi/linux/iommu.h                  |  76 ++++++
 include/uapi/linux/vfio.h                   | 101 +++++++
 16 files changed, 1220 insertions(+), 43 deletions(-)
 create mode 100644 drivers/vfio/vfio_pasid.c

-- 
2.7.4

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
IOMMUs that support nesting translation need to report the capability info to userspace. It gives information about requirements the userspace needs to implement, plus other features characterizing the physical implementation.

This patch introduces a new IOMMU UAPI struct that gives information about the nesting capabilities and features. This struct is supposed to be returned by iommu_domain_get_attr() with the DOMAIN_ATTR_NESTING attribute parameter, on a domain whose type has been set to nesting.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) rephrase the commit message, replace the @data[] field in struct
   iommu_nesting_info with a union, per comments from Eric Auger.

v5 -> v6:
*) rephrase the feature notes per comments from Eric Auger.
*) rename @size of struct iommu_nesting_info to @argsz.

v4 -> v5:
*) address comments from Eric Auger.

v3 -> v4:
*) split the SMMU driver changes into a separate patch.
*) move the @addr_width and @pasid_bits from the vendor specific part
   to the generic part.
*) tweak the description for the @features field of struct
   iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info.

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean's
   suggestion.
---
 include/uapi/linux/iommu.h | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 1ebc23d..ff987e4 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -341,4 +341,80 @@ struct iommu_gpasid_bind_data {
 	} vendor;
 };
 
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info.
+ *
+ * @flags:	VT-d specific flags. Currently reserved for future
+ *		extension. Must be set to 0.
+ * @cap_reg:	Describe basic capabilities as defined in VT-d capability
+ *		register.
+ * @ecap_reg:	Describe the extended capabilities as defined in VT-d
+ *		extended capability register.
+ */
+struct iommu_nesting_info_vtd {
+	__u32	flags;
+	__u64	cap_reg;
+	__u64	ecap_reg;
+};
+
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ *				userspace should check it before using
+ *				nesting capability.
+ *
+ * @argsz:	size of the whole structure.
+ * @flags:	currently reserved for future extension. Must be set to 0.
+ * @format:	PASID table entry format, the same definition as struct
+ *		iommu_gpasid_bind_data @format.
+ * @features:	supported nesting features.
+ * @addr_width:	the output addr width of first level/stage translation.
+ * @pasid_bits:	maximum supported PASID bits, 0 represents no PASID
+ *		support.
+ * @vendor:	vendor specific data, structure type can be deduced from
+ *		the @format field.
+ *
+ * +===============+======================================================+
+ * | feature       | Notes                                                |
+ * +===============+======================================================+
+ * | SYSWIDE_PASID | IOMMU vendor driver sets it to mandate userspace     |
+ * |               | to allocate PASID from kernel. All PASID allocation  |
+ * |               | and free must be mediated through the IOMMU UAPI.    |
+ * +---------------+------------------------------------------------------+
+ * | BIND_PGTBL    | IOMMU vendor driver sets it to mandate userspace to  |
+ * |               | bind the first level/stage page table to associated  |
+ * |               | PASID (either the one specified in bind request or   |
+ * |               | the default PASID of the iommu domain), through the  |
+ * |               | IOMMU UAPI.                                          |
+ * +---------------+------------------------------------------------------+
+ * | CACHE_INVLD   | IOMMU vendor driver sets it to mandate userspace to  |
+ * |               | explicitly invalidate the IOMMU cache through the    |
+ * |               | IOMMU UAPI according to vendor-specific requirements |
+ * |               | when changing the 1st level/stage page table.        |
+ * +---------------+------------------------------------------------------+
+ *
+ * data struct types defined for @format:
+ * +================================+=====================================+
+ * | @format                        | data struct                         |
+ * +================================+=====================================+
+ * | IOMMU_PASID_FORMAT_INTEL_VTD   | struct iommu_nesting_info_vtd       |
+ * +--------------------------------+-------------------------------------+
+ *
+ */
+struct iommu_nesting_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	format;
+#define IOMMU_NESTING_FEAT_SYSWIDE_PASID	(1 << 0)
+#define IOMMU_NESTING_FEAT_BIND_PGTBL		(1 << 1)
+#define IOMMU_NESTING_FEAT_CACHE_INVLD		(1 << 2)
+	__u32	features;
+	__u16	addr_width;
+	__u16	pasid_bits;
+	__u8	padding[12];
+	/* Vendor specific data */
+	union {
+		struct iommu_nesting_info_vtd	vtd;
+	} vendor;
+};
+
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4
This patch is added as instead of returning a boolean for DOMAIN_ATTR_NESTING, iommu_domain_get_attr() should return an iommu_nesting_info handle. For now, return an empty nesting info struct for now as true nesting is not yet supported by the SMMUs. Cc: Will Deacon <will@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> --- v5 -> v6: *) add review-by from Eric Auger. v4 -> v5: *) address comments from Eric Auger. --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++++++++++++++++++++++++++-- drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 +++++++++++++++++++++++++++-- 2 files changed, 54 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 7196207..016e2e5 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain, + void *data) +{ + struct iommu_nesting_info *info = (struct iommu_nesting_info *)data; + unsigned int size; + + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + return -ENODEV; + + size = sizeof(struct iommu_nesting_info); + + /* + * if provided buffer size is smaller than expected, should + * return 0 and also the expected buffer size to caller. 
+ */ + if (info->argsz < size) { + info->argsz = size; + return 0; + } + + /* report an empty iommu_nesting_info for now */ + memset(info, 0x0, size); + info->argsz = size; + return 0; +} + static int arm_smmu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) { @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain, case IOMMU_DOMAIN_UNMANAGED: switch (attr) { case DOMAIN_ATTR_NESTING: - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); - return 0; + return arm_smmu_domain_nesting_info(smmu_domain, data); default: return -ENODEV; } diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 09c42af9..368486f 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1510,6 +1510,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain, + void *data) +{ + struct iommu_nesting_info *info = (struct iommu_nesting_info *)data; + unsigned int size; + + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + return -ENODEV; + + size = sizeof(struct iommu_nesting_info); + + /* + * if provided buffer size is smaller than expected, should + * return 0 and also the expected buffer size to caller. 
+ */ + if (info->argsz < size) { + info->argsz = size; + return 0; + } + + /* report an empty iommu_nesting_info for now */ + memset(info, 0x0, size); + info->argsz = size; + return 0; +} + static int arm_smmu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) { @@ -1519,8 +1545,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain, case IOMMU_DOMAIN_UNMANAGED: switch (attr) { case DOMAIN_ATTR_NESTING: - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); - return 0; + return arm_smmu_domain_nesting_info(smmu_domain, data); default: return -ENODEV; } -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
This patch exports iommu nesting capability info to user space through VFIO. Userspace is expected to check this info for supported uAPIs (e.g. PASID alloc/free, bind page table, and cache invalidation) and the vendor specific format information for first level/stage page table that will be bound to. The nesting info is available only after container set to be NESTED type. Current implementation imposes one limitation - one nesting container should include at most one iommu group. The philosophy of vfio container is having all groups/devices within the container share the same IOMMU context. When vSVA is enabled, one IOMMU context could include one 2nd- level address space and multiple 1st-level address spaces. While the 2nd-level address space is reasonably sharable by multiple groups, blindly sharing 1st-level address spaces across all groups within the container might instead break the guest expectation. In the future sub/super container concept might be introduced to allow partial address space sharing within an IOMMU context. But for now let's go with this restriction by requiring singleton container for using nesting iommu features. Below link has the related discussion about this decision. https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/ This patch also changes the NESTING type container behaviour. Something that would have succeeded before will now fail: Before this series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even if the SMMU didn't support stage-2, as the driver would have silently fallen back on stage-1 mappings (which work exactly the same as stage-2 only since there was no nesting supported). After the series, we do check for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the ioctl fails. But it should be a good fix and completely harmless. Detail can be found in below link as well. 
https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/ Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> --- v6 -> v7: *) using vfio_info_add_capability() for adding nesting cap per suggestion from Eric. v5 -> v6: *) address comments against v5 from Eric Auger. *) don't report nesting cap to userspace if the nesting_info->format is invalid. v4 -> v5: *) address comments from Eric Auger. *) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as cap is much "cheap", if needs extension in future, just define another cap. https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/ v3 -> v4: *) address comments against v3. v1 -> v2: *) added in v2 --- drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++------ include/uapi/linux/vfio.h | 19 +++++++++ 2 files changed, 99 insertions(+), 12 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index c992973..3c0048b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit, "Maximum number of user DMA mappings per container (65535)."); struct vfio_iommu { - struct list_head domain_list; - struct list_head iova_list; - struct vfio_domain *external_domain; /* domain for external user */ - struct mutex lock; - struct rb_root dma_list; - struct blocking_notifier_head notifier; - unsigned int dma_avail; - uint64_t pgsize_bitmap; - bool v2; - bool nesting; - bool dirty_page_tracking; - bool pinned_page_dirty_scope; + struct list_head domain_list; + struct list_head iova_list; + /* domain for external user */ + struct vfio_domain *external_domain; + struct mutex lock; + struct rb_root 
dma_list; + struct blocking_notifier_head notifier; + unsigned int dma_avail; + uint64_t pgsize_bitmap; + bool v2; + bool nesting; + bool dirty_page_tracking; + bool pinned_page_dirty_scope; + struct iommu_nesting_info *nesting_info; }; struct vfio_domain { @@ -130,6 +132,9 @@ struct vfio_regions { #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \ (!list_empty(&iommu->domain_list)) +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \ + (!list_empty(&(iommu)->domain_list))) + #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE) /* @@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, list_splice_tail(iova_copy, iova); } + +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) +{ + kfree(iommu->nesting_info); + iommu->nesting_info = NULL; +} + static int vfio_iommu_type1_attach_group(void *iommu_data, struct iommu_group *iommu_group) { @@ -2022,6 +2034,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, } } + /* Nesting type container can include only one group */ + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) { + mutex_unlock(&iommu->lock); + return -EINVAL; + } + group = kzalloc(sizeof(*group), GFP_KERNEL); domain = kzalloc(sizeof(*domain), GFP_KERNEL); if (!group || !domain) { @@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, if (ret) goto out_domain; + /* Nesting cap info is available only after attaching */ + if (iommu->nesting) { + int size = sizeof(struct iommu_nesting_info); + + iommu->nesting_info = kzalloc(size, GFP_KERNEL); + if (!iommu->nesting_info) { + ret = -ENOMEM; + goto out_detach; + } + + /* Now get the nesting info */ + iommu->nesting_info->argsz = size; + ret = iommu_domain_get_attr(domain->domain, + DOMAIN_ATTR_NESTING, + iommu->nesting_info); + if (ret) + goto out_detach; + } + /* Get aperture info */ iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo); @@ -2201,6 +2238,7 @@ static int 
vfio_iommu_type1_attach_group(void *iommu_data, return 0; out_detach: + vfio_iommu_release_nesting_info(iommu); vfio_iommu_detach_group(domain, group); out_domain: iommu_domain_free(domain->domain); @@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, vfio_iommu_unmap_unpin_all(iommu); else vfio_iommu_unmap_unpin_reaccount(iommu); + + vfio_iommu_release_nesting_info(iommu); } iommu_domain_free(domain->domain); list_del(&domain->next); @@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu, return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig)); } +static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu, + struct vfio_info_cap *caps) +{ + struct vfio_iommu_type1_info_cap_nesting nesting_cap; + size_t size; + + /* when nesting_info is null, no need to go further */ + if (!iommu->nesting_info) + return 0; + + /* when @format of nesting_info is 0, fail the call */ + if (iommu->nesting_info->format == 0) + return -ENOENT; + + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) + + iommu->nesting_info->argsz; + + nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING; + nesting_cap.header.version = 1; + + memcpy(&nesting_cap.info, iommu->nesting_info, + iommu->nesting_info->argsz); + + return vfio_info_add_capability(caps, &nesting_cap.header, size); +} + static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu, unsigned long arg) { @@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu, if (!ret) ret = vfio_iommu_iova_build_caps(iommu, &caps); + ret = vfio_iommu_add_nesting_cap(iommu, &caps); + mutex_unlock(&iommu->lock); if (ret) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9204705..ff40f9e 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -14,6 +14,7 @@ #include <linux/types.h> #include <linux/ioctl.h> +#include <linux/iommu.h> #define VFIO_API_VERSION 0 @@ -1039,6 +1040,24 @@ 
struct vfio_iommu_type1_info_cap_migration { __u64 max_dirty_bitmap_size; /* in bytes */ }; +/* + * The nesting capability allows to report the related capability + * and info for nesting iommu type. + * + * The structures below define version 1 of this capability. + * + * Nested capabilities should be checked by the userspace after + * setting VFIO_TYPE1_NESTING_IOMMU. + * + * @info: the nesting info provided by IOMMU driver. + */ +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3 + +struct vfio_iommu_type1_info_cap_nesting { + struct vfio_info_cap_header header; + struct iommu_nesting_info info; +}; + #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) /** -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing multiple process virtual address spaces with the device for simplified programming model. PASID is used to tag an virtual address space in DMA requests and to identify the related translation structure in IOMMU. When a PASID-capable device is assigned to a VM, we want the same capability of using PASID to tag guest process virtual address spaces to achieve virtual SVA (vSVA). PASID management for guest is vendor specific. Some vendors (e.g. Intel VT-d) requires system-wide managed PASIDs across all devices, regardless of whether a device is used by host or assigned to guest. Other vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus could be fully delegated to the guest for assigned devices. For system-wide managed PASIDs, this patch introduces a vfio module to handle explicit PASID alloc/free requests from guest. Allocated PASIDs are associated to a process (or, mm_struct) in IOASID core. A vfio_mm object is introduced to track mm_struct. Multiple VFIO containers within a process share the same vfio_mm object. A quota mechanism is provided to prevent malicious user from exhausting available PASIDs. Currently the quota is a global parameter applied to all VFIO devices. In the future per-device quota might be supported too. Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> --- v6 -> v7: *) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger. v5 -> v6: *) address comments from Eric. Add vfio_unlink_pasid() to be consistent with vfio_unlink_dma(). Add a comment in vfio_pasid_exit(). v4 -> v5: *) address comments from Eric Auger. 
*) address the comments from Alex on the pasid free range support. Added per vfio_mm pasid r-b tree. https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/ v3 -> v4: *) fix lock leam in vfio_mm_get_from_task() *) drop pasid_quota field in struct vfio_mm *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when !CONFIG_VFIO_PASID v1 -> v2: *) added in v2, split from the pasid alloc/free support of v1 --- drivers/vfio/Kconfig | 5 + drivers/vfio/Makefile | 1 + drivers/vfio/vfio_pasid.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/vfio.h | 28 ++++++ 4 files changed, 281 insertions(+) create mode 100644 drivers/vfio/vfio_pasid.c diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index fd17db9..3d8a108 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -19,6 +19,11 @@ config VFIO_VIRQFD depends on VFIO && EVENTFD default n +config VFIO_PASID + tristate + depends on IOASID && VFIO + default n + menuconfig VFIO tristate "VFIO Non-Privileged userspace driver framework" depends on IOMMU_API diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index de67c47..bb836a3 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o obj-$(CONFIG_VFIO) += vfio.o obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode 100644 index 0000000..44ecdd5 --- /dev/null +++ b/drivers/vfio/vfio_pasid.c @@ -0,0 +1,247 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Intel Corporation. 
+ * Author: Liu Yi L <yi.l.liu@intel.com> + * + */ + +#include <linux/vfio.h> +#include <linux/file.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/sched/mm.h> + +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Liu Yi L <yi.l.liu@intel.com>" +#define DRIVER_DESC "PASID management for VFIO bus drivers" + +#define VFIO_DEFAULT_PASID_QUOTA 1000 +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA; +module_param_named(pasid_quota, pasid_quota, uint, 0444); +MODULE_PARM_DESC(pasid_quota, + "Set the quota for max number of PASIDs that an application is allowed to request (default 1000)"); + +struct vfio_mm_token { + unsigned long long val; +}; + +struct vfio_mm { + struct kref kref; + struct ioasid_set *ioasid_set; + struct mutex pasid_lock; + struct rb_root pasid_list; + struct list_head next; + struct vfio_mm_token token; +}; + +static struct mutex vfio_mm_lock; +static struct list_head vfio_mm_list; + +struct vfio_pasid { + struct rb_node node; + ioasid_t pasid; +}; + +static void vfio_remove_all_pasids(struct vfio_mm *vmm); + +/* called with vfio.vfio_mm_lock held */ +static void vfio_mm_release(struct kref *kref) +{ + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref); + + list_del(&vmm->next); + mutex_unlock(&vfio_mm_lock); + vfio_remove_all_pasids(vmm); + ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set after allocation? 
+ kfree(vmm); +} + +void vfio_mm_put(struct vfio_mm *vmm) +{ + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); +} + +static void vfio_mm_get(struct vfio_mm *vmm) +{ + kref_get(&vmm->kref); +} + +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) +{ + struct mm_struct *mm = get_task_mm(task); + struct vfio_mm *vmm; + unsigned long long val = (unsigned long long)mm; + int ret; + + mutex_lock(&vfio_mm_lock); + /* Search existing vfio_mm with current mm pointer */ + list_for_each_entry(vmm, &vfio_mm_list, next) { + if (vmm->token.val == val) { + vfio_mm_get(vmm); + goto out; + } + } + + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL); + if (!vmm) { + vmm = ERR_PTR(-ENOMEM); + goto out; + } + + /* + * IOASID core provides a 'IOASID set' concept to track all + * PASIDs associated with a token. Here we use mm_struct as + * the token and create a IOASID set per mm_struct. All the + * containers of the process share the same IOASID set. + */ + vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM); + if (IS_ERR(vmm->ioasid_set)) { + ret = PTR_ERR(vmm->ioasid_set); + kfree(vmm); + vmm = ERR_PTR(ret); + goto out; + } + + kref_init(&vmm->kref); + vmm->token.val = val; + mutex_init(&vmm->pasid_lock); + vmm->pasid_list = RB_ROOT; + + list_add(&vmm->next, &vfio_mm_list); +out: + mutex_unlock(&vfio_mm_lock); + mmput(mm); + return vmm; +} + +/* + * Find PASID within @min and @max + */ +static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm, + ioasid_t min, ioasid_t max) +{ + struct rb_node *node = vmm->pasid_list.rb_node; + + while (node) { + struct vfio_pasid *vid = rb_entry(node, + struct vfio_pasid, node); + + if (max < vid->pasid) + node = node->rb_left; + else if (min > vid->pasid) + node = node->rb_right; + else + return vid; + } + + return NULL; +} + +static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid *new) +{ + struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL; + struct vfio_pasid *vid; + + while 
(*link) { + parent = *link; + vid = rb_entry(parent, struct vfio_pasid, node); + + if (new->pasid <= vid->pasid) + link = &(*link)->rb_left; + else + link = &(*link)->rb_right; + } + + rb_link_node(&new->node, parent, link); + rb_insert_color(&new->node, &vmm->pasid_list); +} + +static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid *old) +{ + rb_erase(&old->node, &vmm->pasid_list); +} + +static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid *vid) +{ + vfio_unlink_pasid(vmm, vid); + ioasid_free(vmm->ioasid_set, vid->pasid); + kfree(vid); +} + +static void vfio_remove_all_pasids(struct vfio_mm *vmm) +{ + struct rb_node *node; + + mutex_lock(&vmm->pasid_lock); + while ((node = rb_first(&vmm->pasid_list))) + vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node)); + mutex_unlock(&vmm->pasid_lock); +} + +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) +{ + ioasid_t pasid; + struct vfio_pasid *vid; + + pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL); + if (pasid == INVALID_IOASID) + return -ENOSPC; + + vid = kzalloc(sizeof(*vid), GFP_KERNEL); + if (!vid) { + ioasid_free(vmm->ioasid_set, pasid); + return -ENOMEM; + } + + vid->pasid = pasid; + + mutex_lock(&vmm->pasid_lock); + vfio_link_pasid(vmm, vid); + mutex_unlock(&vmm->pasid_lock); + + return pasid; +} + +void vfio_pasid_free_range(struct vfio_mm *vmm, + ioasid_t min, ioasid_t max) +{ + struct vfio_pasid *vid = NULL; + + /* + * IOASID core will notify PASID users (e.g. IOMMU driver) to + * teardown necessary structures depending on the to-be-freed + * PASID. 
+ */ + mutex_lock(&vmm->pasid_lock); + while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) + vfio_remove_pasid(vmm, vid); + mutex_unlock(&vmm->pasid_lock); +} + +static int __init vfio_pasid_init(void) +{ + mutex_init(&vfio_mm_lock); + INIT_LIST_HEAD(&vfio_mm_list); + return 0; +} + +static void __exit vfio_pasid_exit(void) +{ + /* + * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1 + * and may be other module. once vfio_pasid_exit() is triggered, + * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed. + * All the vfio_mm instances should have been released. If not, + * means there is vfio_mm leak, should be a bug of user module. + * So just warn here. + */ + WARN_ON(!list_empty(&vfio_mm_list)); +} + +module_init(vfio_pasid_init); +module_exit(vfio_pasid_exit); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR(DRIVER_AUTHOR); +MODULE_DESCRIPTION(DRIVER_DESC); diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 38d3c6a..31472a9 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops); extern void vfio_unregister_iommu_driver( const struct vfio_iommu_driver_ops *ops); +struct vfio_mm; +#if IS_ENABLED(CONFIG_VFIO_PASID) +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task); +extern void vfio_mm_put(struct vfio_mm *vmm); +extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); +extern void vfio_pasid_free_range(struct vfio_mm *vmm, + ioasid_t min, ioasid_t max); +#else +static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) +{ + return ERR_PTR(-ENOTTY); +} + +static inline void vfio_mm_put(struct vfio_mm *vmm) +{ +} + +static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) +{ + return -ENOTTY; +} + +static inline void vfio_pasid_free_range(struct vfio_mm *vmm, + ioasid_t min, ioasid_t max) +{ +} +#endif /* CONFIG_VFIO_PASID */ + /* * 
External user API */ -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
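The alloc/free semantics implemented by vfio_pasid.c above — an allocation succeeds with any free PASID inside an inclusive [min, max] window, and a free tears down every tracked PASID in a window — can be modeled in plain userspace C. The sketch below is hypothetical and invented for illustration (a flat bitmap stands in for the kernel's rb-tree and the IOASID core); it is not the kernel code:

```c
#include <stdbool.h>

/* Model of vfio_pasid_alloc()/vfio_pasid_free_range() semantics.
 * MODEL_MAX_PASID and all names here are invented for this sketch. */
#define MODEL_MAX_PASID 64

static bool pasid_used[MODEL_MAX_PASID];

/* Returns an allocated PASID in [min, max], or -1 if the window is
 * exhausted (mirrors ioasid_alloc() returning INVALID_IOASID, which
 * vfio_pasid_alloc() turns into -ENOSPC). */
static int model_pasid_alloc(unsigned int min, unsigned int max)
{
	for (unsigned int p = min; p <= max && p < MODEL_MAX_PASID; p++) {
		if (!pasid_used[p]) {
			pasid_used[p] = true;
			return (int)p;
		}
	}
	return -1;
}

/* Frees every tracked PASID in the inclusive [min, max] range, like
 * vfio_pasid_free_range(); untracked values are simply skipped. */
static void model_pasid_free_range(unsigned int min, unsigned int max)
{
	for (unsigned int p = min; p <= max && p < MODEL_MAX_PASID; p++)
		pasid_used[p] = false;
}
```

Note the inclusive-range convention on both ends, matching the vfio_find_pasid() comparisons in the patch; the kernel additionally scopes everything to a per-mm ioasid_set, which the model omits.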
From the IOMMU's point of view, PASIDs allocated and managed by external components (e.g. VFIO) are passed in for gpasid bind/unbind operations. The IOMMU needs a way to check PASID ownership, hence add an interface for those components to tell the IOMMU who owns a PASID. In the current kernel design, PASID ownership is tracked by the IOASID set from which the PASID was allocated. This patch adds support for setting the IOASID set on the domains used for nesting/vSVA. Subsequent SVA operations will check the PASID against its IOASID set for proper ownership.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) add a helper function __domain_config_ioasid_set() per Eric's comment.
*) rename the @ioasid_sid field of struct dmar_domain to @pasid_set.
*) Eric gave his r-b against v6, but since this version has changed, his r-b will be sought again for this version.

v5 -> v6:
*) address comments against v5 from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
--- drivers/iommu/intel/iommu.c | 26 ++++++++++++++++++++++++++ include/linux/intel-iommu.h | 4 ++++ include/linux/iommu.h | 1 + 3 files changed, 31 insertions(+) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 5813eea..d1c77fc 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1806,6 +1806,7 @@ static struct dmar_domain *alloc_domain(int flags) if (first_level_by_default()) domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL; domain->has_iotlb_device = false; + domain->pasid_set = host_pasid_set; INIT_LIST_HEAD(&domain->devices); return domain; @@ -6007,6 +6008,22 @@ static bool intel_iommu_is_attach_deferred(struct iommu_domain *domain, return attach_deferred(dev); } +static int __domain_config_ioasid_set(struct dmar_domain *domain, + struct ioasid_set *set) +{ + if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE)) + return -ENODEV; + + if (domain->pasid_set != host_pasid_set && + domain->pasid_set != set) { + pr_warn_ratelimited("multi ioasid_set setting to domain"); + return -EBUSY; + } + + domain->pasid_set = set; + return 0; +} + static int intel_iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) @@ -6030,6 +6047,15 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain, } spin_unlock_irqrestore(&device_domain_lock, flags); break; + case DOMAIN_ATTR_IOASID_SET: + { + struct ioasid_set *set = (struct ioasid_set *)data; + + spin_lock_irqsave(&device_domain_lock, flags); + ret = __domain_config_ioasid_set(dmar_domain, set); + spin_unlock_irqrestore(&device_domain_lock, flags); + break; + } default: ret = -EINVAL; break; diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index d36038e..3345391 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -549,6 +549,10 @@ struct dmar_domain { 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */ u64 max_addr; /* maximum mapped address */ + struct ioasid_set *pasid_set; /* + * the ioasid set which tracks all + * 
PASIDs used by the domain. + */ int default_pasid; /* * The default pasid used for non-SVM * traffic on mediated devices. diff --git a/include/linux/iommu.h b/include/linux/iommu.h index c364b1c..5b9f630 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -118,6 +118,7 @@ enum iommu_attr { DOMAIN_ATTR_FSL_PAMUV1, DOMAIN_ATTR_NESTING, /* two stages of translation */ DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, + DOMAIN_ATTR_IOASID_SET, DOMAIN_ATTR_MAX, }; -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
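The decision logic of __domain_config_ioasid_set() in the patch above can be exercised standalone: a nesting domain starts out owned by the host set and may afterwards be bound to exactly one external ioasid_set. This is a userspace model of those rules, not the kernel code; the names model_domain, HOST_PASID_SET and the int stand-ins for the ioasid_set pointers are invented for the sketch:

```c
#include <errno.h>

#define FLAG_NESTING_MODE 0x1	/* stands in for DOMAIN_FLAG_NESTING_MODE */
#define HOST_PASID_SET    0	/* stands in for host_pasid_set */

struct model_domain {
	int flags;
	int pasid_set;		/* stands in for struct ioasid_set * */
};

/* Mirrors __domain_config_ioasid_set(): only nesting domains carry a
 * user set, and a domain already bound to a different set is busy. */
static int model_config_ioasid_set(struct model_domain *d, int set)
{
	if (!(d->flags & FLAG_NESTING_MODE))
		return -ENODEV;

	if (d->pasid_set != HOST_PASID_SET && d->pasid_set != set)
		return -EBUSY;

	d->pasid_set = set;
	return 0;
}
```

Setting the same set twice is idempotent, which is what lets every container of one process (all sharing one per-mm IOASID set) configure the same domain without error.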
This patch is to address a REVISIT. As ioasid_set is added to domain, upper layer/VFIO can set ioasid_set to iommu driver, and track the PASID ownership, so no need to get_task_mm() in intel_svm_bind_gpasid(). Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> --- drivers/iommu/intel/svm.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 519eabb..d3cf52b 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -400,12 +400,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, ret = -ENOMEM; goto out; } - /* REVISIT: upper layer/VFIO can track host process that bind - * the PASID. ioasid_set = mm might be sufficient for vfio to - * check pasid VMM ownership. We can drop the following line - * once VFIO and IOASID set check is in place. - */ - svm->mm = get_task_mm(current); svm->pasid = data->hpasid; if (data->flags & IOMMU_SVA_GPASID_VAL) { svm->gpasid = data->gpasid; @@ -420,7 +414,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, INIT_WORK(&svm->work, intel_svm_free_async_fn); ioasid_attach_data(data->hpasid, svm); INIT_LIST_HEAD_RCU(&svm->devs); - mmput(svm->mm); } sdev = kzalloc(sizeof(*sdev), GFP_KERNEL); if (!sdev) { -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
This patch allows userspace to request PASID allocation/free, e.g. when serving requests from the guest. PASIDs that are not freed by userspace are automatically freed when the IOASID set is destroyed on process exit.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) current VFIO returns the allocated PASID via a signed int, thus the VFIO UAPI can only support 31-bit PASIDs. If userspace gives a min,max range wider than 31 bits, the allocation or free request should fail.

v5 -> v6:
*) address comments from Eric against v5; remove the alloc/free helper.

v4 -> v5:
*) address comments from Eric Auger.
*) the comments on the PASID_FREE request are addressed in patch 5/15 of this series.

v3 -> v4:
*) address comments from v3, except the below comment against the range of the PASID_FREE request; needs more help on it.
   "> +if (req.range.min > req.range.max)
   Is it exploitable that a user can spin the kernel for a long time in the case of a free by calling this with [0, MAX_UINT] regardless of their actual allocations?"
https://lore.kernel.org/linux-iommu/20200702151832.048b44d1@x1.home/ v1 -> v2: *) move the vfio_mm related code to be a seprate module *) use a single structure for alloc/free, could support a range of PASIDs *) fetch vfio_mm at group_attach time instead of at iommu driver open time --- drivers/vfio/Kconfig | 1 + drivers/vfio/vfio_iommu_type1.c | 76 +++++++++++++++++++++++++++++++++++++++++ drivers/vfio/vfio_pasid.c | 10 ++++++ include/linux/vfio.h | 6 ++++ include/uapi/linux/vfio.h | 43 +++++++++++++++++++++++ 5 files changed, 136 insertions(+) diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 3d8a108..95d90c6 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -2,6 +2,7 @@ config VFIO_IOMMU_TYPE1 tristate depends on VFIO + select VFIO_PASID if (X86) default n config VFIO_IOMMU_SPAPR_TCE diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 3c0048b..bd4b668 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -76,6 +76,7 @@ struct vfio_iommu { bool dirty_page_tracking; bool pinned_page_dirty_scope; struct iommu_nesting_info *nesting_info; + struct vfio_mm *vmm; }; struct vfio_domain { @@ -2000,6 +2001,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) { + if (iommu->vmm) { + vfio_mm_put(iommu->vmm); + iommu->vmm = NULL; + } + kfree(iommu->nesting_info); iommu->nesting_info = NULL; } @@ -2127,6 +2133,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, iommu->nesting_info); if (ret) goto out_detach; + + if (iommu->nesting_info->features & + IOMMU_NESTING_FEAT_SYSWIDE_PASID) { + struct vfio_mm *vmm; + struct ioasid_set *set; + + vmm = vfio_mm_get_from_task(current); + if (IS_ERR(vmm)) { + ret = PTR_ERR(vmm); + goto out_detach; + } + iommu->vmm = vmm; + + set = vfio_mm_ioasid_set(vmm); + ret = iommu_domain_set_attr(domain->domain, + DOMAIN_ATTR_IOASID_SET, + set); + if (ret) + 
goto out_detach; + } } /* Get aperture info */ @@ -2908,6 +2934,54 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, return -EINVAL; } +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu, + unsigned long arg) +{ + struct vfio_iommu_type1_pasid_request req; + unsigned long minsz; + int ret; + + minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range); + + if (copy_from_user(&req, (void __user *)arg, minsz)) + return -EFAULT; + + if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK)) + return -EINVAL; + + /* + * Current VFIO_IOMMU_PASID_REQUEST only supports at most + * 31 bits PASID. The min,max value from userspace should + * not exceed 31 bits. + */ + if (req.range.min > req.range.max || + req.range.min > (1 << VFIO_IOMMU_PASID_BITS) || + req.range.max > (1 << VFIO_IOMMU_PASID_BITS)) + return -EINVAL; + + mutex_lock(&iommu->lock); + if (!iommu->vmm) { + mutex_unlock(&iommu->lock); + return -EOPNOTSUPP; + } + + switch (req.flags & VFIO_PASID_REQUEST_MASK) { + case VFIO_IOMMU_FLAG_ALLOC_PASID: + ret = vfio_pasid_alloc(iommu->vmm, req.range.min, + req.range.max); + break; + case VFIO_IOMMU_FLAG_FREE_PASID: + vfio_pasid_free_range(iommu->vmm, req.range.min, + req.range.max); + ret = 0; + break; + default: + ret = -EINVAL; + } + mutex_unlock(&iommu->lock); + return ret; +} + static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { @@ -2924,6 +2998,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return vfio_iommu_type1_unmap_dma(iommu, arg); case VFIO_IOMMU_DIRTY_PAGES: return vfio_iommu_type1_dirty_pages(iommu, arg); + case VFIO_IOMMU_PASID_REQUEST: + return vfio_iommu_type1_pasid_request(iommu, arg); default: return -ENOTTY; } diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c index 44ecdd5..0ec4660 100644 --- a/drivers/vfio/vfio_pasid.c +++ b/drivers/vfio/vfio_pasid.c @@ -60,6 +60,7 @@ void vfio_mm_put(struct vfio_mm *vmm) { kref_put_mutex(&vmm->kref, 
vfio_mm_release, &vfio_mm_lock); } +EXPORT_SYMBOL_GPL(vfio_mm_put); static void vfio_mm_get(struct vfio_mm *vmm) { @@ -113,6 +114,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) mmput(mm); return vmm; } +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task); + +struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) +{ + return vmm->ioasid_set; +} +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_set); /* * Find PASID within @min and @max @@ -201,6 +209,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) return pasid; } +EXPORT_SYMBOL_GPL(vfio_pasid_alloc); void vfio_pasid_free_range(struct vfio_mm *vmm, ioasid_t min, ioasid_t max) @@ -217,6 +226,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, vfio_remove_pasid(vmm, vid); mutex_unlock(&vmm->pasid_lock); } +EXPORT_SYMBOL_GPL(vfio_pasid_free_range); static int __init vfio_pasid_init(void) { diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 31472a9..5c3d7a8 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -101,6 +101,7 @@ struct vfio_mm; #if IS_ENABLED(CONFIG_VFIO_PASID) extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task); extern void vfio_mm_put(struct vfio_mm *vmm); +extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm); extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); extern void vfio_pasid_free_range(struct vfio_mm *vmm, ioasid_t min, ioasid_t max); @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm) { } +static inline struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) +{ + return -ENOTTY; +} + static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) { return -ENOTTY; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index ff40f9e..a4bc42e 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1172,6 +1172,49 @@ struct vfio_iommu_type1_dirty_bitmap_get { #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17) +/** + * 
VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18, + * struct vfio_iommu_type1_pasid_request) + * + * PASID (Processor Address Space ID) is a PCIe concept for tagging + * address spaces in DMA requests. When system-wide PASID allocation + * is required by the underlying iommu driver (e.g. Intel VT-d), this + * provides an interface for userspace to request pasid alloc/free + * for its assigned devices. Userspace should check the availability + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through + * VFIO_IOMMU_GET_INFO. + * + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range. + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range. + * @range is [min, max], which means both @min and @max are inclusive. + * ALLOC_PASID and FREE_PASID are mutually exclusive. + * + * Current interface supports at most 31 bits PASID bits as returning + * PASID allocation result via signed int. PCIe spec defines 20 bits + * for PASID width, so 31 bits is enough. As a result user space should + * provide min, max no more than 31 bits. + * returns: allocated PASID value on success, -errno on failure for + * ALLOC_PASID; + * 0 for FREE_PASID operation; + */ +struct vfio_iommu_type1_pasid_request { + __u32 argsz; +#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0) +#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1) + __u32 flags; + struct { + __u32 min; + __u32 max; + } range; +}; + +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \ + VFIO_IOMMU_FLAG_FREE_PASID) + +#define VFIO_IOMMU_PASID_BITS 31 + +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) + /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */ /* -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
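From userspace, the new ioctl is issued against the container fd as ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req). The sketch below reproduces the uAPI additions from the patch locally (using fixed-width stdint types in place of __u32) and models the sanity checks vfio_iommu_type1_pasid_request() applies before reaching the alloc/free paths; pasid_request_check() is an invented name, and 1UL is used for the shift so the model itself is well-defined:

```c
#include <stdint.h>
#include <errno.h>

/* Local copies of the uAPI additions, so this builds without the
 * patched <linux/vfio.h>. */
#define VFIO_IOMMU_FLAG_ALLOC_PASID	(1u << 0)
#define VFIO_IOMMU_FLAG_FREE_PASID	(1u << 1)
#define VFIO_PASID_REQUEST_MASK		(VFIO_IOMMU_FLAG_ALLOC_PASID | \
					 VFIO_IOMMU_FLAG_FREE_PASID)
#define VFIO_IOMMU_PASID_BITS		31

struct vfio_iommu_type1_pasid_request {
	uint32_t argsz;
	uint32_t flags;
	struct {
		uint32_t min;	/* inclusive */
		uint32_t max;	/* inclusive */
	} range;
};

/* Mirrors the kernel's flag and 31-bit range validation: the PASID is
 * returned through a signed int, so wider ranges are rejected. */
static int pasid_request_check(const struct vfio_iommu_type1_pasid_request *req)
{
	if (req->flags & ~VFIO_PASID_REQUEST_MASK)
		return -EINVAL;
	if (req->range.min > req->range.max ||
	    req->range.min > (1UL << VFIO_IOMMU_PASID_BITS) ||
	    req->range.max > (1UL << VFIO_IOMMU_PASID_BITS))
		return -EINVAL;
	return 0;
}
```

A caller would fill argsz with sizeof(req), pick exactly one of the alloc/free flags, and set the inclusive range; on success the ioctl returns the allocated PASID for ALLOC and 0 for FREE, as documented above.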
From: Yi Sun <yi.y.sun@intel.com>

The current interface is good enough for SVA virtualization on an assigned physical PCI device, but when it comes to mediated devices, a physical device may be attached to multiple aux-domains. Also, for guest unbind, the PASID to be unbound must have been allocated to the VM. This check requires knowing the ioasid_set associated with the domain, so this interface needs to pass in domain info. The iommu driver is then able to know which domain will be used for the 2nd-stage translation of the nesting mode and is also able to do the PASID ownership check. This patch passes @domain for the above reasons.

Also, the prototype of @pasid is changed from "int" to "u32" per the below link.

[PATCH v6 01/12] iommu: Change type of pasid to u32
https://lore.kernel.org/linux-iommu/1594684087-61184-2-git-send-email-fenghua.yu@intel.com/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Sun <yi.y.sun@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
v6 -> v7:
*) correct the link for the details of modifying the pasid prototype to be "u32".
*) hold off r-b from Eric Auger as there are modifications in this patch; will seek r-b for this version.

v5 -> v6:
*) use "u32" prototype for @pasid.
*) add review-by from Eric Auger.

v2 -> v3:
*) pass in domain info only
*) use u32 for pasid instead of int type

v1 -> v2:
*) added in v2.
--- drivers/iommu/intel/svm.c | 3 ++- drivers/iommu/iommu.c | 2 +- include/linux/intel-iommu.h | 3 ++- include/linux/iommu.h | 3 ++- 4 files changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index d3cf52b..d39fafb 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -476,7 +476,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, return ret; } -int intel_svm_unbind_gpasid(struct device *dev, int pasid) +int intel_svm_unbind_gpasid(struct iommu_domain *domain, + struct device *dev, u32 pasid) { struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL); struct intel_svm_dev *sdev; diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3bc263a..52aabb64 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2155,7 +2155,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev, if (unlikely(!domain->ops->sva_unbind_gpasid)) return -ENODEV; - return domain->ops->sva_unbind_gpasid(dev, data->hpasid); + return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid); } EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid); diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 3345391..ce0b33b 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -741,7 +741,8 @@ extern int intel_svm_enable_prq(struct intel_iommu *iommu); extern int intel_svm_finish_prq(struct intel_iommu *iommu); int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, struct iommu_gpasid_bind_data *data); -int intel_svm_unbind_gpasid(struct device *dev, int pasid); +int intel_svm_unbind_gpasid(struct iommu_domain *domain, + struct device *dev, u32 pasid); struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata); void intel_svm_unbind(struct iommu_sva *handle); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 5b9f630..d561448 100644 --- 
a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -297,7 +297,8 @@ struct iommu_ops { int (*sva_bind_gpasid)(struct iommu_domain *domain, struct device *dev, struct iommu_gpasid_bind_data *data); - int (*sva_unbind_gpasid)(struct device *dev, int pasid); + int (*sva_unbind_gpasid)(struct iommu_domain *domain, + struct device *dev, u32 pasid); int (*def_domain_type)(struct device *dev); -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
When an IOMMU domain with nesting attribute is used for guest SVA, a system-wide PASID is allocated for binding with the device and the domain. For security reason, we need to check the PASID passed from user-space. e.g. page table bind/unbind and PASID related cache invalidation. Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> --- v6 -> v7: *) acquire device_domain_lock in bind/unbind_gpasid() to ensure dmar_domain is not modified during bind/unbind_gpasid(). *) the change to svm.c varies from previous version as Jacob refactored the svm.c code. --- drivers/iommu/intel/iommu.c | 29 +++++++++++++++++++++++++---- drivers/iommu/intel/svm.c | 33 ++++++++++++++++++++++++--------- include/linux/intel-iommu.h | 2 ++ 3 files changed, 51 insertions(+), 13 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d1c77fc..95740b9 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5451,6 +5451,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev, int granu = 0; u64 pasid = 0; u64 addr = 0; + void *pdata; granu = to_vtd_granularity(cache_type, inv_info->granularity); if (granu == -EINVAL) { @@ -5470,6 +5471,15 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev, (inv_info->granu.addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)) pasid = inv_info->granu.addr_info.pasid; + pdata = ioasid_find(dmar_domain->pasid_set, pasid, NULL); + if (!pdata) { + ret = -EINVAL; + goto out_unlock; + } else if (IS_ERR(pdata)) { + ret = PTR_ERR(pdata); + goto out_unlock; + } + switch (BIT(cache_type)) { case IOMMU_CACHE_INV_TYPE_IOTLB: /* HW 
will ignore LSB bits based on address mask */ @@ -5787,12 +5797,14 @@ static void intel_iommu_get_resv_regions(struct device *device, list_add_tail(®->list, head); } -int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev) +/* + * Caller should have held device_domain_lock + */ +int intel_iommu_enable_pasid_locked(struct intel_iommu *iommu, struct device *dev) { struct device_domain_info *info; struct context_entry *context; struct dmar_domain *domain; - unsigned long flags; u64 ctx_lo; int ret; @@ -5800,7 +5812,6 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev) if (!domain) return -EINVAL; - spin_lock_irqsave(&device_domain_lock, flags); spin_lock(&iommu->lock); ret = -EINVAL; @@ -5833,11 +5844,21 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev) out: spin_unlock(&iommu->lock); - spin_unlock_irqrestore(&device_domain_lock, flags); return ret; } +int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&device_domain_lock, flags); + ret = intel_iommu_enable_pasid_locked(iommu, dev); + spin_unlock_irqrestore(&device_domain_lock, flags); + return ret; +} + static void intel_iommu_apply_resv_region(struct device *dev, struct iommu_domain *domain, struct iommu_resv_region *region) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index d39fafb..80f58ab 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -293,7 +293,9 @@ static LIST_HEAD(global_svm_list); list_for_each_entry((sdev), &(svm)->devs, list) \ if ((d) != (sdev)->dev) {} else -static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid, +static int pasid_to_svm_sdev(struct device *dev, + struct ioasid_set *set, + unsigned int pasid, struct intel_svm **rsvm, struct intel_svm_dev **rsdev) { @@ -307,7 +309,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid, if (pasid == INVALID_IOASID || 
pasid >= PASID_MAX) return -EINVAL; - svm = ioasid_find(NULL, pasid, NULL); + svm = ioasid_find(set, pasid, NULL); if (IS_ERR(svm)) return PTR_ERR(svm); @@ -344,6 +346,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, struct intel_svm_dev *sdev = NULL; struct dmar_domain *dmar_domain; struct intel_svm *svm = NULL; + unsigned long flags; int ret = 0; if (WARN_ON(!iommu) || !data) @@ -377,7 +380,9 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, dmar_domain = to_dmar_domain(domain); mutex_lock(&pasid_mutex); - ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev); + spin_lock_irqsave(&device_domain_lock, flags); + ret = pasid_to_svm_sdev(dev, dmar_domain->pasid_set, + data->hpasid, &svm, &sdev); if (ret) goto out; @@ -395,7 +400,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, if (!svm) { /* We come here when PASID has never been bond to a device. */ - svm = kzalloc(sizeof(*svm), GFP_KERNEL); + svm = kzalloc(sizeof(*svm), GFP_ATOMIC); if (!svm) { ret = -ENOMEM; goto out; @@ -415,7 +420,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, ioasid_attach_data(data->hpasid, svm); INIT_LIST_HEAD_RCU(&svm->devs); } - sdev = kzalloc(sizeof(*sdev), GFP_KERNEL); + sdev = kzalloc(sizeof(*sdev), GFP_ATOMIC); if (!sdev) { ret = -ENOMEM; goto out; @@ -427,7 +432,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, sdev->users = 1; /* Set up device context entry for PASID if not enabled already */ - ret = intel_iommu_enable_pasid(iommu, sdev->dev); + ret = intel_iommu_enable_pasid_locked(iommu, sdev->dev); if (ret) { dev_err_ratelimited(dev, "Failed to enable PASID capability\n"); kfree(sdev); @@ -462,6 +467,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev, init_rcu_head(&sdev->rcu); list_add_rcu(&sdev->list, &svm->devs); out: + spin_unlock_irqrestore(&device_domain_lock, flags); if (!IS_ERR_OR_NULL(svm) && 
list_empty(&svm->devs)) { ioasid_attach_data(data->hpasid, NULL); kfree(svm); @@ -480,15 +486,22 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain, struct device *dev, u32 pasid) { struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL); + struct dmar_domain *dmar_domain; struct intel_svm_dev *sdev; struct intel_svm *svm; + unsigned long flags; int ret; if (WARN_ON(!iommu)) return -EINVAL; + dmar_domain = to_dmar_domain(domain); + mutex_lock(&pasid_mutex); - ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev); + spin_lock_irqsave(&device_domain_lock, flags); + ret = pasid_to_svm_sdev(dev, dmar_domain->pasid_set, + pasid, &svm, &sdev); + spin_unlock_irqrestore(&device_domain_lock, flags); if (ret) goto out; @@ -712,7 +725,8 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid) if (!iommu) goto out; - ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev); + ret = pasid_to_svm_sdev(dev, host_pasid_set, + pasid, &svm, &sdev); if (ret) goto out; @@ -1204,7 +1218,8 @@ int intel_svm_page_response(struct device *dev, goto out; } - ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev); + ret = pasid_to_svm_sdev(dev, host_pasid_set, + prm->pasid, &svm, &sdev); if (ret || !sdev) { ret = -ENODEV; goto out; diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index ce0b33b..db7fc59 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -730,6 +730,8 @@ struct intel_iommu *domain_get_iommu(struct dmar_domain *domain); int for_each_device_domain(int (*fn)(struct device_domain_info *info, void *data), void *data); void iommu_flush_write_buffer(struct intel_iommu *iommu); +int intel_iommu_enable_pasid_locked(struct intel_iommu *iommu, + struct device *dev); int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev); struct dmar_domain *find_domain(struct device *dev); struct device_domain_info *get_domain_info(struct device *dev); -- 2.7.4 _______________________________________________ iommu mailing list 
iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
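The effect of scoping pasid_to_svm_sdev() by the domain's pasid_set is that a PASID allocated from another VM's IOASID set is simply not found, so its bind/unbind/invalidation requests fail the ownership check. A hypothetical userspace model of that lookup behavior (invented names, plain integers standing in for struct ioasid_set pointers — not the kernel code):

```c
#include <stdbool.h>

struct model_ioasid {
	int set_id;		/* stands in for struct ioasid_set * */
	unsigned int pasid;
	bool used;
};

#define MODEL_TABLE_SZ 8
static struct model_ioasid model_table[MODEL_TABLE_SZ];

/* Records that @pasid was allocated from @set_id; -1 if the table
 * is full. */
static int model_ioasid_alloc(int set_id, unsigned int pasid)
{
	for (int i = 0; i < MODEL_TABLE_SZ; i++) {
		if (!model_table[i].used) {
			model_table[i].set_id = set_id;
			model_table[i].pasid = pasid;
			model_table[i].used = true;
			return 0;
		}
	}
	return -1;
}

/* Like ioasid_find(set, pasid, NULL): hits only when @pasid was
 * allocated from @set_id, which is the ownership check itself. */
static bool model_ioasid_find(int set_id, unsigned int pasid)
{
	for (int i = 0; i < MODEL_TABLE_SZ; i++) {
		if (model_table[i].used &&
		    model_table[i].set_id == set_id &&
		    model_table[i].pasid == pasid)
			return true;
	}
	return false;
}
```

This is why the patch can drop the NULL (global) set from the svm.c lookups: a guest can only operate on PASIDs that its own domain's set actually contains.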
Nesting translation allows two levels/stages of page tables, with the 1st level for guest translations (e.g. GVA->GPA) and the 2nd level for host translations (e.g. GPA->HPA). This patch adds an interface for binding guest page tables to a PASID. This PASID must have been allocated by userspace before the binding request.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) introduced @user in struct domain_capsule to simplify the code per Eric's suggestion.
*) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
*) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().

v5 -> v6:
*) dropped vfio_find_nesting_group() and added vfio_get_nesting_domain_capsule() per comment from Eric.
*) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in linux/iommu.h for userspace operation and in-kernel operation.

v3 -> v4:
*) address comments from Alex on v3

v2 -> v3:
*) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun.pan@linux.intel.com/

v1 -> v2:
*) rename subject from "vfio/type1: Bind guest page tables to host"
*) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/unbind guest page table
*) replaced vfio_iommu_for_each_dev() with a group level loop since this series enforces one group per container w/ nesting type as start.
*) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn() *) vfio_dev_unbind_gpasid() always successful *) use vfio_mm->pasid_lock to avoid race between PASID free and page table bind/unbind --- drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++++ drivers/vfio/vfio_pasid.c | 26 +++++++ include/linux/vfio.h | 20 +++++ include/uapi/linux/vfio.h | 36 +++++++++ 4 files changed, 245 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index bd4b668..11f1156 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -149,6 +149,39 @@ struct vfio_regions { #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) #define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) +struct domain_capsule { + struct vfio_group *group; + struct iommu_domain *domain; + /* set if @data contains a user pointer*/ + bool user; + void *data; +}; + +/* iommu->lock must be held */ +static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu, + struct domain_capsule *dc) +{ + struct vfio_domain *domain = NULL; + struct vfio_group *group = NULL; + + if (!iommu->nesting_info) + return -EINVAL; + + /* + * Only support singleton container with nesting type. If + * nesting_info is non-NULL, the container is non-empty. + * Also domain is non-empty. 
+ */ + domain = list_first_entry(&iommu->domain_list, + struct vfio_domain, next); + group = list_first_entry(&domain->group_list, + struct vfio_group, next); + dc->group = group; + dc->domain = domain->domain; + dc->user = true; + return 0; +} + static int put_pfn(unsigned long pfn, int prot); static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu, @@ -2405,6 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu, return ret; } +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data) +{ + struct domain_capsule *dc = (struct domain_capsule *)data; + unsigned long arg = *(unsigned long *)dc->data; + + return iommu_uapi_sva_bind_gpasid(dc->domain, dev, + (void __user *)arg); +} + +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data) +{ + struct domain_capsule *dc = (struct domain_capsule *)data; + + if (dc->user) { + unsigned long arg = *(unsigned long *)dc->data; + + iommu_uapi_sva_unbind_gpasid(dc->domain, + dev, (void __user *)arg); + } else { + struct iommu_gpasid_bind_data *unbind_data = + (struct iommu_gpasid_bind_data *)dc->data; + + iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data); + } + return 0; +} + +static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data) +{ + struct domain_capsule *dc = (struct domain_capsule *)data; + struct iommu_gpasid_bind_data unbind_data; + + unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data); + unbind_data.flags = 0; + unbind_data.hpasid = pasid; + + dc->user = false; + dc->data = &unbind_data; + + iommu_group_for_each_dev(dc->group->iommu_group, + dc, vfio_dev_unbind_gpasid_fn); +} + static void vfio_iommu_type1_detach_group(void *iommu_data, struct iommu_group *iommu_group) { @@ -2448,6 +2524,20 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, if (!group) continue; + if (iommu->vmm && (iommu->nesting_info->features & + IOMMU_NESTING_FEAT_BIND_PGTBL)) { + struct domain_capsule dc = { .group = group, + .domain = domain->domain, 
+ .data = NULL }; + + /* + * Unbind page tables bound with system wide PASIDs + * which are allocated to userspace. + */ + vfio_mm_for_each_pasid(iommu->vmm, &dc, + vfio_group_unbind_gpasid_fn); + } + vfio_iommu_detach_group(domain, group); update_dirty_scope = !group->pinned_page_dirty_scope; list_del(&group->next); @@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu, return ret; } +static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu, + bool is_bind, unsigned long arg) +{ + struct domain_capsule dc = { .data = &arg }; + struct iommu_nesting_info *info; + int ret; + + mutex_lock(&iommu->lock); + + info = iommu->nesting_info; + if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) { + ret = -EOPNOTSUPP; + goto out_unlock; + } + + if (!iommu->vmm) { + ret = -EINVAL; + goto out_unlock; + } + + ret = vfio_prepare_nesting_domain_capsule(iommu, &dc); + if (ret) + goto out_unlock; + + /* Avoid race with other containers within the same process */ + vfio_mm_pasid_lock(iommu->vmm); + + if (is_bind) + ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc, + vfio_dev_bind_gpasid_fn); + if (ret || !is_bind) + iommu_group_for_each_dev(dc.group->iommu_group, + &dc, vfio_dev_unbind_gpasid_fn); + + vfio_mm_pasid_unlock(iommu->vmm); +out_unlock: + mutex_unlock(&iommu->lock); + return ret; +} + +static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu, + unsigned long arg) +{ + struct vfio_iommu_type1_nesting_op hdr; + unsigned int minsz; + int ret; + + minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags); + + if (copy_from_user(&hdr, (void __user *)arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz || + hdr.flags & ~VFIO_NESTING_OP_MASK || + (hdr.flags & VFIO_NESTING_OP_MASK) >= VFIO_IOMMU_NESTING_OP_NUM) + return -EINVAL; + + switch (hdr.flags & VFIO_NESTING_OP_MASK) { + case VFIO_IOMMU_NESTING_OP_BIND_PGTBL: + ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz); + break; + case 
VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL: + ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz); + break; + default: + ret = -EINVAL; + } + + return ret; +} + static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { @@ -3000,6 +3161,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return vfio_iommu_type1_dirty_pages(iommu, arg); case VFIO_IOMMU_PASID_REQUEST: return vfio_iommu_type1_pasid_request(iommu, arg); + case VFIO_IOMMU_NESTING_OP: + return vfio_iommu_type1_nesting_op(iommu, arg); default: return -ENOTTY; } diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c index 0ec4660..9e2e4b0 100644 --- a/drivers/vfio/vfio_pasid.c +++ b/drivers/vfio/vfio_pasid.c @@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, * IOASID core will notify PASID users (e.g. IOMMU driver) to * teardown necessary structures depending on the to-be-freed * PASID. + * Hold pasid_lock also avoids race with PASID usages like bind/ + * unbind page tables to requested PASID. 
*/ mutex_lock(&vmm->pasid_lock); while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) @@ -228,6 +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, } EXPORT_SYMBOL_GPL(vfio_pasid_free_range); +int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, + void (*fn)(ioasid_t id, void *data)) +{ + int ret; + + mutex_lock(&vmm->pasid_lock); + ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data); + mutex_unlock(&vmm->pasid_lock); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid); + +void vfio_mm_pasid_lock(struct vfio_mm *vmm) +{ + mutex_lock(&vmm->pasid_lock); +} +EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock); + +void vfio_mm_pasid_unlock(struct vfio_mm *vmm) +{ + mutex_unlock(&vmm->pasid_lock); +} +EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock); + static int __init vfio_pasid_init(void) { mutex_init(&vfio_mm_lock); diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 5c3d7a8..6a999c3 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -105,6 +105,11 @@ extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm); extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); extern void vfio_pasid_free_range(struct vfio_mm *vmm, ioasid_t min, ioasid_t max); +extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, + void (*fn)(ioasid_t id, void *data)); +extern void vfio_mm_pasid_lock(struct vfio_mm *vmm); +extern void vfio_mm_pasid_unlock(struct vfio_mm *vmm); + #else static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) { @@ -129,6 +134,21 @@ static inline void vfio_pasid_free_range(struct vfio_mm *vmm, ioasid_t min, ioasid_t max) { } + +static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, + void (*fn)(ioasid_t id, void *data)) +{ + return -ENOTTY; +} + +static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm) +{ +} + +static inline void vfio_mm_pasid_unlock(struct vfio_mm *vmm) +{ +} + #endif /* CONFIG_VFIO_PASID */ /* diff --git a/include/uapi/linux/vfio.h 
b/include/uapi/linux/vfio.h index a4bc42e..a99bd71 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request { #define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) +/** + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19, + * struct vfio_iommu_type1_nesting_op) + * + * This interface allows userspace to utilize the nesting IOMMU + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING + * cap through VFIO_IOMMU_GET_INFO. For platforms which require + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID + * _REQUEST. + * + * @data[] types defined for each op: + * +=================+===============================================+ + * | NESTING OP | @data[] | + * +=================+===============================================+ + * | BIND_PGTBL | struct iommu_gpasid_bind_data | + * +-----------------+-----------------------------------------------+ + * | UNBIND_PGTBL | struct iommu_gpasid_bind_data | + * +-----------------+-----------------------------------------------+ + * + * returns: 0 on success, -errno on failure. + */ +struct vfio_iommu_type1_nesting_op { + __u32 argsz; + __u32 flags; +#define VFIO_NESTING_OP_MASK (0xffff) /* lower 16-bits for op */ + __u8 data[]; +}; + +enum { + VFIO_IOMMU_NESTING_OP_BIND_PGTBL, + VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL, + VFIO_IOMMU_NESTING_OP_NUM, +}; + +#define VFIO_IOMMU_NESTING_OP _IO(VFIO_TYPE, VFIO_BASE + 19) + /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */ /* -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
This patch provides an interface allowing the userspace to invalidate IOMMU cache for first-level page table. It is required when the first level IOMMU page table is not managed by the host kernel in the nested translation setup. Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> --- v1 -> v2: *) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type" *) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn() *) vfio_dev_cache_inv_fn() always successful *) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP --- drivers/vfio/vfio_iommu_type1.c | 38 ++++++++++++++++++++++++++++++++++++++ include/uapi/linux/vfio.h | 3 +++ 2 files changed, 41 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 11f1156..b67ce2d 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3112,6 +3112,41 @@ static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu, return ret; } +static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data) +{ + struct domain_capsule *dc = (struct domain_capsule *)data; + unsigned long arg = *(unsigned long *)dc->data; + + iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg); + return 0; +} + +static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu, + unsigned long arg) +{ + struct domain_capsule dc = { .data = &arg }; + struct iommu_nesting_info *info; + int ret; + + mutex_lock(&iommu->lock); + info = iommu->nesting_info; + if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) { + ret = -EOPNOTSUPP; + goto 
out_unlock; + } + + ret = vfio_prepare_nesting_domain_capsule(iommu, &dc); + if (ret) + goto out_unlock; + + iommu_group_for_each_dev(dc.group->iommu_group, &dc, + vfio_dev_cache_invalidate_fn); + +out_unlock: + mutex_unlock(&iommu->lock); + return ret; +} + static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu, unsigned long arg) { @@ -3136,6 +3171,9 @@ static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu, case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL: ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz); break; + case VFIO_IOMMU_NESTING_OP_CACHE_INVLD: + ret = vfio_iommu_invalidate_cache(iommu, arg + minsz); + break; default: ret = -EINVAL; } diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index a99bd71..a09a407 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1233,6 +1233,8 @@ struct vfio_iommu_type1_pasid_request { * +-----------------+-----------------------------------------------+ * | UNBIND_PGTBL | struct iommu_gpasid_bind_data | * +-----------------+-----------------------------------------------+ + * | CACHE_INVLD | struct iommu_cache_invalidate_info | + * +-----------------+-----------------------------------------------+ * * returns: 0 on success, -errno on failure. */ @@ -1246,6 +1248,7 @@ struct vfio_iommu_type1_nesting_op { enum { VFIO_IOMMU_NESTING_OP_BIND_PGTBL, VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL, + VFIO_IOMMU_NESTING_OP_CACHE_INVLD, VFIO_IOMMU_NESTING_OP_NUM, }; -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
In recent years, the mediated device pass-through framework (e.g. vfio-mdev) has been used to achieve flexible device sharing across domains (e.g. VMs). There are also hardware-assisted mediated pass-through solutions from platform vendors, e.g. Intel VT-d scalable mode, which supports the Intel Scalable I/O Virtualization technology. Such mdevs are called IOMMU-backed mdevs, as there is IOMMU-enforced DMA isolation for them. In the kernel, IOMMU-backed mdevs are exposed to the IOMMU layer via the aux-domain concept, which means mdevs are protected by an iommu domain that is auxiliary to the domain the kernel driver primarily uses for the DMA API. Details can be found in the presentation below:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch extends the NESTING_IOMMU ops to IOMMU-backed mdev devices. The main requirement is to use the auxiliary domain associated with the mdev.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
CC: Jun Tian <jun.j.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v5 -> v6:
*) add Reviewed-by from Eric Auger.
v1 -> v2: *) check the iommu_device to ensure the handling mdev is IOMMU-backed --- drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index b67ce2d..5cef732 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -2438,29 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu, return ret; } +static struct device *vfio_get_iommu_device(struct vfio_group *group, + struct device *dev) +{ + if (group->mdev_group) + return vfio_mdev_get_iommu_device(dev); + else + return dev; +} + static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data) { struct domain_capsule *dc = (struct domain_capsule *)data; unsigned long arg = *(unsigned long *)dc->data; + struct device *iommu_device; + + iommu_device = vfio_get_iommu_device(dc->group, dev); + if (!iommu_device) + return -EINVAL; - return iommu_uapi_sva_bind_gpasid(dc->domain, dev, + return iommu_uapi_sva_bind_gpasid(dc->domain, iommu_device, (void __user *)arg); } static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data) { struct domain_capsule *dc = (struct domain_capsule *)data; + struct device *iommu_device; + + iommu_device = vfio_get_iommu_device(dc->group, dev); + if (!iommu_device) + return -EINVAL; if (dc->user) { unsigned long arg = *(unsigned long *)dc->data; - iommu_uapi_sva_unbind_gpasid(dc->domain, - dev, (void __user *)arg); + iommu_uapi_sva_unbind_gpasid(dc->domain, iommu_device, + (void __user *)arg); } else { struct iommu_gpasid_bind_data *unbind_data = (struct iommu_gpasid_bind_data *)dc->data; - iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data); + iommu_sva_unbind_gpasid(dc->domain, + iommu_device, unbind_data); } return 0; } @@ -3116,8 +3136,14 @@ static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data) { struct domain_capsule *dc = (struct domain_capsule *)data; unsigned 
long arg = *(unsigned long *)dc->data; + struct device *iommu_device; + + iommu_device = vfio_get_iommu_device(dc->group, dev); + if (!iommu_device) + return -EINVAL; - iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg); + iommu_uapi_cache_invalidate(dc->domain, iommu_device, + (void __user *)arg); return 0; } -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
This patch exposes the PCIe PASID capability to the guest for assigned devices. The existing vfio_pci driver hides it from the guest by setting the capability length to 0 in pci_ext_cap_length[]. This patch only exposes the PASID capability for devices that have the PCIe PASID extended structure in their configuration space. VFs will not expose the PASID capability, as they do not implement the PASID extended structure in their config space; supporting them is a TODO for the future. Related discussion can be found in the link below:

https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v5 -> v6:
*) add Reviewed-by from Eric Auger.

v1 -> v2:
*) added in v2; it was previously sent in a separate patch series
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..07ff2e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
 	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
 	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
 	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
-	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PASID]	=	PCI_EXT_CAP_PASID_SIZEOF,
 };

 /*
--
2.7.4
From: Eric Auger <eric.auger@redhat.com> The VFIO API was enhanced to support nested stage control: a bunch of new ioctls and usage guideline. Let's document the process to follow to set up nested mode. Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> --- v6 -> v7: *) tweak per Eric's comments. v5 -> v6: *) tweak per Eric's comments. v3 -> v4: *) add review-by from Stefan Hajnoczi v2 -> v3: *) address comments from Stefan Hajnoczi v1 -> v2: *) new in v2, compared with Eric's original version, pasid table bind and fault reporting is removed as this series doesn't cover them. Original version from Eric. https://lore.kernel.org/kvm/20200320161911.27494-12-eric.auger@redhat.com/ --- Documentation/driver-api/vfio.rst | 76 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst index f1a4d3c..10851dd 100644 --- a/Documentation/driver-api/vfio.rst +++ b/Documentation/driver-api/vfio.rst @@ -239,6 +239,82 @@ group and can access them as follows:: /* Gratuitous device reset and go... */ ioctl(device, VFIO_DEVICE_RESET); +IOMMU Dual Stage Control +------------------------ + +Some IOMMUs support 2 stages/levels of translation. Stage corresponds +to the ARM terminology while level corresponds to Intel's terminology. +In the following text we use either without distinction. + +This is useful when the guest is exposed with a virtual IOMMU and some +devices are assigned to the guest through VFIO. 
Then the guest OS can
+use stage-1 (GIOVA -> GPA or GVA -> GPA), while the hypervisor uses stage-2
+for VM isolation (GPA -> HPA).
+
+Under dual-stage translation, the guest gets ownership of either the stage-1
+page tables alone, or of both the stage-1 configuration structures and page
+tables; this depends on the vendor. E.g. on Intel platforms, the guest owns
+the stage-1 page tables under nesting, while on ARM the guest owns both the
+stage-1 configuration structures and page tables. The hypervisor owns the
+root configuration structure (for security reasons), including the stage-2
+configuration. This works as long as the configuration structures and page
+table formats are compatible between the virtual IOMMU and the physical
+IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use stage-2, leaving stage-1 available for
+guest usage. The stage-1 format and binding method are reported in the
+nesting capability (VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) through
+VFIO_IOMMU_GET_INFO:
+
+    ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
+
+The nesting cap info is available only after NESTING_IOMMU is selected.
+If the underlying IOMMU doesn't support nesting, VFIO_SET_IOMMU fails and
+userspace should try other IOMMU types. Details of the nesting cap info
+can be found in Documentation/userspace-api/iommu.rst.
+
+Binding the stage-1 page table to the IOMMU differs per platform. On Intel,
+the stage-1 page table info is mediated by userspace for each PASID. On
+ARM, userspace directly passes the GPA of the whole PASID table.
+Currently, only Intel's binding method (IOMMU_NESTING_FEAT_BIND_PGTBL) is
+supported:
+
+    nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
+    memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
+    ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+When multiple stage-1 page tables are supported on a device, each page
+table is associated with a PASID (Process Address Space ID) to distinguish
+it from the others. In this case, userspace should include the PASID in
+bind_data when issuing a direct binding request.
+
+PASIDs may be managed per-device or system-wide, which, again, depends on
+the IOMMU vendor and is reported in the nesting cap info. When a
+system-wide policy is reported (IOMMU_NESTING_FEAT_SYSWIDE_PASID), e.g. on
+Intel platforms, userspace *must* allocate PASIDs from VFIO before
+attempting to bind a stage-1 page table:
+
+    req.flags = VFIO_IOMMU_ALLOC_PASID;
+    ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req);
+
+Once the stage-1 page table is bound to the IOMMU, the guest is free to
+fully manage its mappings. The IOMMU walks the nested stage-1 and stage-2
+page tables when serving DMA requests from the assigned device, and may
+cache the stage-1 mapping in the IOTLB. When required
+(IOMMU_NESTING_FEAT_CACHE_INVLD), userspace *must* forward guest stage-1
+invalidations to the host, so the IOTLB is invalidated:
+
+    nesting_op->flags = VFIO_IOMMU_NESTING_OP_CACHE_INVLD;
+    memcpy(&nesting_op->data, &cache_inv_data, sizeof(cache_inv_data));
+    ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+Forwarded invalidations can happen at various granularity levels (page
+level, context level, etc.).
+
 VFIO User API
 -------------------------------------------------------------------------------
--
2.7.4
This patch changes nesting support to only allow the case where all the physical IOMMU units have the same CAP/ECAP masks for nested translation.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) added per comment from Eric.
   https://lore.kernel.org/kvm/7fe337fa-abbc-82be-c8e8-b9e2a6179b90@redhat.com/
---
 drivers/iommu/intel/iommu.c | 8 ++++++--
 include/linux/intel-iommu.h | 16 ++++++++++++++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 95740b9..38c6c9b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5675,12 +5675,16 @@ static inline bool iommu_pasid_support(void)
 static inline bool nested_mode_support(void)
 {
 	struct dmar_drhd_unit *drhd;
-	struct intel_iommu *iommu;
+	struct intel_iommu *iommu, *prev = NULL;
 	bool ret = true;

 	rcu_read_lock();
 	for_each_active_iommu(iommu, drhd) {
-		if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
+		if (!prev)
+			prev = iommu;
+		if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
+		    (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
+		    (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
 			ret = false;
 			break;
 		}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index db7fc59..e7b7512 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -197,6 +197,22 @@
 #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
 #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */

+/* Nesting Support Capability Alignment */
+#define VTD_CAP_FL1GP		BIT_ULL(56)
+#define VTD_CAP_FL5LP		BIT_ULL(60)
+#define VTD_ECAP_PRS		BIT_ULL(29)
+#define VTD_ECAP_ERS		BIT_ULL(30)
+#define VTD_ECAP_SRS		BIT_ULL(31)
+#define VTD_ECAP_EAFS		BIT_ULL(34)
+#define VTD_ECAP_PASID		BIT_ULL(40)
+
+/* Only capabilities marked in below MASKs are reported */
+#define VTD_CAP_MASK		(VTD_CAP_FL1GP | VTD_CAP_FL5LP)
+
+#define VTD_ECAP_MASK		(VTD_ECAP_PRS | VTD_ECAP_ERS | \
+				 VTD_ECAP_SRS | VTD_ECAP_EAFS | \
+				 VTD_ECAP_PASID)
+
 /* Virtual command interface capability */
 #define vccap_pasid(v)		(((v) & DMA_VCS_PAS)) /* PASID allocation */
--
2.7.4
This patch reports nesting info when iommu_domain_get_attr() is called with DOMAIN_ATTR_NESTING and one domain with nesting set. Cc: Kevin Tian <kevin.tian@intel.com> CC: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> --- v6 -> v7: *) split the patch in v6 into two patches: [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info v2 -> v3: *) remove cap/ecap_mask in iommu_nesting_info. --- drivers/iommu/intel/iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 38c6c9b..e46214e 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -6089,6 +6089,79 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain, return ret; } +static int intel_iommu_get_nesting_info(struct iommu_domain *domain, + struct iommu_nesting_info *info) +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK; + struct device_domain_info *domain_info; + struct iommu_nesting_info_vtd vtd; + unsigned int size; + + if (!info) + return -EINVAL; + + if (domain->type != IOMMU_DOMAIN_UNMANAGED || + !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) + return -ENODEV; + + size = sizeof(struct iommu_nesting_info); + /* + * if provided buffer size is smaller than expected, should + * return 0 and also the expected buffer size to caller. 
+ */ + if (info->argsz < size) { + info->argsz = size; + return 0; + } + + /* + * arbitrary select the first domain_info as all nesting + * related capabilities should be consistent across iommu + * units. + */ + domain_info = list_first_entry(&dmar_domain->devices, + struct device_domain_info, link); + cap &= domain_info->iommu->cap; + ecap &= domain_info->iommu->ecap; + + info->addr_width = dmar_domain->gaw; + info->format = IOMMU_PASID_FORMAT_INTEL_VTD; + info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID | + IOMMU_NESTING_FEAT_BIND_PGTBL | + IOMMU_NESTING_FEAT_CACHE_INVLD; + info->pasid_bits = ilog2(intel_pasid_max_id); + memset(&info->padding, 0x0, 12); + + vtd.flags = 0; + vtd.cap_reg = cap; + vtd.ecap_reg = ecap; + + memcpy(&info->vendor.vtd, &vtd, sizeof(vtd)); + return 0; +} + +static int intel_iommu_domain_get_attr(struct iommu_domain *domain, + enum iommu_attr attr, void *data) +{ + switch (attr) { + case DOMAIN_ATTR_NESTING: + { + struct iommu_nesting_info *info = + (struct iommu_nesting_info *)data; + unsigned long flags; + int ret; + + spin_lock_irqsave(&device_domain_lock, flags); + ret = intel_iommu_get_nesting_info(domain, info); + spin_unlock_irqrestore(&device_domain_lock, flags); + return ret; + } + default: + return -ENOENT; + } +} + /* * Check that the device does not live on an external facing PCI port that is * marked as untrusted. Such devices should not be able to apply quirks and @@ -6111,6 +6184,7 @@ const struct iommu_ops intel_iommu_ops = { .domain_alloc = intel_iommu_domain_alloc, .domain_free = intel_iommu_domain_free, .domain_set_attr = intel_iommu_domain_set_attr, + .domain_get_attr = intel_iommu_domain_get_attr, .attach_dev = intel_iommu_attach_device, .detach_dev = intel_iommu_detach_device, .aux_attach_dev = intel_iommu_aux_attach_device, -- 2.7.4 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Thu, 10 Sep 2020 03:45:18 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > IOMMUs that support nesting translation needs report the capability info > to userspace. It gives information about requirements the userspace needs > to implement plus other features characterizing the physical implementation. > > This patch introduces a new IOMMU UAPI struct that gives information about > the nesting capabilities and features. This struct is supposed to be returned > by iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute parameter, with > one domain whose type has been set to DOMAIN_ATTR_NESTING. > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > --- > v6 -> v7: > *) rephrase the commit message, replace the @data[] field in struct > iommu_nesting_info with union per comments from Eric Auger. > > v5 -> v6: > *) rephrase the feature notes per comments from Eric Auger. > *) rename @size of struct iommu_nesting_info to @argsz. > > v4 -> v5: > *) address comments from Eric Auger. > > v3 -> v4: > *) split the SMMU driver changes to be a separate patch > *) move the @addr_width and @pasid_bits from vendor specific > part to generic part. > *) tweak the description for the @features field of struct > iommu_nesting_info. > *) add description on the @data[] field of struct iommu_nesting_info > > v2 -> v3: > *) remvoe cap/ecap_mask in iommu_nesting_info. > *) reuse DOMAIN_ATTR_NESTING to get nesting info. > *) return an empty iommu_nesting_info for SMMU drivers per Jean' > suggestion. 
> --- > include/uapi/linux/iommu.h | 76 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 76 insertions(+) > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h > index 1ebc23d..ff987e4 100644 > --- a/include/uapi/linux/iommu.h > +++ b/include/uapi/linux/iommu.h > @@ -341,4 +341,80 @@ struct iommu_gpasid_bind_data { > } vendor; > }; > > +/* > + * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info. > + * > + * @flags: VT-d specific flags. Currently reserved for future > + * extension. must be set to 0. > + * @cap_reg: Describe basic capabilities as defined in VT-d capability > + * register. > + * @ecap_reg: Describe the extended capabilities as defined in VT-d > + * extended capability register. > + */ > +struct iommu_nesting_info_vtd { > + __u32 flags; > + __u64 cap_reg; > + __u64 ecap_reg; > +}; The vendor union has 8-byte alignment, so flags here will be 8-byte aligned, followed by a compiler dependent gap before the 8-byte fields. We should fill that gap with padding to make it deterministic for userspace. Thanks, Alex > + > +/* > + * struct iommu_nesting_info - Information for nesting-capable IOMMU. > + * userspace should check it before using > + * nesting capability. > + * > + * @argsz: size of the whole structure. > + * @flags: currently reserved for future extension. must set to 0. > + * @format: PASID table entry format, the same definition as struct > + * iommu_gpasid_bind_data @format. > + * @features: supported nesting features. > + * @addr_width: The output addr width of first level/stage translation > + * @pasid_bits: Maximum supported PASID bits, 0 represents no PASID > + * support. > + * @vendor: vendor specific data, structure type can be deduced from > + * @format field. 
> + * > + * +===============+======================================================+ > + * | feature | Notes | > + * +===============+======================================================+ > + * | SYSWIDE_PASID | IOMMU vendor driver sets it to mandate userspace | > + * | | to allocate PASID from kernel. All PASID allocation | > + * | | free must be mediated through the IOMMU UAPI. | > + * +---------------+------------------------------------------------------+ > + * | BIND_PGTBL | IOMMU vendor driver sets it to mandate userspace to | > + * | | bind the first level/stage page table to associated | > + * | | PASID (either the one specified in bind request or | > + * | | the default PASID of iommu domain), through IOMMU | > + * | | UAPI. | > + * +---------------+------------------------------------------------------+ > + * | CACHE_INVLD | IOMMU vendor driver sets it to mandate userspace to | > + * | | explicitly invalidate the IOMMU cache through IOMMU | > + * | | UAPI according to vendor-specific requirement when | > + * | | changing the 1st level/stage page table. 
| > + * +---------------+------------------------------------------------------+ > + * > + * data struct types defined for @format: > + * +================================+=====================================+ > + * | @format | data struct | > + * +================================+=====================================+ > + * | IOMMU_PASID_FORMAT_INTEL_VTD | struct iommu_nesting_info_vtd | > + * +--------------------------------+-------------------------------------+ > + * > + */ > +struct iommu_nesting_info { > + __u32 argsz; > + __u32 flags; > + __u32 format; > +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0) > +#define IOMMU_NESTING_FEAT_BIND_PGTBL (1 << 1) > +#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2) > + __u32 features; > + __u16 addr_width; > + __u16 pasid_bits; > + __u8 padding[12]; > + /* Vendor specific data */ > + union { > + struct iommu_nesting_info_vtd vtd; > + } vendor; > +}; > + > #endif /* _UAPI_IOMMU_H */
On Thu, 10 Sep 2020 03:45:20 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > This patch exports iommu nesting capability info to user space through > VFIO. Userspace is expected to check this info for supported uAPIs (e.g. > PASID alloc/free, bind page table, and cache invalidation) and the vendor > specific format information for first level/stage page table that will be > bound to. > > The nesting info is available only after container set to be NESTED type. > Current implementation imposes one limitation - one nesting container > should include at most one iommu group. The philosophy of vfio container > is having all groups/devices within the container share the same IOMMU > context. When vSVA is enabled, one IOMMU context could include one 2nd- > level address space and multiple 1st-level address spaces. While the > 2nd-level address space is reasonably sharable by multiple groups, blindly > sharing 1st-level address spaces across all groups within the container > might instead break the guest expectation. In the future sub/super container > concept might be introduced to allow partial address space sharing within > an IOMMU context. But for now let's go with this restriction by requiring > singleton container for using nesting iommu features. Below link has the > related discussion about this decision. > > https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/ > > This patch also changes the NESTING type container behaviour. Something > that would have succeeded before will now fail: Before this series, if > user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even > if the SMMU didn't support stage-2, as the driver would have silently > fallen back on stage-1 mappings (which work exactly the same as stage-2 > only since there was no nesting supported). After the series, we do check > for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING and > the SMMU doesn't support stage-2, the ioctl fails. 
But it should be a good > fix and completely harmless. Detail can be found in below link as well. > > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/ > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > --- > v6 -> v7: > *) using vfio_info_add_capability() for adding nesting cap per suggestion > from Eric. > > v5 -> v6: > *) address comments against v5 from Eric Auger. > *) don't report nesting cap to userspace if the nesting_info->format is > invalid. > > v4 -> v5: > *) address comments from Eric Auger. > *) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as > cap is much "cheap", if needs extension in future, just define another cap. > https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/ > > v3 -> v4: > *) address comments against v3. 
> > v1 -> v2: > *) added in v2 > --- > drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++------ > include/uapi/linux/vfio.h | 19 +++++++++ > 2 files changed, 99 insertions(+), 12 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index c992973..3c0048b 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit, > "Maximum number of user DMA mappings per container (65535)."); > > struct vfio_iommu { > - struct list_head domain_list; > - struct list_head iova_list; > - struct vfio_domain *external_domain; /* domain for external user */ > - struct mutex lock; > - struct rb_root dma_list; > - struct blocking_notifier_head notifier; > - unsigned int dma_avail; > - uint64_t pgsize_bitmap; > - bool v2; > - bool nesting; > - bool dirty_page_tracking; > - bool pinned_page_dirty_scope; > + struct list_head domain_list; > + struct list_head iova_list; > + /* domain for external user */ > + struct vfio_domain *external_domain; > + struct mutex lock; > + struct rb_root dma_list; > + struct blocking_notifier_head notifier; > + unsigned int dma_avail; > + uint64_t pgsize_bitmap; > + bool v2; > + bool nesting; > + bool dirty_page_tracking; > + bool pinned_page_dirty_scope; > + struct iommu_nesting_info *nesting_info; Nit, not as important as the previous alignment, but might as well move this up with the uint64_t pgsize_bitmap with the bools at the end of the structure to avoid adding new gaps. 
> }; > > struct vfio_domain { > @@ -130,6 +132,9 @@ struct vfio_regions { > #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \ > (!list_empty(&iommu->domain_list)) > > +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \ > + (!list_empty(&(iommu)->domain_list))) > + > #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE) > > /* > @@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, > > list_splice_tail(iova_copy, iova); > } > + > +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) > +{ > + kfree(iommu->nesting_info); > + iommu->nesting_info = NULL; > +} > + > static int vfio_iommu_type1_attach_group(void *iommu_data, > struct iommu_group *iommu_group) > { > @@ -2022,6 +2034,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > } > } > > + /* Nesting type container can include only one group */ > + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) { > + mutex_unlock(&iommu->lock); > + return -EINVAL; > + } > + > group = kzalloc(sizeof(*group), GFP_KERNEL); > domain = kzalloc(sizeof(*domain), GFP_KERNEL); > if (!group || !domain) { > @@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > if (ret) > goto out_domain; > > + /* Nesting cap info is available only after attaching */ > + if (iommu->nesting) { > + int size = sizeof(struct iommu_nesting_info); > + > + iommu->nesting_info = kzalloc(size, GFP_KERNEL); > + if (!iommu->nesting_info) { > + ret = -ENOMEM; > + goto out_detach; > + } > + > + /* Now get the nesting info */ > + iommu->nesting_info->argsz = size; > + ret = iommu_domain_get_attr(domain->domain, > + DOMAIN_ATTR_NESTING, > + iommu->nesting_info); > + if (ret) > + goto out_detach; > + } > + > /* Get aperture info */ > iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo); > > @@ -2201,6 +2238,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > return 0; > > out_detach: > + 
vfio_iommu_release_nesting_info(iommu); > vfio_iommu_detach_group(domain, group); > out_domain: > iommu_domain_free(domain->domain); > @@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > vfio_iommu_unmap_unpin_all(iommu); > else > vfio_iommu_unmap_unpin_reaccount(iommu); > + > + vfio_iommu_release_nesting_info(iommu); > } > iommu_domain_free(domain->domain); > list_del(&domain->next); > @@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu, > return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig)); > } > > +static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu, > + struct vfio_info_cap *caps) > +{ > + struct vfio_iommu_type1_info_cap_nesting nesting_cap; > + size_t size; > + > + /* when nesting_info is null, no need to go further */ > + if (!iommu->nesting_info) > + return 0; > + > + /* when @format of nesting_info is 0, fail the call */ > + if (iommu->nesting_info->format == 0) > + return -ENOENT; Should we fail this in the attach_group? Seems the user would be in a bad situation here if they successfully created a nesting container but can't get info. Is there backwards compatibility we're trying to maintain with this? 
> + > + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) + > + iommu->nesting_info->argsz; > + > + nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING; > + nesting_cap.header.version = 1; > + > + memcpy(&nesting_cap.info, iommu->nesting_info, > + iommu->nesting_info->argsz); > + > + return vfio_info_add_capability(caps, &nesting_cap.header, size); > +} > + > static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu, > unsigned long arg) > { > @@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu, > if (!ret) > ret = vfio_iommu_iova_build_caps(iommu, &caps); > > + ret = vfio_iommu_add_nesting_cap(iommu, &caps); Why don't we follow either the naming scheme or the error handling scheme of the previous caps? Seems like this should be: if (!ret) ret = vfio_iommu_nesting_build_caps(...); Thanks, Alex > + > mutex_unlock(&iommu->lock); > > if (ret) > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 9204705..ff40f9e 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -14,6 +14,7 @@ > > #include <linux/types.h> > #include <linux/ioctl.h> > +#include <linux/iommu.h> > > #define VFIO_API_VERSION 0 > > @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration { > __u64 max_dirty_bitmap_size; /* in bytes */ > }; > > +/* > + * The nesting capability allows to report the related capability > + * and info for nesting iommu type. > + * > + * The structures below define version 1 of this capability. > + * > + * Nested capabilities should be checked by the userspace after > + * setting VFIO_TYPE1_NESTING_IOMMU. > + * > + * @info: the nesting info provided by IOMMU driver. 
> + */ > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3 > + > +struct vfio_iommu_type1_info_cap_nesting { > + struct vfio_info_cap_header header; > + struct iommu_nesting_info info; > +}; > + > #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) > > /**
On Thu, 10 Sep 2020 03:45:21 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing > multiple process virtual address spaces with the device for a simplified > programming model. PASID is used to tag a virtual address space in DMA > requests and to identify the related translation structure in IOMMU. When > a PASID-capable device is assigned to a VM, we want the same capability > of using PASID to tag guest process virtual address spaces to achieve > virtual SVA (vSVA). > > PASID management for guest is vendor specific. Some vendors (e.g. Intel > VT-d) require system-wide managed PASIDs across all devices, regardless > of whether a device is used by host or assigned to guest. Other vendors > (e.g. ARM SMMU) may allow PASIDs to be managed per-device and thus fully > delegated to the guest for assigned devices. > > For system-wide managed PASIDs, this patch introduces a vfio module to > handle explicit PASID alloc/free requests from guest. Allocated PASIDs > are associated with a process (or, mm_struct) in IOASID core. A vfio_mm > object is introduced to track mm_struct. Multiple VFIO containers within > a process share the same vfio_mm object. > > A quota mechanism is provided to prevent a malicious user from exhausting > available PASIDs. Currently the quota is a global parameter applied to > all VFIO devices. In the future per-device quota might be supported too. > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Suggested-by: Alex Williamson <alex.williamson@redhat.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Reviewed-by: Eric Auger <eric.auger@redhat.com> > --- > v6 -> v7: > *) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger. 
> > v5 -> v6: > *) address comments from Eric. Add vfio_unlink_pasid() to be consistent > with vfio_unlink_dma(). Add a comment in vfio_pasid_exit(). > > v4 -> v5: > *) address comments from Eric Auger. > *) address the comments from Alex on the pasid free range support. Added > per vfio_mm pasid r-b tree. > https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/ > > v3 -> v4: > *) fix lock leam in vfio_mm_get_from_task() > *) drop pasid_quota field in struct vfio_mm > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when !CONFIG_VFIO_PASID > > v1 -> v2: > *) added in v2, split from the pasid alloc/free support of v1 > --- > drivers/vfio/Kconfig | 5 + > drivers/vfio/Makefile | 1 + > drivers/vfio/vfio_pasid.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++ > include/linux/vfio.h | 28 ++++++ > 4 files changed, 281 insertions(+) > create mode 100644 drivers/vfio/vfio_pasid.c > > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig > index fd17db9..3d8a108 100644 > --- a/drivers/vfio/Kconfig > +++ b/drivers/vfio/Kconfig > @@ -19,6 +19,11 @@ config VFIO_VIRQFD > depends on VFIO && EVENTFD > default n > > +config VFIO_PASID > + tristate > + depends on IOASID && VFIO > + default n > + > menuconfig VFIO > tristate "VFIO Non-Privileged userspace driver framework" > depends on IOMMU_API > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile > index de67c47..bb836a3 100644 > --- a/drivers/vfio/Makefile > +++ b/drivers/vfio/Makefile > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o > > obj-$(CONFIG_VFIO) += vfio.o > obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o > obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o > obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o > obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c > new file mode 100644 > index 0000000..44ecdd5 > --- /dev/null > +++ b/drivers/vfio/vfio_pasid.c > @@ -0,0 +1,247 @@ > +// 
SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (C) 2020 Intel Corporation. > + * Author: Liu Yi L <yi.l.liu@intel.com> > + * > + */ > + > +#include <linux/vfio.h> > +#include <linux/file.h> > +#include <linux/module.h> > +#include <linux/slab.h> > +#include <linux/sched/mm.h> > + > +#define DRIVER_VERSION "0.1" > +#define DRIVER_AUTHOR "Liu Yi L <yi.l.liu@intel.com>" > +#define DRIVER_DESC "PASID management for VFIO bus drivers" > + > +#define VFIO_DEFAULT_PASID_QUOTA 1000 I'm not sure we really need a macro to define this since it's only used once, but a comment discussing the basis for this default value would be useful. Also, since Matthew Rosato is finding it necessary to expose the available DMA mapping counter to userspace, is this also a limitation that userspace might be interested in knowing such that we should plumb it through an IOMMU info capability? > +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA; > +module_param_named(pasid_quota, pasid_quota, uint, 0444); > +MODULE_PARM_DESC(pasid_quota, > + "Set the quota for max number of PASIDs that an application is allowed to request (default 1000)"); > + > +struct vfio_mm_token { > + unsigned long long val; > +}; > + > +struct vfio_mm { > + struct kref kref; > + struct ioasid_set *ioasid_set; > + struct mutex pasid_lock; > + struct rb_root pasid_list; > + struct list_head next; > + struct vfio_mm_token token; > +}; > + > +static struct mutex vfio_mm_lock; > +static struct list_head vfio_mm_list; > + > +struct vfio_pasid { > + struct rb_node node; > + ioasid_t pasid; > +}; > + > +static void vfio_remove_all_pasids(struct vfio_mm *vmm); > + > +/* called with vfio.vfio_mm_lock held */ s/vfio.// > +static void vfio_mm_release(struct kref *kref) > +{ > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref); > + > + list_del(&vmm->next); > + mutex_unlock(&vfio_mm_lock); > + vfio_remove_all_pasids(vmm); > + ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set after 
allocation? Is the question whether each pasid should hold a reference to the set? That really seems like a question internal to the ioasid_alloc/free, but this FIXME needs to be resolved. > + kfree(vmm); > +} > + > +void vfio_mm_put(struct vfio_mm *vmm) > +{ > + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); > +} > + > +static void vfio_mm_get(struct vfio_mm *vmm) > +{ > + kref_get(&vmm->kref); > +} > + > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) > +{ > + struct mm_struct *mm = get_task_mm(task); > + struct vfio_mm *vmm; > + unsigned long long val = (unsigned long long)mm; > + int ret; > + > + mutex_lock(&vfio_mm_lock); > + /* Search existing vfio_mm with current mm pointer */ > + list_for_each_entry(vmm, &vfio_mm_list, next) { > + if (vmm->token.val == val) { > + vfio_mm_get(vmm); > + goto out; > + } > + } > + > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL); > + if (!vmm) { > + vmm = ERR_PTR(-ENOMEM); > + goto out; > + } > + > + /* > + * IOASID core provides a 'IOASID set' concept to track all > + * PASIDs associated with a token. Here we use mm_struct as > + * the token and create a IOASID set per mm_struct. All the > + * containers of the process share the same IOASID set. > + */ > + vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM); > + if (IS_ERR(vmm->ioasid_set)) { > + ret = PTR_ERR(vmm->ioasid_set); > + kfree(vmm); > + vmm = ERR_PTR(ret); > + goto out; This would be a little less convoluted if we had a separate variable to store ioasid_set so that we could free vmm without stashing the error in a temporary variable. Or at least make the stash more obvious by defining the stash variable as something like "tmp" within the scope of this branch. 
> + } > + > + kref_init(&vmm->kref); > + vmm->token.val = val; > + mutex_init(&vmm->pasid_lock); > + vmm->pasid_list = RB_ROOT; > + > + list_add(&vmm->next, &vfio_mm_list); > +out: > + mutex_unlock(&vfio_mm_lock); > + mmput(mm); > + return vmm; > +} > + > +/* > + * Find PASID within @min and @max > + */ > +static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm, > + ioasid_t min, ioasid_t max) > +{ > + struct rb_node *node = vmm->pasid_list.rb_node; > + > + while (node) { > + struct vfio_pasid *vid = rb_entry(node, > + struct vfio_pasid, node); > + > + if (max < vid->pasid) > + node = node->rb_left; > + else if (min > vid->pasid) > + node = node->rb_right; > + else > + return vid; > + } > + > + return NULL; > +} > + > +static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid *new) > +{ > + struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL; > + struct vfio_pasid *vid; > + > + while (*link) { > + parent = *link; > + vid = rb_entry(parent, struct vfio_pasid, node); > + > + if (new->pasid <= vid->pasid) > + link = &(*link)->rb_left; > + else > + link = &(*link)->rb_right; > + } > + > + rb_link_node(&new->node, parent, link); > + rb_insert_color(&new->node, &vmm->pasid_list); > +} > + > +static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid *old) > +{ > + rb_erase(&old->node, &vmm->pasid_list); > +} > + > +static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid *vid) > +{ > + vfio_unlink_pasid(vmm, vid); > + ioasid_free(vmm->ioasid_set, vid->pasid); > + kfree(vid); > +} > + > +static void vfio_remove_all_pasids(struct vfio_mm *vmm) > +{ > + struct rb_node *node; > + > + mutex_lock(&vmm->pasid_lock); > + while ((node = rb_first(&vmm->pasid_list))) > + vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node)); > + mutex_unlock(&vmm->pasid_lock); > +} > + > +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) I might have asked before, but why doesn't this return an ioasid_t and require ioasid_t args? 
Our free function below uses an ioasid_t range, seems rather inconsistent. We can use a BUILD_BUG_ON if we need to test that an ioasid_t fits within our uapi. > +{ > + ioasid_t pasid; > + struct vfio_pasid *vid; > + > + pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL); > + if (pasid == INVALID_IOASID) > + return -ENOSPC; > + > + vid = kzalloc(sizeof(*vid), GFP_KERNEL); > + if (!vid) { > + ioasid_free(vmm->ioasid_set, pasid); > + return -ENOMEM; > + } > + > + vid->pasid = pasid; > + > + mutex_lock(&vmm->pasid_lock); > + vfio_link_pasid(vmm, vid); > + mutex_unlock(&vmm->pasid_lock); > + > + return pasid; > +} > + > +void vfio_pasid_free_range(struct vfio_mm *vmm, > + ioasid_t min, ioasid_t max) > +{ > + struct vfio_pasid *vid = NULL; > + > + /* > + * IOASID core will notify PASID users (e.g. IOMMU driver) to > + * teardown necessary structures depending on the to-be-freed > + * PASID. > + */ > + mutex_lock(&vmm->pasid_lock); > + while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) != NULL is not necessary and isn't consistent with the same time of test in the above rb_first() loop. > + vfio_remove_pasid(vmm, vid); > + mutex_unlock(&vmm->pasid_lock); > +} > + > +static int __init vfio_pasid_init(void) > +{ > + mutex_init(&vfio_mm_lock); > + INIT_LIST_HEAD(&vfio_mm_list); > + return 0; > +} > + > +static void __exit vfio_pasid_exit(void) > +{ > + /* > + * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1 > + * and may be other module. once vfio_pasid_exit() is triggered, > + * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed. > + * All the vfio_mm instances should have been released. If not, > + * means there is vfio_mm leak, should be a bug of user module. > + * So just warn here. > + */ > + WARN_ON(!list_empty(&vfio_mm_list)); Do we need to be using try_module_get/module_put to enforce that we cannot be removed while in use or does that already work correctly via the function references and this is just paranoia? 
If we do exit, I'm not sure what good it does to keep the remaining list entries. Thanks, Alex > +} > + > +module_init(vfio_pasid_init); > +module_exit(vfio_pasid_exit); > + > +MODULE_VERSION(DRIVER_VERSION); > +MODULE_LICENSE("GPL v2"); > +MODULE_AUTHOR(DRIVER_AUTHOR); > +MODULE_DESCRIPTION(DRIVER_DESC); > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > index 38d3c6a..31472a9 100644 > --- a/include/linux/vfio.h > +++ b/include/linux/vfio.h > @@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops); > extern void vfio_unregister_iommu_driver( > const struct vfio_iommu_driver_ops *ops); > > +struct vfio_mm; > +#if IS_ENABLED(CONFIG_VFIO_PASID) > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task); > +extern void vfio_mm_put(struct vfio_mm *vmm); > +extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); > +extern void vfio_pasid_free_range(struct vfio_mm *vmm, > + ioasid_t min, ioasid_t max); > +#else > +static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) > +{ > + return ERR_PTR(-ENOTTY); > +} > + > +static inline void vfio_mm_put(struct vfio_mm *vmm) > +{ > +} > + > +static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > +{ > + return -ENOTTY; > +} > + > +static inline void vfio_pasid_free_range(struct vfio_mm *vmm, > + ioasid_t min, ioasid_t max) > +{ > +} > +#endif /* CONFIG_VFIO_PASID */ > + > /* > * External user API > */
On Thu, 10 Sep 2020 03:45:24 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > This patch allows userspace to request PASID allocation/free, e.g. when > serving the request from the guest. > > PASIDs that are not freed by userspace are automatically freed when the > IOASID set is destroyed when process exits. > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > --- > v6 -> v7: > *) current VFIO returns allocated pasid via signed int, thus VFIO UAPI > can only support 31 bits pasid. If user space gives min,max which is > wider than 31 bits, should fail the allocation or free request. > > v5 -> v6: > *) address comments from Eric against v5. remove the alloc/free helper. > > v4 -> v5: > *) address comments from Eric Auger. > *) the comments for the PASID_FREE request is addressed in patch 5/15 of > this series. > > v3 -> v4: > *) address comments from v3, except the below comment against the range > of PASID_FREE request. needs more help on it. > "> +if (req.range.min > req.range.max) > > Is it exploitable that a user can spin the kernel for a long time in > the case of a free by calling this with [0, MAX_UINT] regardless of > their actual allocations?" 
> https://lore.kernel.org/linux-iommu/20200702151832.048b44d1@x1.home/ > > v1 -> v2: > *) move the vfio_mm related code to be a seprate module > *) use a single structure for alloc/free, could support a range of PASIDs > *) fetch vfio_mm at group_attach time instead of at iommu driver open time > --- > drivers/vfio/Kconfig | 1 + > drivers/vfio/vfio_iommu_type1.c | 76 +++++++++++++++++++++++++++++++++++++++++ > drivers/vfio/vfio_pasid.c | 10 ++++++ > include/linux/vfio.h | 6 ++++ > include/uapi/linux/vfio.h | 43 +++++++++++++++++++++++ > 5 files changed, 136 insertions(+) > > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig > index 3d8a108..95d90c6 100644 > --- a/drivers/vfio/Kconfig > +++ b/drivers/vfio/Kconfig > @@ -2,6 +2,7 @@ > config VFIO_IOMMU_TYPE1 > tristate > depends on VFIO > + select VFIO_PASID if (X86) > default n > > config VFIO_IOMMU_SPAPR_TCE > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index 3c0048b..bd4b668 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -76,6 +76,7 @@ struct vfio_iommu { > bool dirty_page_tracking; > bool pinned_page_dirty_scope; > struct iommu_nesting_info *nesting_info; > + struct vfio_mm *vmm; > }; > > struct vfio_domain { > @@ -2000,6 +2001,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, > > static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) > { > + if (iommu->vmm) { > + vfio_mm_put(iommu->vmm); > + iommu->vmm = NULL; > + } > + > kfree(iommu->nesting_info); > iommu->nesting_info = NULL; > } > @@ -2127,6 +2133,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > iommu->nesting_info); > if (ret) > goto out_detach; > + > + if (iommu->nesting_info->features & > + IOMMU_NESTING_FEAT_SYSWIDE_PASID) { > + struct vfio_mm *vmm; > + struct ioasid_set *set; > + > + vmm = vfio_mm_get_from_task(current); > + if (IS_ERR(vmm)) { > + ret = PTR_ERR(vmm); > + goto out_detach; > + } > + iommu->vmm = 
vmm; > + > + set = vfio_mm_ioasid_set(vmm); > + ret = iommu_domain_set_attr(domain->domain, > + DOMAIN_ATTR_IOASID_SET, > + set); > + if (ret) > + goto out_detach; > + } > } > > /* Get aperture info */ > @@ -2908,6 +2934,54 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, > return -EINVAL; > } > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu, > + unsigned long arg) > +{ > + struct vfio_iommu_type1_pasid_request req; > + unsigned long minsz; > + int ret; > + > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range); > + > + if (copy_from_user(&req, (void __user *)arg, minsz)) > + return -EFAULT; > + > + if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK)) > + return -EINVAL; > + > + /* > + * Current VFIO_IOMMU_PASID_REQUEST only supports at most > + * 31 bits PASID. The min,max value from userspace should > + * not exceed 31 bits. Please describe the source of this restriction. I think it's due to using the ioctl return value to return the PASID, thus excluding the negative values, but aren't we actually restricted to pasid_bits exposed in the nesting_info? If this is just a sanity test for the API then why are we defining VFIO_IOMMU_PASID_BITS in the uapi header, which causes conflicting information to the user... which do they honor? Should we instead verify that pasid_bits matches our API scheme when configuring the nested domain and then let the ioasid allocator reject requests outside of the range? > + */ > + if (req.range.min > req.range.max || > + req.range.min > (1 << VFIO_IOMMU_PASID_BITS) || > + req.range.max > (1 << VFIO_IOMMU_PASID_BITS)) Off by one, >= for the bit test. 
> + return -EINVAL; > + > + mutex_lock(&iommu->lock); > + if (!iommu->vmm) { > + mutex_unlock(&iommu->lock); > + return -EOPNOTSUPP; > + } > + > + switch (req.flags & VFIO_PASID_REQUEST_MASK) { > + case VFIO_IOMMU_FLAG_ALLOC_PASID: > + ret = vfio_pasid_alloc(iommu->vmm, req.range.min, > + req.range.max); > + break; > + case VFIO_IOMMU_FLAG_FREE_PASID: > + vfio_pasid_free_range(iommu->vmm, req.range.min, > + req.range.max); > + ret = 0; Set the initial value when it's declared? > + break; > + default: > + ret = -EINVAL; > + } > + mutex_unlock(&iommu->lock); > + return ret; > +} > + > static long vfio_iommu_type1_ioctl(void *iommu_data, > unsigned int cmd, unsigned long arg) > { > @@ -2924,6 +2998,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, > return vfio_iommu_type1_unmap_dma(iommu, arg); > case VFIO_IOMMU_DIRTY_PAGES: > return vfio_iommu_type1_dirty_pages(iommu, arg); > + case VFIO_IOMMU_PASID_REQUEST: > + return vfio_iommu_type1_pasid_request(iommu, arg); > default: > return -ENOTTY; > } > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c > index 44ecdd5..0ec4660 100644 > --- a/drivers/vfio/vfio_pasid.c > +++ b/drivers/vfio/vfio_pasid.c > @@ -60,6 +60,7 @@ void vfio_mm_put(struct vfio_mm *vmm) > { > kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); > } > +EXPORT_SYMBOL_GPL(vfio_mm_put); > > static void vfio_mm_get(struct vfio_mm *vmm) > { > @@ -113,6 +114,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) > mmput(mm); > return vmm; > } > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task); > + > +struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) > +{ > + return vmm->ioasid_set; > +} > +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_set); > > /* > * Find PASID within @min and @max > @@ -201,6 +209,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > > return pasid; > } > +EXPORT_SYMBOL_GPL(vfio_pasid_alloc); > > void vfio_pasid_free_range(struct vfio_mm *vmm, > ioasid_t min, ioasid_t max) > @@ -217,6 +226,7 
@@ void vfio_pasid_free_range(struct vfio_mm *vmm, > vfio_remove_pasid(vmm, vid); > mutex_unlock(&vmm->pasid_lock); > } > +EXPORT_SYMBOL_GPL(vfio_pasid_free_range); > > static int __init vfio_pasid_init(void) > { > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > index 31472a9..5c3d7a8 100644 > --- a/include/linux/vfio.h > +++ b/include/linux/vfio.h > @@ -101,6 +101,7 @@ struct vfio_mm; > #if IS_ENABLED(CONFIG_VFIO_PASID) > extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task); > extern void vfio_mm_put(struct vfio_mm *vmm); > +extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm); > extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); > extern void vfio_pasid_free_range(struct vfio_mm *vmm, > ioasid_t min, ioasid_t max); > @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm) > { > } > > +static inline struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) > +{ > + return -ENOTTY; > +} > + > static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > { > return -ENOTTY; > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index ff40f9e..a4bc42e 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -1172,6 +1172,49 @@ struct vfio_iommu_type1_dirty_bitmap_get { > > #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17) > > +/** > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18, > + * struct vfio_iommu_type1_pasid_request) > + * > + * PASID (Processor Address Space ID) is a PCIe concept for tagging > + * address spaces in DMA requests. When system-wide PASID allocation > + * is required by the underlying iommu driver (e.g. Intel VT-d), this > + * provides an interface for userspace to request pasid alloc/free > + * for its assigned devices. Userspace should check the availability > + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through > + * VFIO_IOMMU_GET_INFO. 
> + * > + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range. > + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range. > + * @range is [min, max], which means both @min and @max are inclusive. > + * ALLOC_PASID and FREE_PASID are mutually exclusive. > + * > + * Current interface supports at most 31 bits PASID bits as returning > + * PASID allocation result via signed int. PCIe spec defines 20 bits > + * for PASID width, so 31 bits is enough. As a result user space should > + * provide min, max no more than 31 bits. Perhaps this is the description I was looking for, but this still conflicts with what I think the user is supposed to do, which is to provide a range within nesting_info.pasid_bits. These seem like implementation details, not uapi. Thanks, Alex > + * returns: allocated PASID value on success, -errno on failure for > + * ALLOC_PASID; > + * 0 for FREE_PASID operation; > + */ > +struct vfio_iommu_type1_pasid_request { > + __u32 argsz; > +#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0) > +#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1) > + __u32 flags; > + struct { > + __u32 min; > + __u32 max; > + } range; > +}; > + > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \ > + VFIO_IOMMU_FLAG_FREE_PASID) > + > +#define VFIO_IOMMU_PASID_BITS 31 > + > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) > + > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */ > > /* _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Thu, 10 Sep 2020 03:45:27 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > Nesting translation allows two-levels/stages page tables, with 1st level > for guest translations (e.g. GVA->GPA), 2nd level for host translations > (e.g. GPA->HPA). This patch adds interface for binding guest page tables > to a PASID. This PASID must have been allocated by the userspace before > the binding request. > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > --- > v6 -> v7: > *) introduced @user in struct domain_capsule to simplify the code per Eric's > suggestion. > *) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace. > *) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn(). > > v5 -> v6: > *) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule(). > per comment from Eric. > *) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in > linux/iommu.h for userspace operation and in-kernel operation. > > v3 -> v4: > *) address comments from Alex on v3 > > v2 -> v3: > *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO > https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun.pan@linux.intel.com/ > > v1 -> v2: > *) rename subject from "vfio/type1: Bind guest page tables to host" > *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/ > unbind guet page table > *) replaced vfio_iommu_for_each_dev() with a group level loop since this > series enforces one group per container w/ nesting type as start. 
> *) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn() > *) vfio_dev_unbind_gpasid() always successful > *) use vfio_mm->pasid_lock to avoid race between PASID free and page table > bind/unbind > --- > drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++++ > drivers/vfio/vfio_pasid.c | 26 +++++++ > include/linux/vfio.h | 20 +++++ > include/uapi/linux/vfio.h | 36 +++++++++ > 4 files changed, 245 insertions(+) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index bd4b668..11f1156 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -149,6 +149,39 @@ struct vfio_regions { > #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) > #define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) > > +struct domain_capsule { > + struct vfio_group *group; > + struct iommu_domain *domain; > + /* set if @data contains a user pointer*/ > + bool user; > + void *data; > +}; Put the hole in the structure at the end, but I suspect we might lose the user field when the internal api drops the unnecessary structure for unbind anyway. > + > +/* iommu->lock must be held */ > +static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu, > + struct domain_capsule *dc) > +{ > + struct vfio_domain *domain = NULL; > + struct vfio_group *group = NULL; Unnecessary initialization. > + > + if (!iommu->nesting_info) > + return -EINVAL; > + > + /* > + * Only support singleton container with nesting type. If > + * nesting_info is non-NULL, the container is non-empty. > + * Also domain is non-empty. 
> + */ > + domain = list_first_entry(&iommu->domain_list, > + struct vfio_domain, next); > + group = list_first_entry(&domain->group_list, > + struct vfio_group, next); > + dc->group = group; > + dc->domain = domain->domain; > + dc->user = true; > + return 0; > +} > + > static int put_pfn(unsigned long pfn, int prot); > > static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu, > @@ -2405,6 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu, > return ret; > } > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data) > +{ > + struct domain_capsule *dc = (struct domain_capsule *)data; > + unsigned long arg = *(unsigned long *)dc->data; > + > + return iommu_uapi_sva_bind_gpasid(dc->domain, dev, > + (void __user *)arg); > +} > + > +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data) > +{ > + struct domain_capsule *dc = (struct domain_capsule *)data; > + > + if (dc->user) { > + unsigned long arg = *(unsigned long *)dc->data; > + > + iommu_uapi_sva_unbind_gpasid(dc->domain, > + dev, (void __user *)arg); > + } else { > + struct iommu_gpasid_bind_data *unbind_data = > + (struct iommu_gpasid_bind_data *)dc->data; > + > + iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data); > + } > + return 0; > +} > + > +static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data) > +{ > + struct domain_capsule *dc = (struct domain_capsule *)data; > + struct iommu_gpasid_bind_data unbind_data; > + > + unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data); > + unbind_data.flags = 0; > + unbind_data.hpasid = pasid; As in thread with Jacob, this all seems a little excessive for an internal api callback that requires one arg. 
> + > + dc->user = false; > + dc->data = &unbind_data; > + > + iommu_group_for_each_dev(dc->group->iommu_group, > + dc, vfio_dev_unbind_gpasid_fn); > +} > + > static void vfio_iommu_type1_detach_group(void *iommu_data, > struct iommu_group *iommu_group) > { > @@ -2448,6 +2524,20 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > if (!group) > continue; > > + if (iommu->vmm && (iommu->nesting_info->features & > + IOMMU_NESTING_FEAT_BIND_PGTBL)) { > + struct domain_capsule dc = { .group = group, > + .domain = domain->domain, > + .data = NULL }; > + > + /* > + * Unbind page tables bound with system wide PASIDs > + * which are allocated to userspace. > + */ > + vfio_mm_for_each_pasid(iommu->vmm, &dc, > + vfio_group_unbind_gpasid_fn); > + } > + > vfio_iommu_detach_group(domain, group); > update_dirty_scope = !group->pinned_page_dirty_scope; > list_del(&group->next); > @@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu, > return ret; > } > > +static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu, > + bool is_bind, unsigned long arg) > +{ > + struct domain_capsule dc = { .data = &arg }; > + struct iommu_nesting_info *info; > + int ret; > + > + mutex_lock(&iommu->lock); > + > + info = iommu->nesting_info; > + if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) { > + ret = -EOPNOTSUPP; > + goto out_unlock; > + } > + > + if (!iommu->vmm) { > + ret = -EINVAL; > + goto out_unlock; > + } > + > + ret = vfio_prepare_nesting_domain_capsule(iommu, &dc); > + if (ret) > + goto out_unlock; > + > + /* Avoid race with other containers within the same process */ > + vfio_mm_pasid_lock(iommu->vmm); > + > + if (is_bind) > + ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc, > + vfio_dev_bind_gpasid_fn); > + if (ret || !is_bind) > + iommu_group_for_each_dev(dc.group->iommu_group, > + &dc, vfio_dev_unbind_gpasid_fn); > + > + vfio_mm_pasid_unlock(iommu->vmm); > +out_unlock: > + mutex_unlock(&iommu->lock); > + 
return ret; > +} > + > +static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu, > + unsigned long arg) > +{ > + struct vfio_iommu_type1_nesting_op hdr; > + unsigned int minsz; > + int ret; > + > + minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags); > + > + if (copy_from_user(&hdr, (void __user *)arg, minsz)) > + return -EFAULT; > + > + if (hdr.argsz < minsz || > + hdr.flags & ~VFIO_NESTING_OP_MASK || > + (hdr.flags & VFIO_NESTING_OP_MASK) >= VFIO_IOMMU_NESTING_OP_NUM) Isn't this redundant to the default switch case? > + return -EINVAL; > + > + switch (hdr.flags & VFIO_NESTING_OP_MASK) { > + case VFIO_IOMMU_NESTING_OP_BIND_PGTBL: > + ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz); > + break; > + case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL: > + ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz); > + break; > + default: > + ret = -EINVAL; > + } > + > + return ret; > +} > + > static long vfio_iommu_type1_ioctl(void *iommu_data, > unsigned int cmd, unsigned long arg) > { > @@ -3000,6 +3161,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, > return vfio_iommu_type1_dirty_pages(iommu, arg); > case VFIO_IOMMU_PASID_REQUEST: > return vfio_iommu_type1_pasid_request(iommu, arg); > + case VFIO_IOMMU_NESTING_OP: > + return vfio_iommu_type1_nesting_op(iommu, arg); > default: > return -ENOTTY; > } > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c > index 0ec4660..9e2e4b0 100644 > --- a/drivers/vfio/vfio_pasid.c > +++ b/drivers/vfio/vfio_pasid.c > @@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, > * IOASID core will notify PASID users (e.g. IOMMU driver) to > * teardown necessary structures depending on the to-be-freed > * PASID. > + * Hold pasid_lock also avoids race with PASID usages like bind/ > + * unbind page tables to requested PASID. 
> */ > mutex_lock(&vmm->pasid_lock); > while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) > @@ -228,6 +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, > } > EXPORT_SYMBOL_GPL(vfio_pasid_free_range); > > +int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > + void (*fn)(ioasid_t id, void *data)) > +{ > + int ret; > + > + mutex_lock(&vmm->pasid_lock); > + ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data); > + mutex_unlock(&vmm->pasid_lock); > + return ret; > +} > +EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid); > + > +void vfio_mm_pasid_lock(struct vfio_mm *vmm) > +{ > + mutex_lock(&vmm->pasid_lock); > +} > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock); > + > +void vfio_mm_pasid_unlock(struct vfio_mm *vmm) > +{ > + mutex_unlock(&vmm->pasid_lock); > +} > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock); > + > static int __init vfio_pasid_init(void) > { > mutex_init(&vfio_mm_lock); > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > index 5c3d7a8..6a999c3 100644 > --- a/include/linux/vfio.h > +++ b/include/linux/vfio.h > @@ -105,6 +105,11 @@ extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm); > extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); > extern void vfio_pasid_free_range(struct vfio_mm *vmm, > ioasid_t min, ioasid_t max); > +extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > + void (*fn)(ioasid_t id, void *data)); > +extern void vfio_mm_pasid_lock(struct vfio_mm *vmm); > +extern void vfio_mm_pasid_unlock(struct vfio_mm *vmm); > + > #else > static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) > { > @@ -129,6 +134,21 @@ static inline void vfio_pasid_free_range(struct vfio_mm *vmm, > ioasid_t min, ioasid_t max) > { > } > + > +static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > + void (*fn)(ioasid_t id, void *data)) > +{ > + return -ENOTTY; > +} > + > +static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm) > +{ > +} > + > +static inline void 
vfio_mm_pasid_unlock(struct vfio_mm *vmm) > +{ > +} > + > #endif /* CONFIG_VFIO_PASID */ > > /* > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index a4bc42e..a99bd71 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request { > > #define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) > > +/** > + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19, > + * struct vfio_iommu_type1_nesting_op) > + * > + * This interface allows userspace to utilize the nesting IOMMU > + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING > + * cap through VFIO_IOMMU_GET_INFO. For platforms which require > + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID > + * _REQUEST. > + * > + * @data[] types defined for each op: > + * +=================+===============================================+ > + * | NESTING OP | @data[] | > + * +=================+===============================================+ > + * | BIND_PGTBL | struct iommu_gpasid_bind_data | > + * +-----------------+-----------------------------------------------+ > + * | UNBIND_PGTBL | struct iommu_gpasid_bind_data | > + * +-----------------+-----------------------------------------------+ > + * > + * returns: 0 on success, -errno on failure. > + */ > +struct vfio_iommu_type1_nesting_op { > + __u32 argsz; > + __u32 flags; > +#define VFIO_NESTING_OP_MASK (0xffff) /* lower 16-bits for op */ > + __u8 data[]; > +}; > + > +enum { > + VFIO_IOMMU_NESTING_OP_BIND_PGTBL, > + VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL, > + VFIO_IOMMU_NESTING_OP_NUM, > +}; "VFIO_IOMMU_NESTING_NUM_OPS" would be more consistent with the vfio uapi. 
Thanks, Alex > + > +#define VFIO_IOMMU_NESTING_OP _IO(VFIO_TYPE, VFIO_BASE + 19) > + > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */ > > /*
On Thu, 10 Sep 2020 03:45:30 -0700 Liu Yi L <yi.l.liu@intel.com> wrote: > This patch exposes PCIe PASID capability to guest for assigned devices. > Existing vfio_pci driver hides it from guest by setting the capability > length as 0 in pci_ext_cap_length[]. This exposes the PASID capability, but it's still read-only, so this largely just helps userspace know where to emulate the capability, right? Thanks, Alex > And this patch only exposes PASID capability for devices which have PCIe > PASID extended structure in its configuration space. VFs will not expose > the PASID capability as they do not implement the PASID extended structure > in their config space. It is a TODO in future. Related discussion can be > found in below link: > > https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/ > > Cc: Kevin Tian <kevin.tian@intel.com> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Lu Baolu <baolu.lu@linux.intel.com> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Reviewed-by: Eric Auger <eric.auger@redhat.com> > --- > v5 -> v6: > *) add review-by from Eric Auger. 
> > v1 -> v2: > *) added in v2, but it was sent in a separate patchseries before > --- > drivers/vfio/pci/vfio_pci_config.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c > index d98843f..07ff2e6 100644 > --- a/drivers/vfio/pci/vfio_pci_config.c > +++ b/drivers/vfio/pci/vfio_pci_config.c > @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = { > [PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF, > [PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */ > [PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */ > - [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */ > + [PCI_EXT_CAP_ID_PASID] = PCI_EXT_CAP_PASID_SIZEOF, > }; > > /*
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, September 12, 2020 6:04 AM > > On Thu, 10 Sep 2020 03:45:27 -0700 > Liu Yi L <yi.l.liu@intel.com> wrote: > > > Nesting translation allows two-levels/stages page tables, with 1st > > level for guest translations (e.g. GVA->GPA), 2nd level for host > > translations (e.g. GPA->HPA). This patch adds interface for binding > > guest page tables to a PASID. This PASID must have been allocated by > > the userspace before the binding request. > > > > Cc: Kevin Tian <kevin.tian@intel.com> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Cc: Alex Williamson <alex.williamson@redhat.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Cc: Joerg Roedel <joro@8bytes.org> > > Cc: Lu Baolu <baolu.lu@linux.intel.com> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > > --- > > v6 -> v7: > > *) introduced @user in struct domain_capsule to simplify the code per Eric's > > suggestion. > > *) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace. > > *) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn(). > > > > v5 -> v6: > > *) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule(). > > per comment from Eric. > > *) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in > > linux/iommu.h for userspace operation and in-kernel operation. 
> > > > v3 -> v4: > > *) address comments from Alex on v3 > > > > v2 -> v3: > > *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO > > > > https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email- > > jacob.jun.pan@linux.intel.com/ > > > > v1 -> v2: > > *) rename subject from "vfio/type1: Bind guest page tables to host" > > *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support > bind/ > > unbind guet page table > > *) replaced vfio_iommu_for_each_dev() with a group level loop since this > > series enforces one group per container w/ nesting type as start. > > *) rename vfio_bind/unbind_gpasid_fn() to > > vfio_dev_bind/unbind_gpasid_fn() > > *) vfio_dev_unbind_gpasid() always successful > > *) use vfio_mm->pasid_lock to avoid race between PASID free and page table > > bind/unbind > > --- > > drivers/vfio/vfio_iommu_type1.c | 163 > ++++++++++++++++++++++++++++++++++++++++ > > drivers/vfio/vfio_pasid.c | 26 +++++++ > > include/linux/vfio.h | 20 +++++ > > include/uapi/linux/vfio.h | 36 +++++++++ > > 4 files changed, 245 insertions(+) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > b/drivers/vfio/vfio_iommu_type1.c index bd4b668..11f1156 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -149,6 +149,39 @@ struct vfio_regions { > > #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) > > #define DIRTY_BITMAP_SIZE_MAX > DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) > > > > +struct domain_capsule { > > + struct vfio_group *group; > > + struct iommu_domain *domain; > > + /* set if @data contains a user pointer*/ > > + bool user; > > + void *data; > > +}; > > Put the hole in the structure at the end, but I suspect we might lose the user field > when the internal api drops the unnecessary structure for unbind anyway. I see. will move @user and its comment to the end of this struct. As it's used to imply the @data field user pointer or not, I guess it's still useful to keep it. 
The difference would be the @data is a pasid not a bind_data struct. > > + > > +/* iommu->lock must be held */ > > +static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu, > > + struct domain_capsule *dc) { > > + struct vfio_domain *domain = NULL; > > + struct vfio_group *group = NULL; > > Unnecessary initialization. will remove them. :-) > > + > > + if (!iommu->nesting_info) > > + return -EINVAL; > > + > > + /* > > + * Only support singleton container with nesting type. If > > + * nesting_info is non-NULL, the container is non-empty. > > + * Also domain is non-empty. > > + */ > > + domain = list_first_entry(&iommu->domain_list, > > + struct vfio_domain, next); > > + group = list_first_entry(&domain->group_list, > > + struct vfio_group, next); > > + dc->group = group; > > + dc->domain = domain->domain; > > + dc->user = true; > > + return 0; > > +} > > + > > static int put_pfn(unsigned long pfn, int prot); > > > > static struct vfio_group *vfio_iommu_find_iommu_group(struct > > vfio_iommu *iommu, @@ -2405,6 +2438,49 @@ static int > vfio_iommu_resv_refresh(struct vfio_iommu *iommu, > > return ret; > > } > > > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data) { > > + struct domain_capsule *dc = (struct domain_capsule *)data; > > + unsigned long arg = *(unsigned long *)dc->data; > > + > > + return iommu_uapi_sva_bind_gpasid(dc->domain, dev, > > + (void __user *)arg); > > +} > > + > > +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data) > > +{ > > + struct domain_capsule *dc = (struct domain_capsule *)data; > > + > > + if (dc->user) { > > + unsigned long arg = *(unsigned long *)dc->data; > > + > > + iommu_uapi_sva_unbind_gpasid(dc->domain, > > + dev, (void __user *)arg); > > + } else { > > + struct iommu_gpasid_bind_data *unbind_data = > > + (struct iommu_gpasid_bind_data *)dc->data; > > + > > + iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data); > > + } > > + return 0; > > +} > > + > > +static void 
vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data) { > > + struct domain_capsule *dc = (struct domain_capsule *)data; > > + struct iommu_gpasid_bind_data unbind_data; > > + > > + unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data); > > + unbind_data.flags = 0; > > + unbind_data.hpasid = pasid; > > > As in thread with Jacob, this all seems a little excessive for an internal api callback > that requires one arg. yep, Jacob informed me about that change. > > > + > > + dc->user = false; > > + dc->data = &unbind_data; > > + > > + iommu_group_for_each_dev(dc->group->iommu_group, > > + dc, vfio_dev_unbind_gpasid_fn); > > +} > > + > > static void vfio_iommu_type1_detach_group(void *iommu_data, > > struct iommu_group *iommu_group) > { @@ -2448,6 +2524,20 @@ > > static void vfio_iommu_type1_detach_group(void *iommu_data, > > if (!group) > > continue; > > > > + if (iommu->vmm && (iommu->nesting_info->features & > > + IOMMU_NESTING_FEAT_BIND_PGTBL)) { > > + struct domain_capsule dc = { .group = group, > > + .domain = domain->domain, > > + .data = NULL }; > > + > > + /* > > + * Unbind page tables bound with system wide PASIDs > > + * which are allocated to userspace. 
> > + */ > > + vfio_mm_for_each_pasid(iommu->vmm, &dc, > > + vfio_group_unbind_gpasid_fn); > > + } > > + > > vfio_iommu_detach_group(domain, group); > > update_dirty_scope = !group->pinned_page_dirty_scope; > > list_del(&group->next); > > @@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct > vfio_iommu *iommu, > > return ret; > > } > > > > +static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu, > > + bool is_bind, unsigned long arg) { > > + struct domain_capsule dc = { .data = &arg }; > > + struct iommu_nesting_info *info; > > + int ret; > > + > > + mutex_lock(&iommu->lock); > > + > > + info = iommu->nesting_info; > > + if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) { > > + ret = -EOPNOTSUPP; > > + goto out_unlock; > > + } > > + > > + if (!iommu->vmm) { > > + ret = -EINVAL; > > + goto out_unlock; > > + } > > + > > + ret = vfio_prepare_nesting_domain_capsule(iommu, &dc); > > + if (ret) > > + goto out_unlock; > > + > > + /* Avoid race with other containers within the same process */ > > + vfio_mm_pasid_lock(iommu->vmm); > > + > > + if (is_bind) > > + ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc, > > + vfio_dev_bind_gpasid_fn); > > + if (ret || !is_bind) > > + iommu_group_for_each_dev(dc.group->iommu_group, > > + &dc, vfio_dev_unbind_gpasid_fn); > > + > > + vfio_mm_pasid_unlock(iommu->vmm); > > +out_unlock: > > + mutex_unlock(&iommu->lock); > > + return ret; > > +} > > + > > +static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu, > > + unsigned long arg) > > +{ > > + struct vfio_iommu_type1_nesting_op hdr; > > + unsigned int minsz; > > + int ret; > > + > > + minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags); > > + > > + if (copy_from_user(&hdr, (void __user *)arg, minsz)) > > + return -EFAULT; > > + > > + if (hdr.argsz < minsz || > > + hdr.flags & ~VFIO_NESTING_OP_MASK || > > + (hdr.flags & VFIO_NESTING_OP_MASK) >= > VFIO_IOMMU_NESTING_OP_NUM) > > > Isn't this redundant to the default 
switch case? oh, yes. From sanity chek p.o.v, it looks to be necessary to put the flags check here. but it also makes the default switch case to be a dead code. perhaps, I could remove the check against the OP_NUM and keep the switch case. how about your opinion? > > > + return -EINVAL; > > + > > + switch (hdr.flags & VFIO_NESTING_OP_MASK) { > > + case VFIO_IOMMU_NESTING_OP_BIND_PGTBL: > > + ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz); > > + break; > > + case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL: > > + ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz); > > + break; > > + default: > > + ret = -EINVAL; > > + } > > + > > + return ret; > > +} > > + > > static long vfio_iommu_type1_ioctl(void *iommu_data, > > unsigned int cmd, unsigned long arg) { @@ - > 3000,6 +3161,8 @@ > > static long vfio_iommu_type1_ioctl(void *iommu_data, > > return vfio_iommu_type1_dirty_pages(iommu, arg); > > case VFIO_IOMMU_PASID_REQUEST: > > return vfio_iommu_type1_pasid_request(iommu, arg); > > + case VFIO_IOMMU_NESTING_OP: > > + return vfio_iommu_type1_nesting_op(iommu, arg); > > default: > > return -ENOTTY; > > } > > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c > > index 0ec4660..9e2e4b0 100644 > > --- a/drivers/vfio/vfio_pasid.c > > +++ b/drivers/vfio/vfio_pasid.c > > @@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, > > * IOASID core will notify PASID users (e.g. IOMMU driver) to > > * teardown necessary structures depending on the to-be-freed > > * PASID. > > + * Hold pasid_lock also avoids race with PASID usages like bind/ > > + * unbind page tables to requested PASID. 
> > */ > > mutex_lock(&vmm->pasid_lock); > > while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) @@ -228,6 > > +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, } > > EXPORT_SYMBOL_GPL(vfio_pasid_free_range); > > > > +int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > > + void (*fn)(ioasid_t id, void *data)) { > > + int ret; > > + > > + mutex_lock(&vmm->pasid_lock); > > + ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data); > > + mutex_unlock(&vmm->pasid_lock); > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid); > > + > > +void vfio_mm_pasid_lock(struct vfio_mm *vmm) { > > + mutex_lock(&vmm->pasid_lock); > > +} > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock); > > + > > +void vfio_mm_pasid_unlock(struct vfio_mm *vmm) { > > + mutex_unlock(&vmm->pasid_lock); > > +} > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock); > > + > > static int __init vfio_pasid_init(void) { > > mutex_init(&vfio_mm_lock); > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h index > > 5c3d7a8..6a999c3 100644 > > --- a/include/linux/vfio.h > > +++ b/include/linux/vfio.h > > @@ -105,6 +105,11 @@ extern struct ioasid_set > > *vfio_mm_ioasid_set(struct vfio_mm *vmm); extern int > > vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); extern void > vfio_pasid_free_range(struct vfio_mm *vmm, > > ioasid_t min, ioasid_t max); > > +extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > > + void (*fn)(ioasid_t id, void *data)); extern void > > +vfio_mm_pasid_lock(struct vfio_mm *vmm); extern void > > +vfio_mm_pasid_unlock(struct vfio_mm *vmm); > > + > > #else > > static inline struct vfio_mm *vfio_mm_get_from_task(struct > > task_struct *task) { @@ -129,6 +134,21 @@ static inline void > > vfio_pasid_free_range(struct vfio_mm *vmm, > > ioasid_t min, ioasid_t max) > > { > > } > > + > > +static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data, > > + void (*fn)(ioasid_t id, void *data)) { > > + return -ENOTTY; > > +} > > + > 
> +static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm) { } > > + > > +static inline void vfio_mm_pasid_unlock(struct vfio_mm *vmm) { } > > + > > #endif /* CONFIG_VFIO_PASID */ > > > > /* > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index a4bc42e..a99bd71 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request { > > > > #define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) > > > > +/** > > + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19, > > + * struct vfio_iommu_type1_nesting_op) > > + * > > + * This interface allows userspace to utilize the nesting IOMMU > > + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING > > + * cap through VFIO_IOMMU_GET_INFO. For platforms which require > > + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID > > + * _REQUEST. > > + * > > + * @data[] types defined for each op: > > + * > +=================+===============================================+ > > + * | NESTING OP | @data[] | > > + * > +=================+===============================================+ > > + * | BIND_PGTBL | struct iommu_gpasid_bind_data | > > + * +-----------------+-----------------------------------------------+ > > + * | UNBIND_PGTBL | struct iommu_gpasid_bind_data | > > + * > > ++-----------------+-----------------------------------------------+ > > + * > > + * returns: 0 on success, -errno on failure. > > + */ > > +struct vfio_iommu_type1_nesting_op { > > + __u32 argsz; > > + __u32 flags; > > +#define VFIO_NESTING_OP_MASK (0xffff) /* lower 16-bits for op */ > > + __u8 data[]; > > +}; > > + > > +enum { > > + VFIO_IOMMU_NESTING_OP_BIND_PGTBL, > > + VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL, > > + VFIO_IOMMU_NESTING_OP_NUM, > > +}; > > "VFIO_IOMMU_NESTING_NUM_OPS" would be more consistent with the vfio uapi. I see. will rename it if we decide to keep it. 
Regards, Yi Liu > Thanks, > > Alex > > > + > > +#define VFIO_IOMMU_NESTING_OP _IO(VFIO_TYPE, VFIO_BASE + 19) > > + > > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU > > -------- */ > > > > /* _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, September 12, 2020 5:38 AM > > On Thu, 10 Sep 2020 03:45:24 -0700 > Liu Yi L <yi.l.liu@intel.com> wrote: > > > This patch allows userspace to request PASID allocation/free, e.g. when > > serving the request from the guest. > > > > PASIDs that are not freed by userspace are automatically freed when the > > IOASID set is destroyed when process exits. > > > > Cc: Kevin Tian <kevin.tian@intel.com> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Cc: Alex Williamson <alex.williamson@redhat.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Cc: Joerg Roedel <joro@8bytes.org> > > Cc: Lu Baolu <baolu.lu@linux.intel.com> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > > --- > > v6 -> v7: > > *) current VFIO returns allocated pasid via signed int, thus VFIO UAPI > > can only support 31 bits pasid. If user space gives min,max which is > > wider than 31 bits, should fail the allocation or free request. > > > > v5 -> v6: > > *) address comments from Eric against v5. remove the alloc/free helper. > > > > v4 -> v5: > > *) address comments from Eric Auger. > > *) the comments for the PASID_FREE request is addressed in patch 5/15 of > > this series. > > > > v3 -> v4: > > *) address comments from v3, except the below comment against the range > > of PASID_FREE request. needs more help on it. > > "> +if (req.range.min > req.range.max) > > > > Is it exploitable that a user can spin the kernel for a long time in > > the case of a free by calling this with [0, MAX_UINT] regardless of > > their actual allocations?" 
> > https://lore.kernel.org/linux-iommu/20200702151832.048b44d1@x1.home/ > > > > v1 -> v2: > > *) move the vfio_mm related code to be a seprate module > > *) use a single structure for alloc/free, could support a range of PASIDs > > *) fetch vfio_mm at group_attach time instead of at iommu driver open time > > --- > > drivers/vfio/Kconfig | 1 + > > drivers/vfio/vfio_iommu_type1.c | 76 > +++++++++++++++++++++++++++++++++++++++++ > > drivers/vfio/vfio_pasid.c | 10 ++++++ > > include/linux/vfio.h | 6 ++++ > > include/uapi/linux/vfio.h | 43 +++++++++++++++++++++++ > > 5 files changed, 136 insertions(+) > > > > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig > > index 3d8a108..95d90c6 100644 > > --- a/drivers/vfio/Kconfig > > +++ b/drivers/vfio/Kconfig > > @@ -2,6 +2,7 @@ > > config VFIO_IOMMU_TYPE1 > > tristate > > depends on VFIO > > + select VFIO_PASID if (X86) > > default n > > > > config VFIO_IOMMU_SPAPR_TCE > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > index 3c0048b..bd4b668 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -76,6 +76,7 @@ struct vfio_iommu { > > bool dirty_page_tracking; > > bool pinned_page_dirty_scope; > > struct iommu_nesting_info *nesting_info; > > + struct vfio_mm *vmm; > > }; > > > > struct vfio_domain { > > @@ -2000,6 +2001,11 @@ static void vfio_iommu_iova_insert_copy(struct > vfio_iommu *iommu, > > > > static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) > > { > > + if (iommu->vmm) { > > + vfio_mm_put(iommu->vmm); > > + iommu->vmm = NULL; > > + } > > + > > kfree(iommu->nesting_info); > > iommu->nesting_info = NULL; > > } > > @@ -2127,6 +2133,26 @@ static int vfio_iommu_type1_attach_group(void > *iommu_data, > > iommu->nesting_info); > > if (ret) > > goto out_detach; > > + > > + if (iommu->nesting_info->features & > > + IOMMU_NESTING_FEAT_SYSWIDE_PASID) > { > > + struct vfio_mm *vmm; > > + struct ioasid_set *set; > > + > > + 
vmm = vfio_mm_get_from_task(current); > > + if (IS_ERR(vmm)) { > > + ret = PTR_ERR(vmm); > > + goto out_detach; > > + } > > + iommu->vmm = vmm; > > + > > + set = vfio_mm_ioasid_set(vmm); > > + ret = iommu_domain_set_attr(domain->domain, > > + DOMAIN_ATTR_IOASID_SET, > > + set); > > + if (ret) > > + goto out_detach; > > + } > > } > > > > /* Get aperture info */ > > @@ -2908,6 +2934,54 @@ static int vfio_iommu_type1_dirty_pages(struct > vfio_iommu *iommu, > > return -EINVAL; > > } > > > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu, > > + unsigned long arg) > > +{ > > + struct vfio_iommu_type1_pasid_request req; > > + unsigned long minsz; > > + int ret; > > + > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range); > > + > > + if (copy_from_user(&req, (void __user *)arg, minsz)) > > + return -EFAULT; > > + > > + if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK)) > > + return -EINVAL; > > + > > + /* > > + * Current VFIO_IOMMU_PASID_REQUEST only supports at most > > + * 31 bits PASID. The min,max value from userspace should > > + * not exceed 31 bits. > > Please describe the source of this restriction. I think it's due to > using the ioctl return value to return the PASID, thus excluding the > negative values, but aren't we actually restricted to pasid_bits > exposed in the nesting_info? yes, the description for this restriction is in the uapi/vfio.h. I think you are right. We should restricted to the pasid_bits exposed in the nesting_info. thanks for the spotting. > If this is just a sanity test for the API > then why are we defining VFIO_IOMMU_PASID_BITS in the uapi header, > which causes conflicting information to the user... which do they > honor? yes, it should not be in the uapi header. will fix it. > Should we instead verify that pasid_bits matches our API scheme > when configuring the nested domain and then let the ioasid allocator > reject requests outside of the range? 
agreed, I think it may be checked in the attach_group phase, if pasid_bits from iommu vendor driver are larger than 31 bits, we fail the attach. > > > + */ > > + if (req.range.min > req.range.max || > > + req.range.min > (1 << VFIO_IOMMU_PASID_BITS) || > > + req.range.max > (1 << VFIO_IOMMU_PASID_BITS)) > > Off by one, >= for the bit test. got it. thanks for spotting it. > > + return -EINVAL; > > + > > + mutex_lock(&iommu->lock); > > + if (!iommu->vmm) { > > + mutex_unlock(&iommu->lock); > > + return -EOPNOTSUPP; > > + } > > + > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) { > > + case VFIO_IOMMU_FLAG_ALLOC_PASID: > > + ret = vfio_pasid_alloc(iommu->vmm, req.range.min, > > + req.range.max); > > + break; > > + case VFIO_IOMMU_FLAG_FREE_PASID: > > + vfio_pasid_free_range(iommu->vmm, req.range.min, > > + req.range.max); > > + ret = 0; > > Set the initial value when it's declared? I see, will do. :-) > > + break; > > + default: > > + ret = -EINVAL; > > + } > > + mutex_unlock(&iommu->lock); > > + return ret; > > +} > > + > > static long vfio_iommu_type1_ioctl(void *iommu_data, > > unsigned int cmd, unsigned long arg) > > { > > @@ -2924,6 +2998,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, > > return vfio_iommu_type1_unmap_dma(iommu, arg); > > case VFIO_IOMMU_DIRTY_PAGES: > > return vfio_iommu_type1_dirty_pages(iommu, arg); > > + case VFIO_IOMMU_PASID_REQUEST: > > + return vfio_iommu_type1_pasid_request(iommu, arg); > > default: > > return -ENOTTY; > > } > > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c > > index 44ecdd5..0ec4660 100644 > > --- a/drivers/vfio/vfio_pasid.c > > +++ b/drivers/vfio/vfio_pasid.c > > @@ -60,6 +60,7 @@ void vfio_mm_put(struct vfio_mm *vmm) > > { > > kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); > > } > > +EXPORT_SYMBOL_GPL(vfio_mm_put); > > > > static void vfio_mm_get(struct vfio_mm *vmm) > > { > > @@ -113,6 +114,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct > task_struct *task) > > 
mmput(mm); > > return vmm; > > } > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task); > > + > > +struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) > > +{ > > + return vmm->ioasid_set; > > +} > > +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_set); > > > > /* > > * Find PASID within @min and @max > > @@ -201,6 +209,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > > > > return pasid; > > } > > +EXPORT_SYMBOL_GPL(vfio_pasid_alloc); > > > > void vfio_pasid_free_range(struct vfio_mm *vmm, > > ioasid_t min, ioasid_t max) > > @@ -217,6 +226,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm, > > vfio_remove_pasid(vmm, vid); > > mutex_unlock(&vmm->pasid_lock); > > } > > +EXPORT_SYMBOL_GPL(vfio_pasid_free_range); > > > > static int __init vfio_pasid_init(void) > > { > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > > index 31472a9..5c3d7a8 100644 > > --- a/include/linux/vfio.h > > +++ b/include/linux/vfio.h > > @@ -101,6 +101,7 @@ struct vfio_mm; > > #if IS_ENABLED(CONFIG_VFIO_PASID) > > extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task); > > extern void vfio_mm_put(struct vfio_mm *vmm); > > +extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm); > > extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); > > extern void vfio_pasid_free_range(struct vfio_mm *vmm, > > ioasid_t min, ioasid_t max); > > @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm) > > { > > } > > > > +static inline struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm) > > +{ > > + return -ENOTTY; > > +} > > + > > static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > > { > > return -ENOTTY; > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index ff40f9e..a4bc42e 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -1172,6 +1172,49 @@ struct vfio_iommu_type1_dirty_bitmap_get { > > > > #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, 
VFIO_BASE + 17) > > > > +/** > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18, > > + * struct vfio_iommu_type1_pasid_request) > > + * > > + * PASID (Processor Address Space ID) is a PCIe concept for tagging > > + * address spaces in DMA requests. When system-wide PASID allocation > > + * is required by the underlying iommu driver (e.g. Intel VT-d), this > > + * provides an interface for userspace to request pasid alloc/free > > + * for its assigned devices. Userspace should check the availability > > + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through > > + * VFIO_IOMMU_GET_INFO. > > + * > > + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within > @range. > > + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range. > > + * @range is [min, max], which means both @min and @max are inclusive. > > + * ALLOC_PASID and FREE_PASID are mutually exclusive. > > + * > > + * Current interface supports at most 31 bits PASID bits as returning > > + * PASID allocation result via signed int. PCIe spec defines 20 bits > > + * for PASID width, so 31 bits is enough. As a result user space should > > + * provide min, max no more than 31 bits. > > Perhaps this is the description I was looking for, but this still > conflicts with what I think the user is supposed to do, which is to > provide a range within nesting_info.pasid_bits. These seem like > implementation details, not uapi. Thanks, agreed, I may move this comment (after refining :-)) to the place attach_group where we check if pasid_bits matches our API scheme, and just put a comment "userspace should provide a range within nesting_info.pasid_bits" here. 
Regards, Yi Liu > Alex > > > + * returns: allocated PASID value on success, -errno on failure for > > + * ALLOC_PASID; > > + * 0 for FREE_PASID operation; > > + */ > > +struct vfio_iommu_type1_pasid_request { > > + __u32 argsz; > > +#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0) > > +#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1) > > + __u32 flags; > > + struct { > > + __u32 min; > > + __u32 max; > > + } range; > > +}; > > + > > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \ > > + VFIO_IOMMU_FLAG_FREE_PASID) > > + > > +#define VFIO_IOMMU_PASID_BITS 31 > > + > > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18) > > + > > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */ > > > > /*
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, September 12, 2020 6:13 AM > > On Thu, 10 Sep 2020 03:45:30 -0700 > Liu Yi L <yi.l.liu@intel.com> wrote: > > > This patch exposes PCIe PASID capability to guest for assigned devices. > > Existing vfio_pci driver hides it from guest by setting the capability > > length as 0 in pci_ext_cap_length[]. > > This exposes the PASID capability, but it's still read-only, so this largely just helps > userspace know where to emulate the capability, right? Thanks, oh, yes. This patch only makes it visible to userspace. Perhaps I should refine the commit message and the patch name. right? Regards, Yi Liu > Alex > > > And this patch only exposes PASID capability for devices which has > > PCIe PASID extended structure in its configuration space. VFs will not > > expose the PASID capability as they do not implement the PASID > > extended structure in their config space. It is a TODO in future. > > Related discussion can be found in below link: > > > > https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/ > > > > Cc: Kevin Tian <kevin.tian@intel.com> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Cc: Alex Williamson <alex.williamson@redhat.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Cc: Joerg Roedel <joro@8bytes.org> > > Cc: Lu Baolu <baolu.lu@linux.intel.com> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > Reviewed-by: Eric Auger <eric.auger@redhat.com> > > --- > > v5 -> v6: > > *) add review-by from Eric Auger.
> > > > v1 -> v2: > > *) added in v2, but it was sent in a separate patch series before > > --- > > drivers/vfio/pci/vfio_pci_config.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/vfio/pci/vfio_pci_config.c > > b/drivers/vfio/pci/vfio_pci_config.c > > index d98843f..07ff2e6 100644 > > --- a/drivers/vfio/pci/vfio_pci_config.c > > +++ b/drivers/vfio/pci/vfio_pci_config.c > > @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + > 1] = { > > [PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF, > > [PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */ > > [PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */ > > - [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */ > > + [PCI_EXT_CAP_ID_PASID] = PCI_EXT_CAP_PASID_SIZEOF, > > }; > > > > /*
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, September 12, 2020 4:17 AM > > On Thu, 10 Sep 2020 03:45:20 -0700 > Liu Yi L <yi.l.liu@intel.com> wrote: > > > This patch exports iommu nesting capability info to user space through > > VFIO. Userspace is expected to check this info for supported uAPIs (e.g. > > PASID alloc/free, bind page table, and cache invalidation) and the > > vendor specific format information for first level/stage page table > > that will be bound to. > > > > The nesting info is available only after container set to be NESTED type. > > Current implementation imposes one limitation - one nesting container > > should include at most one iommu group. The philosophy of vfio > > container is having all groups/devices within the container share the > > same IOMMU context. When vSVA is enabled, one IOMMU context could > > include one 2nd- level address space and multiple 1st-level address > > spaces. While the 2nd-level address space is reasonably sharable by > > multiple groups, blindly sharing 1st-level address spaces across all > > groups within the container might instead break the guest expectation. > > In the future sub/super container concept might be introduced to allow > > partial address space sharing within an IOMMU context. But for now > > let's go with this restriction by requiring singleton container for > > using nesting iommu features. Below link has the related discussion about this > decision. > > > > https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/ > > > > This patch also changes the NESTING type container behaviour. > > Something that would have succeeded before will now fail: Before this > > series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have > > succeeded even if the SMMU didn't support stage-2, as the driver would > > have silently fallen back on stage-1 mappings (which work exactly the > > same as stage-2 only since there was no nesting supported). 
After the > > series, we do check for DOMAIN_ATTR_NESTING so if user asks for > > VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the > > ioctl fails. But it should be a good fix and completely harmless. Detail can be found > in below link as well. > > > > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/ > > > > Cc: Kevin Tian <kevin.tian@intel.com> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Cc: Alex Williamson <alex.williamson@redhat.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Cc: Joerg Roedel <joro@8bytes.org> > > Cc: Lu Baolu <baolu.lu@linux.intel.com> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > --- > > v6 -> v7: > > *) using vfio_info_add_capability() for adding nesting cap per suggestion > > from Eric. > > > > v5 -> v6: > > *) address comments against v5 from Eric Auger. > > *) don't report nesting cap to userspace if the nesting_info->format is > > invalid. > > > > v4 -> v5: > > *) address comments from Eric Auger. > > *) return struct iommu_nesting_info for > VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as > > cap is much "cheap", if needs extension in future, just define another cap. > > https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/ > > > > v3 -> v4: > > *) address comments against v3. 
> > > > v1 -> v2: > > *) added in v2 > > --- > > drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++- > ----- > > include/uapi/linux/vfio.h | 19 +++++++++ > > 2 files changed, 99 insertions(+), 12 deletions(-) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > b/drivers/vfio/vfio_iommu_type1.c index c992973..3c0048b 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit, > > "Maximum number of user DMA mappings per container (65535)."); > > > > struct vfio_iommu { > > - struct list_head domain_list; > > - struct list_head iova_list; > > - struct vfio_domain *external_domain; /* domain for external user */ > > - struct mutex lock; > > - struct rb_root dma_list; > > - struct blocking_notifier_head notifier; > > - unsigned int dma_avail; > > - uint64_t pgsize_bitmap; > > - bool v2; > > - bool nesting; > > - bool dirty_page_tracking; > > - bool pinned_page_dirty_scope; > > + struct list_head domain_list; > > + struct list_head iova_list; > > + /* domain for external user */ > > + struct vfio_domain *external_domain; > > + struct mutex lock; > > + struct rb_root dma_list; > > + struct blocking_notifier_head notifier; > > + unsigned int dma_avail; > > + uint64_t pgsize_bitmap; > > + bool v2; > > + bool nesting; > > + bool dirty_page_tracking; > > + bool pinned_page_dirty_scope; > > + struct iommu_nesting_info *nesting_info; > > Nit, not as important as the previous alignment, but might as well move this up with > the uint64_t pgsize_bitmap with the bools at the end of the structure to avoid adding > new gaps. got it. 
:-) > > > }; > > > > struct vfio_domain { > > @@ -130,6 +132,9 @@ struct vfio_regions { > > #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \ > > (!list_empty(&iommu->domain_list)) > > > > +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \ > > + (!list_empty(&(iommu)->domain_list))) > > + > > #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) / > BITS_PER_BYTE) > > > > /* > > @@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct > > vfio_iommu *iommu, > > > > list_splice_tail(iova_copy, iova); > > } > > + > > +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu) > > +{ > > + kfree(iommu->nesting_info); > > + iommu->nesting_info = NULL; > > +} > > + > > static int vfio_iommu_type1_attach_group(void *iommu_data, > > struct iommu_group *iommu_group) > { @@ -2022,6 +2034,12 @@ > > static int vfio_iommu_type1_attach_group(void *iommu_data, > > } > > } > > > > + /* Nesting type container can include only one group */ > > + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) { > > + mutex_unlock(&iommu->lock); > > + return -EINVAL; > > + } > > + > > group = kzalloc(sizeof(*group), GFP_KERNEL); > > domain = kzalloc(sizeof(*domain), GFP_KERNEL); > > if (!group || !domain) { > > @@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void > *iommu_data, > > if (ret) > > goto out_domain; > > > > + /* Nesting cap info is available only after attaching */ > > + if (iommu->nesting) { > > + int size = sizeof(struct iommu_nesting_info); > > + > > + iommu->nesting_info = kzalloc(size, GFP_KERNEL); > > + if (!iommu->nesting_info) { > > + ret = -ENOMEM; > > + goto out_detach; > > + } > > + > > + /* Now get the nesting info */ > > + iommu->nesting_info->argsz = size; > > + ret = iommu_domain_get_attr(domain->domain, > > + DOMAIN_ATTR_NESTING, > > + iommu->nesting_info); > > + if (ret) > > + goto out_detach; > > + } > > + > > /* Get aperture info */ > > iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, > 
&geo); > > > > @@ -2201,6 +2238,7 @@ static int vfio_iommu_type1_attach_group(void > *iommu_data, > > return 0; > > > > out_detach: > > + vfio_iommu_release_nesting_info(iommu); > > vfio_iommu_detach_group(domain, group); > > out_domain: > > iommu_domain_free(domain->domain); > > @@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void > *iommu_data, > > vfio_iommu_unmap_unpin_all(iommu); > > else > > > vfio_iommu_unmap_unpin_reaccount(iommu); > > + > > + vfio_iommu_release_nesting_info(iommu); > > } > > iommu_domain_free(domain->domain); > > list_del(&domain->next); > > @@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct > vfio_iommu *iommu, > > return vfio_info_add_capability(caps, &cap_mig.header, > > sizeof(cap_mig)); } > > > > +static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu, > > + struct vfio_info_cap *caps) { > > + struct vfio_iommu_type1_info_cap_nesting nesting_cap; > > + size_t size; > > + > > + /* when nesting_info is null, no need to go further */ > > + if (!iommu->nesting_info) > > + return 0; > > + > > + /* when @format of nesting_info is 0, fail the call */ > > + if (iommu->nesting_info->format == 0) > > + return -ENOENT; > > > Should we fail this in the attach_group? Seems the user would be in a bad situation > here if they successfully created a nesting container but can't get info. Is there > backwards compatibility we're trying to maintain with this? agreed. fail it in attach_group would be better. 
> > + > > + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) + > > + iommu->nesting_info->argsz; > > + > > + nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING; > > + nesting_cap.header.version = 1; > > + > > + memcpy(&nesting_cap.info, iommu->nesting_info, > > + iommu->nesting_info->argsz); > > + > > + return vfio_info_add_capability(caps, &nesting_cap.header, size); } > > + > > static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu, > > unsigned long arg) > > { > > @@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct > vfio_iommu *iommu, > > if (!ret) > > ret = vfio_iommu_iova_build_caps(iommu, &caps); > > > > + ret = vfio_iommu_add_nesting_cap(iommu, &caps); > > Why don't we follow either the naming scheme or the error handling scheme of the > previous caps? Seems like this should be: > > if (!ret) > ret = vfio_iommu_nesting_build_caps(...); got it. should follow the error handling scheme and also the naming. will do it. Regards, Yi Liu > Thanks, > > Alex > > > > + > > mutex_unlock(&iommu->lock); > > > > if (ret) > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index 9204705..ff40f9e 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -14,6 +14,7 @@ > > > > #include <linux/types.h> > > #include <linux/ioctl.h> > > +#include <linux/iommu.h> > > > > #define VFIO_API_VERSION 0 > > > > @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration { > > __u64 max_dirty_bitmap_size; /* in bytes */ > > }; > > > > +/* > > + * The nesting capability allows to report the related capability > > + * and info for nesting iommu type. > > + * > > + * The structures below define version 1 of this capability. > > + * > > + * Nested capabilities should be checked by the userspace after > > + * setting VFIO_TYPE1_NESTING_IOMMU. > > + * > > + * @info: the nesting info provided by IOMMU driver. 
> > + */ > > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3 > > + > > +struct vfio_iommu_type1_info_cap_nesting { > > + struct vfio_info_cap_header header; > > + struct iommu_nesting_info info; > > +}; > > + > > #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) > > > > /**
On 2020/9/10 下午6:45, Liu Yi L wrote: > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on > Intel platforms allows address space sharing between device DMA and > applications. SVA can reduce programming complexity and enhance security. > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing > guest application address space with passthru devices. This is called > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU > changes. For IOMMU and QEMU changes, they are in separate series (listed > in the "Related series"). > > The high-level architecture for SVA virtualization is as below, the key > design of vSVA support is to utilize the dual-stage IOMMU translation ( > also known as IOMMU nesting translation) capability in host IOMMU. > > > .-------------. .---------------------------. > | vIOMMU | | Guest process CR3, FL only| > | | '---------------------------' > .----------------/ > | PASID Entry |--- PASID cache flush - > '-------------' | > | | V > | | CR3 in GPA > '-------------' > Guest > ------| Shadow |--------------------------|-------- > v v v > Host > .-------------. .----------------------. > | pIOMMU | | Bind FL for GVA-GPA | > | | '----------------------' > .----------------/ | > | PASID Entry | V (Nested xlate) > '----------------\.------------------------------. > | ||SL for GPA-HPA, default domain| > | | '------------------------------' > '-------------' > Where: > - FL = First level/stage one page tables > - SL = Second level/stage two page tables > > Patch Overview: > 1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003, 0015 , 0016) > 2. vfio support for PASID allocation and free for VMs (patch 0004, 0005, 0007) > 3. a fix to a revisit in intel iommu driver (patch 0006) > 4. vfio support for binding guest page table to host (patch 0008, 0009, 0010) > 5. vfio support for IOMMU cache invalidation from VMs (patch 0011) > 6. 
vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012) > 7. expose PASID capability to VM (patch 0013) > 8. add doc for VFIO dual stage control (patch 0014) If it's possible, I would suggest a generic uAPI instead of a VFIO specific one. Jason suggested something like /dev/sva. There will be a lot of other subsystems that could benefit from this (e.g. vDPA). Have you ever considered this approach? Thanks
> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, September 14, 2020 12:20 PM
>
> On 2020/9/10 6:45 PM, Liu Yi L wrote:
> > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > Intel platforms allows address space sharing between device DMA and
> > applications. SVA can reduce programming complexity and enhance
> security.
> > [...]
>
> If it's possible, I would suggest a generic uAPI instead of a VFIO
> specific one.
>
> Jason suggest something like /dev/sva. There will be a lot of other
> subsystems that could benefit from this (e.g vDPA).

Just be curious. When does vDPA subsystem plan to support vSVA and
when could one expect a SVA-capable vDPA device in market?

Thanks
Kevin
On 2020/9/14 4:01 PM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Monday, September 14, 2020 12:20 PM
>>
>> [...]
>>
>> If it's possible, I would suggest a generic uAPI instead of a VFIO
>> specific one.
>>
>> Jason suggest something like /dev/sva. There will be a lot of other
>> subsystems that could benefit from this (e.g vDPA).
>>
> Just be curious. When does vDPA subsystem plan to support vSVA and
> when could one expect a SVA-capable vDPA device in market?
>
> Thanks
> Kevin

vSVA is in the plan but there's no ETA. I think we might start the work
after control vq support. It will probably start from SVA first and then
vSVA (since it might require platform support).

For the device part, it really depends on the chipset and other device
vendors. We plan to do the prototype in virtio by introducing PASID
support in the spec.

Thanks
> From: Jason Wang
> Sent: Monday, September 14, 2020 4:57 PM
>
> [...]
>
> vSVA is in the plan but there's no ETA. I think we might start the work
> after control vq support. It will probably start from SVA first and then
> vSVA (since it might require platform support).
>
> For the device part, it really depends on the chipset and other device
> vendors. We plan to do the prototype in virtio by introducing PASID
> support in the spec.

Thanks for the info. Then here is my thought.

First, I don't think /dev/sva is the right interface. Once we start
considering such a generic uAPI, it had better behave as the one interface
for all kinds of DMA requirements on device/subdevice passthrough. Nested
page tables through vSVA are one way; manual map/unmap is another. It
doesn't make sense to have one through a generic uAPI and the other through
a subsystem-specific uAPI. In the end the interface might become
/dev/iommu, for delegating certain IOMMU operations to userspace.

In addition, delegated IOMMU operations have different scopes. PASID
allocation is per process/VM. pgtbl-bind/unbind, map/unmap and cache
invalidation are per iommu domain. Page request/response are per
device/subdevice. This requires the uAPI to also understand and manage the
association between domain/group/device/subdevice (such as group
attach/detach), instead of doing it separately in VFIO or vDPA as today.

Based on the above, I feel a more reasonable way is to first make a
/dev/iommu uAPI supporting DMA map/unmap usages and then introduce vSVA to
it. Doing it in this order is because DMA map/unmap is widely used and can
thus better help verify the core logic with many existing devices. For
vSVA, vDPA support has not been started while VFIO support is close to
being accepted. It doesn't make much sense to block the VFIO part until
vDPA is ready for wide verification and /dev/iommu is mature enough. Yes,
the newly-added uAPIs will finally be deprecated when /dev/iommu starts to
support vSVA, but using /dev/iommu will anyway deprecate some existing VFIO
IOMMU uAPIs at that time...

Thanks
Kevin
On Mon, Sep 14, 2020 at 10:38:10AM +0000, Tian, Kevin wrote:
> is widely used thus can better help verify the core logic with
> many existing devices. For vSVA, vDPA support has not be started
> while VFIO support is close to be accepted. It doesn't make much
> sense by blocking the VFIO part until vDPA is ready for wide

You keep saying that, but if we keep ignoring the right architecture
we end up with a mess inside VFIO just to save some development time.

That is usually not how the kernel process works.

Jason
On Mon, Sep 14, 2020 at 12:20:10PM +0800, Jason Wang wrote:
>
> On 2020/9/10 6:45 PM, Liu Yi L wrote:
> > [...]
> > Patch Overview:
> > 1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003, 0015 , 0016)
> > 2. vfio support for PASID allocation and free for VMs (patch 0004, 0005, 0007)
> > 3. a fix to a revisit in intel iommu driver (patch 0006)
> > 4. vfio support for binding guest page table to host (patch 0008, 0009, 0010)
> > 5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
> > 6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
> > 7. expose PASID capability to VM (patch 0013)
> > 8. add doc for VFIO dual stage control (patch 0014)
>
> If it's possible, I would suggest a generic uAPI instead of a VFIO specific
> one.

A large part of this work is already generic uAPI, in
include/uapi/linux/iommu.h. This patchset connects that generic interface
to the pre-existing VFIO uAPI that deals with IOMMU mappings of an
assigned device. But the bulk of the work is done by the IOMMU subsystem,
and is available to all device drivers.

> Jason suggest something like /dev/sva. There will be a lot of other
> subsystems that could benefit from this (e.g vDPA).

Do you have a more precise idea of the interface /dev/sva would provide,
how it would interact with VFIO and others? vDPA could transport the
generic iommu.h structures via its own uAPI, and call the IOMMU API
directly without going through an intermediate /dev/sva handle.

Thanks,
Jean

> Have you ever considered this approach?
>
> Thanks
On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).
>
> Do you have a more precise idea of the interface /dev/sva would provide,
> how it would interact with VFIO and others? vDPA could transport the
> generic iommu.h structures via its own uAPI, and call the IOMMU API
> directly without going through an intermediate /dev/sva handle.

Prior to PASID, IOMMU really only makes sense as part of vfio-pci, because
the iommu can only key on the BDF. That can't work unless the whole PCI
function can be assigned. It is hard to see how a shared PCI device can
work with the IOMMU like this, so may as well use vfio.

SVA and various vIOMMU models change this: a shared PCI driver can
absolutely work with a PASID that is assigned to a VM safely, and actually
doesn't need to know if its PASID maps a mm_struct or something else.

So, some /dev/sva is another way to allocate a PASID that is not 1:1 with
mm_struct, as the existing SVA stuff enforces. ie it is a way to program
the DMA address map of the PASID. This new PASID allocator would match the
guest memory layout and support the IOMMU nesting stuff needed for vPASID.

This is the common code for the complex cases of virtualization with
PASID, shared by all user DMA drivers, including VFIO. It doesn't make a
lot of sense to build a uAPI exclusive to VFIO just for PASID and vPASID.
We already know everything doing user DMA will eventually need this stuff.

Jason
Hi Jason,

On Mon, Sep 14, 2020 at 10:47:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> > [...]
>
> Prior to PASID IOMMU really only makes sense as part of vfio-pci
> because the iommu can only key on the BDF. That can't work unless the
> whole PCI function can be assigned. It is hard to see how a shared PCI
> device can work with IOMMU like this, so may as well use vfio.
>
> SVA and various vIOMMU models change this, a shared PCI driver can
> absoultely work with a PASID that is assigned to a VM safely, and
> actually don't need to know if their PASID maps a mm_struct or
> something else.

Well, the IOMMU does care if it's a native mm_struct or something that
belongs to a guest, because you need the ability to forward page requests
and pick up page responses from the guest. Since there is just one PRQ on
the IOMMU and responses can't be sent directly, you have to depend on a
vIOMMU-type interface in the guest to make all of this magic work, right?

> So, some /dev/sva is another way to allocate a PASID that is not 1:1
> with mm_struct, as the existing SVA stuff enforces. ie it is a way to
> program the DMA address map of the PASID.
>
> This new PASID allocator would match the guest memory layout and

Not sure what you mean by "match guest memory layout"?
Probably meaning the first level is gVA or gIOVA?

Cheers,
Ashok
On Mon, Sep 14, 2020 at 09:22:47AM -0700, Raj, Ashok wrote:
> Hi Jason,
>
> [...]
>
> Well, IOMMU does care if its a native mm_struct or something that belongs
> to guest. Because you need ability to forward page-requests and pickup
> page-responses from guest. Since there is just one PRQ on the IOMMU and
> responses can't be sent directly. You have to depend on vIOMMU type
> interface in guest to make all of this magic work right?

Yes, the IOMMU cares, but not the PCI driver. It just knows it has a
PASID. Details of how page faulting is handled or how the mapping is set
up are abstracted by the PASID.

> > This new PASID allocator would match the guest memory layout and
>
> Not sure what you mean by "match guest memory layout"?
> Probably, meaning first level is gVA or gIOVA?

It means whatever the qemu/viommu/guest/etc needs across all the
IOMMU/arch implementations.

Basically, there should only be two ways to get a PASID:
 - From an mm_struct that mirrors the creating process
 - Via '/dev/sva', which has a complete interface to create and
   control a PASID suitable for virtualization and more

VFIO should not have its own special way to get a PASID.

Jason
On Mon, 14 Sep 2020 13:33:54 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> [...]
>
> Basically, there should only be two ways to get a PASID
> - From mm_struct that mirrors the creating process
> - Via '/dev/sva' which has an complete interface to create and
>   control a PASID suitable for virtualization and more
>
> VFIO should not have its own special way to get a PASID.

"Its own special way" is arguable; VFIO is just making use of what's
being proposed as the uapi via its existing IOMMU interface. PASIDs are
also a system resource, so we require some degree of access control and
quotas for management of PASIDs. Does libvirt now get involved to know
whether an assigned device requires PASIDs such that access to this dev
file is provided to QEMU? How does the kernel validate usage or implement
quotas when disconnected from device ownership? PASIDs would be an
obvious DoS path if any user can create arbitrary allocations.

If we can move code out of VFIO, I'm all for it, but I think it needs to
be better defined than "implement magic universal sva uapi interface"
before we can really consider it. Thanks,

Alex
On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
> "its own special way" is arguable, VFIO is just making use of what's
> being proposed as the uapi via its existing IOMMU interface.

I mean, if we have a /dev/sva then it makes no sense to extend the VFIO
interfaces with the same stuff. VFIO should simply accept a PASID created
from /dev/sva and use it just like any other user-DMA driver would.

> are also a system resource, so we require some degree of access control
> and quotas for management of PASIDs.

This has already happened; the SVA patches generally allow unprivileged
user space to allocate a PASID for its process.

If a device implements an mdev shared with a kernel driver (like IDXD)
then it will be sharing that PASID pool across both drivers. In this case
it makes no sense for VFIO to have PASID quota logic because it has an
incomplete view. It could only make sense if VFIO were the exclusive
owner of the bus/device/function.

The tracking logic needs to be global.. most probably in some kind of
PASID cgroup controller?

> know whether an assigned device requires PASIDs such that access to
> this dev file is provided to QEMU?

Wouldn't QEMU just open /dev/sva if it needs it, like other dev files?
Why would it need something special?

> would be an obvious DoS path if any user can create arbitrary
> allocations. If we can move code out of VFIO, I'm all for it, but I
> think it needs to be better defined than "implement magic universal sva
> uapi interface" before we can really consider it. Thanks,

Jason began by saying vDPA will need this too, and I agree with him.

I'm not sure why it would be "magic". This series already gives a pretty
solid blueprint for what the interface would need to have. Interested
folks need to sit down and talk about it, not just default everything to
being built inside VFIO.

Jason
On Mon, 14 Sep 2020 14:41:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
> > "its own special way" is arguable, VFIO is just making use of what's
> > being proposed as the uapi via its existing IOMMU interface.
>
> I mean, if we have a /dev/sva then it makes no sense to extend the
> VFIO interfaces with the same stuff. VFIO should simply accept a PASID
> created from /dev/sva and use it just like any other user-DMA driver
> would.

I don't think that's absolutely true. By the same logic, we could say
that pci-sysfs provides access to PCI BAR and config space resources, the
VFIO device interface duplicates part of that interface, therefore it
should be abandoned. But in reality, VFIO providing access to those
resources puts those accesses within the scope and control of the VFIO
interface. Ownership of a device through vfio is provided by allowing the
user access to the vfio group dev file, not by the group file, plus some
number of resource files, and the config file, and running with admin
permissions to see the full extent of config space. Reserved ranges for
the IOMMU are also provided via sysfs, but VFIO includes a capability on
the IOMMU get_info ioctl for the user to learn about available IOVA
ranges w/o scraping through sysfs.

> > are also a system resource, so we require some degree of access control
> > and quotas for management of PASIDs.
>
> This has already happened, the SVA patches generally allow unpriv user
> space to allocate a PASID for their process.
>
> If a device implements a mdev shared with a kernel driver (like IDXD)
> then it will be sharing that PASID pool across both drivers. In this
> case it makes no sense that VFIO has PASID quota logic because it has
> an incomplete view. It could only make sense if VFIO is the exclusive
> owner of the bus/device/function.
>
> The tracking logic needs to be global.. Most probably in some kind of
> PASID cgroup controller?

AIUI, that doesn't exist yet, so it makes sense that VFIO, as the
mechanism through which a user would allocate a PASID, implements a
reasonable quota to avoid an unprivileged user exhausting the address
space. Also, "unprivileged user" is a bit of a misnomer in this context,
as the VFIO user must be privileged with ownership of a device before
they can even participate in PASID allocation. Is truly unprivileged
access reasonable for a limited resource?

> > know whether an assigned device requires PASIDs such that access to
> > this dev file is provided to QEMU?
>
> Wouldn't QEMU just open /dev/sva if it needs it? Like other dev files?
> Why would it need something special?

QEMU typically runs in a sandbox with limited access; when a device or
mdev is assigned to a VM, file permissions are configured to allow that
access. QEMU doesn't get to poke at any random dev file it likes --
that's part of how userspace reduces the potential attack surface.

> > would be an obvious DoS path if any user can create arbitrary
> > allocations. If we can move code out of VFIO, I'm all for it, but I
> > think it needs to be better defined than "implement magic universal sva
> > uapi interface" before we can really consider it. Thanks,
>
> Jason began by saying VDPA will need this too, I agree with him.
>
> I'm not sure why it would be "magic"? This series already gives a
> pretty solid blueprint for what the interface would need to
> have. Interested folks need to sit down and talk about it not just
> default everything to being built inside VFIO.

This series is a blueprint within the context of the ownership and
permission model that VFIO already provides. It doesn't seem like we can
pluck that out on its own, nor is it necessarily the case that VFIO
wouldn't want to provide PASID services within its own API even if we did
have this undefined /dev/sva interface.

Thanks,
Alex
On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote:
> I don't think that's absolutely true. By the same logic, we could say
> that pci-sysfs provides access to PCI BAR and config space
> resources,

No, it is the reverse: VFIO is a better version of pci-sysfs, so
pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva
would be the superset interface for PASID, so whatever VFIO has would
be obsoleted.

It would be very unusual for the kernel to have two 'preferred'
interfaces for the same thing, IMHO. The review process for uAPI
should really prevent that by allowing all interests to be served
while the uAPI is designed.

> the VFIO device interface duplicates part of that interface therefore it
> should be abandoned. But in reality, VFIO providing access to those
> resources puts those accesses within the scope and control of the VFIO
> interface.

Not clear to me why VFIO needs that. PASID seems quite orthogonal to
VFIO to me.

> > This has already happened, the SVA patches generally allow unpriv user
> > space to allocate a PASID for their process.
> > [...]
> > The tracking logic needs to be global.. Most probably in some kind of
> > PASID cgroup controller?
>
> AIUI, that doesn't exist yet, so it makes sense that VFIO, as the
> mechanism through which a user would allocate a PASID,

VFIO is not the exclusive user interface for PASID. Other SVA drivers
will allocate PASIDs. Any quota has to be implemented by the IOMMU
layer, and shared across all drivers.

> space. Also, "unprivileged user" is a bit of a misnomer in this
> context as the VFIO user must be privileged with ownership of a device
> before they can even participate in PASID allocation. Is truly
> unprivileged access reasonable for a limited resource?

I'm not talking about VFIO, I'm talking about the other SVA drivers. I
expect some of them will be unpriv safe, like IDXD, for instance. Some
way to manage the limited PASID resource will be necessary beyond just
VFIO.

> QEMU typically runs in a sandbox with limited access, when a device or
> mdev is assigned to a VM, file permissions are configured to allow that
> access. QEMU doesn't get to poke at any random dev file it likes,
> that's part of how userspace reduces the potential attack surface.

Plumbing the exact same APIs through VFIO's uAPI vs /dev/sva doesn't
reduce the attack surface. qemu can simply include /dev/sva in the
sandbox when using VFIO, with no increase in attack surface from this
proposed series.

> This series is a blueprint within the context of the ownership and
> permission model that VFIO already provides. It doesn't seem like we
> can pluck that out on its own, nor is it necessarily the case that VFIO
> wouldn't want to provide PASID services within its own API even if we
> did have this undefined /dev/sva interface.

I don't see what you do -- VFIO does not own PASID, and in this
vfio-mdev mode it does not own the PCI device/IOMMU either. So why
would this need to be part of the VFIO ownership and permission model?

Jason
On Mon, 14 Sep 2020 16:00:57 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote: > > On Mon, 14 Sep 2020 14:41:21 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote: > > > > > > > "its own special way" is arguable, VFIO is just making use of what's > > > > being proposed as the uapi via its existing IOMMU interface. > > > > > > I mean, if we have a /dev/sva then it makes no sense to extend the > > > VFIO interfaces with the same stuff. VFIO should simply accept a PASID > > > created from /dev/sva and use it just like any other user-DMA driver > > > would. > > > > I don't think that's absolutely true. By the same logic, we could say > > that pci-sysfs provides access to PCI BAR and config space > > resources, > > No, it is the reverse, VFIO is a better version of pci-sysfs, so > pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva > would be the superset interface for PASID, so whatver VFIO has would > be obsoleted. > > It would be very unusual for the kernel to have to 'preferred' > interfaces for the same thing, IMHO. The review process for uAPI > should really prevent that by allowing all interests to be served > while the uAPI is designed. > > > the VFIO device interface duplicates part of that interface therefore it > > should be abandoned. But in reality, VFIO providing access to those > > resources puts those accesses within the scope and control of the VFIO > > interface. > > Not clear to my why VFIO needs that. PASID seems quite orthogonal from > VFIO to me. Can you explain that further, or spit-ball what you think this /dev/sva interface looks like and how a user might interact between vfio and this new interface? The interface proposed here definitely does not seem orthogonal to the vfio IOMMU interface, ie. 
selecting a specific IOMMU domain mode during vfio setup, allocating pasids and associating them with page tables for that two-stage IOMMU setup, performing cache invalidations based on page table updates, etc. How does it make more sense for a vIOMMU to setup some aspects of the IOMMU through vfio and others through a TBD interface? > > > This has already happened, the SVA patches generally allow unpriv user > > > space to allocate a PASID for their process. > > > > > > If a device implements a mdev shared with a kernel driver (like IDXD) > > > then it will be sharing that PASID pool across both drivers. In this > > > case it makes no sense that VFIO has PASID quota logic because it has > > > an incomplete view. It could only make sense if VFIO is the exclusive > > > owner of the bus/device/function. > > > > > > The tracking logic needs to be global.. Most probably in some kind of > > > PASID cgroup controller? > > > > AIUI, that doesn't exist yet, so it makes sense that VFIO, as the > > mechanism through which a user would allocate a PASID, > > VFIO is not the exclusive user interface for PASID. Other SVA drivers > will allocate PASIDs. Any quota has to be implemented by the IOMMU > layer, and shared across all drivers. The IOMMU needs to allocate PASIDs, so in that sense it enforces a quota via the architectural limits, but is the IOMMU layer going to distinguish in-kernel versus user limits? A cgroup limit seems like a good idea, but that's not really at the IOMMU layer either and I don't see that a /dev/sva and vfio interface couldn't both support a cgroup type quota. > > space. Also, "unprivileged user" is a bit of a misnomer in this > > context as the VFIO user must be privileged with ownership of a device > > before they can even participate in PASID allocation. Is truly > > unprivileged access reasonable for a limited resource? > > I'm not talking about VFIO, I'm talking about the other SVA drivers. 
I > expect some of them will be unpriv safe, like IDXD, for > instance. > > Some way to manage the limited PASID resource will be necessary beyond > just VFIO. And it's not clear that they'll have compatible requirements. A userspace idxd driver might have limited needs versus a vIOMMU backend. Does a single quota model adequately support both or are we back to the differences between access to a device and ownership of a device? Maybe a single pasid per user makes sense in the former. If we could bring this discussion to some sort of more concrete proposal it might be easier to weigh the choices. > > QEMU typically runs in a sandbox with limited access, when a device or > > mdev is assigned to a VM, file permissions are configured to allow that > > access. QEMU doesn't get to poke at any random dev file it likes, > > that's part of how userspace reduces the potential attack surface. > > Plumbing the exact same APIs through VFIO's uAPI vs /dev/sva doesn't > reduce the attack surface. qemu can simply include /dev/sva in the > sandbox when using VFIO with no increase in attack surface from this > proposed series. APIs confined to the ownership model that vfio already enforces might absolutely present a more limited attack surface than some new interface intended to provide universal sva resource access. We don't know until we see it. The real argument would be whether we have a more hardened interface due to more review from more users. > > This series is a blueprint within the context of the ownership and > > permission model that VFIO already provides. It doesn't seem like we > > can pluck that out on its own, nor is it necessarily the case that VFIO > > wouldn't want to provide PASID services within its own API even if we > > did have this undefined /dev/sva interface. > > I don't see what you do - VFIO does not own PASID, and in this > vfio-mdev mode it does not own the PCI device/IOMMU either. 
So why > would this need to be part of the VFIO owernship and permission model? Doesn't the PASID model essentially just augment the requester ID IOMMU model so as to manage the IOVAs for a subdevice of a RID? The vfio model builds on a user's access to a vfio group to entitle them to allocate IOMMU resources, or in this case PASIDs. What elevates a user to be able to allocate such resources in this new proposal? Do they need a device at all? It's not clear to me why RID based IOMMU management fits within vfio's scope, but PASID based does not. Seems like that would chip away at aux domains in general. Thanks, Alex _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Hi Jason, I thought we discussed this at LPC, but still seems to be going in circles :-(. On Mon, Sep 14, 2020 at 04:00:57PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote: > > On Mon, 14 Sep 2020 14:41:21 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote: > > > > > > > "its own special way" is arguable, VFIO is just making use of what's > > > > being proposed as the uapi via its existing IOMMU interface. > > > > > > I mean, if we have a /dev/sva then it makes no sense to extend the > > > VFIO interfaces with the same stuff. VFIO should simply accept a PASID > > > created from /dev/sva and use it just like any other user-DMA driver > > > would. > > > > I don't think that's absolutely true. By the same logic, we could say > > that pci-sysfs provides access to PCI BAR and config space > > resources, > > No, it is the reverse, VFIO is a better version of pci-sysfs, so > pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva > would be the superset interface for PASID, so whatever VFIO has would > be obsoleted. As you had suggested earlier in the mail thread, could Jason Wang maybe build out what it takes to have a full-fledged /dev/sva interface for vDPA and figure out how the interfaces should emerge? Otherwise it appears everyone is talking very high level, with a limited understanding of how things work at the moment. As Kevin pointed out there are several aspects, and a real prototype from interested people would be the best way to understand the easy/hard aspects of moving between the proposals.

- PASID allocation and life cycle management: managing both 1-1 (as it's done today) and also supporting a guest PASID space. (Supporting a guest PASID range is required for migration, I suppose.)
- Page request processing.
- Interaction with vIOMMU: vSVA requires vIOMMU for supporting invalidations, forwarding PRQs and such.
- Supporting ENQCMD in guest. (Today it's just in Intel products, but it's also submitted to the PCIe SIG) and if you are a member you should be able to see that. FWIW, it might already be open for public review; if not now, maybe pretty soon. For Intel we have some KVM interaction setting up the guest pasid->host pasid interfaces. This has to move ahead of these email discussions, hoping someone with the right ideas would help move this forward. Cheers, Ashok _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, September 12, 2020 4:55 AM > > On Thu, 10 Sep 2020 03:45:21 -0700 > Liu Yi L <yi.l.liu@intel.com> wrote: > > > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing > > multiple process virtual address spaces with the device for simplified > > programming model. PASID is used to tag an virtual address space in > > DMA requests and to identify the related translation structure in > > IOMMU. When a PASID-capable device is assigned to a VM, we want the > > same capability of using PASID to tag guest process virtual address > > spaces to achieve virtual SVA (vSVA). > > > > PASID management for guest is vendor specific. Some vendors (e.g. > > Intel > > VT-d) requires system-wide managed PASIDs across all devices, > > regardless of whether a device is used by host or assigned to guest. > > Other vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus > > could be fully delegated to the guest for assigned devices. > > > > For system-wide managed PASIDs, this patch introduces a vfio module to > > handle explicit PASID alloc/free requests from guest. Allocated PASIDs > > are associated to a process (or, mm_struct) in IOASID core. A vfio_mm > > object is introduced to track mm_struct. Multiple VFIO containers > > within a process share the same vfio_mm object. > > > > A quota mechanism is provided to prevent malicious user from > > exhausting available PASIDs. Currently the quota is a global parameter > > applied to all VFIO devices. In the future per-device quota might be supported too. 
> > > > Cc: Kevin Tian <kevin.tian@intel.com> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Cc: Joerg Roedel <joro@8bytes.org> > > Cc: Lu Baolu <baolu.lu@linux.intel.com> > > Suggested-by: Alex Williamson <alex.williamson@redhat.com> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > Reviewed-by: Eric Auger <eric.auger@redhat.com> > > --- > > v6 -> v7: > > *) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger. > > > > v5 -> v6: > > *) address comments from Eric. Add vfio_unlink_pasid() to be consistent > > with vfio_unlink_dma(). Add a comment in vfio_pasid_exit(). > > > > v4 -> v5: > > *) address comments from Eric Auger. > > *) address the comments from Alex on the pasid free range support. Added > > per vfio_mm pasid r-b tree. > > https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/ > > > > v3 -> v4: > > *) fix lock leam in vfio_mm_get_from_task() > > *) drop pasid_quota field in struct vfio_mm > > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when > > !CONFIG_VFIO_PASID > > > > v1 -> v2: > > *) added in v2, split from the pasid alloc/free support of v1 > > --- > > drivers/vfio/Kconfig | 5 + > > drivers/vfio/Makefile | 1 + > > drivers/vfio/vfio_pasid.c | 247 > ++++++++++++++++++++++++++++++++++++++++++++++ > > include/linux/vfio.h | 28 ++++++ > > 4 files changed, 281 insertions(+) > > create mode 100644 drivers/vfio/vfio_pasid.c > > > > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index > > fd17db9..3d8a108 100644 > > --- a/drivers/vfio/Kconfig > > +++ b/drivers/vfio/Kconfig > > @@ -19,6 +19,11 @@ config VFIO_VIRQFD > > depends on VFIO && EVENTFD > > default n > > > > +config VFIO_PASID > > + tristate > > + depends on IOASID && VFIO > > + default n > > + > > menuconfig VFIO > > tristate "VFIO Non-Privileged userspace driver framework" > > depends on IOMMU_API > > diff --git a/drivers/vfio/Makefile 
b/drivers/vfio/Makefile index > > de67c47..bb836a3 100644 > > --- a/drivers/vfio/Makefile > > +++ b/drivers/vfio/Makefile > > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o > > > > obj-$(CONFIG_VFIO) += vfio.o > > obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o > > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o > > obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o > > obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o > > obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git > > a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode > > 100644 index 0000000..44ecdd5 > > --- /dev/null > > +++ b/drivers/vfio/vfio_pasid.c > > @@ -0,0 +1,247 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +/* > > + * Copyright (C) 2020 Intel Corporation. > > + * Author: Liu Yi L <yi.l.liu@intel.com> > > + * > > + */ > > + > > +#include <linux/vfio.h> > > +#include <linux/file.h> > > +#include <linux/module.h> > > +#include <linux/slab.h> > > +#include <linux/sched/mm.h> > > + > > +#define DRIVER_VERSION "0.1" > > +#define DRIVER_AUTHOR "Liu Yi L <yi.l.liu@intel.com>" > > +#define DRIVER_DESC "PASID management for VFIO bus drivers" > > + > > +#define VFIO_DEFAULT_PASID_QUOTA 1000 > > I'm not sure we really need a macro to define this since it's only used once, but a > comment discussing the basis for this default value would be useful. yep, may remove the macro. 1000 is actually a value that came from an offline discussion with Jacob, and was first mentioned in the link below. Since we don't have much data to decide a default quota today, we'd like to make 1000 the default quota as a start. In the future we would give the administrator the ability to tune the quota. 
https://lore.kernel.org/kvm/A2975661238FB949B60364EF0F2C25743A0F8CB4@SHSMSX104.ccr.corp.intel.com/ > Also, since > Matthew Rosato is finding it necessary to expose the available DMA mapping > counter to userspace, is this also a limitation that userspace might be > interested in knowing such that we should plumb it through an IOMMU info > capability? agreed. it would be helpful. I'll add it. > > +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA; > > +module_param_named(pasid_quota, pasid_quota, uint, 0444); > > +MODULE_PARM_DESC(pasid_quota, > > + "Set the quota for max number of PASIDs that an application is > > +allowed to request (default 1000)"); > > + > > +struct vfio_mm_token { > > + unsigned long long val; > > +}; > > + > > +struct vfio_mm { > > + struct kref kref; > > + struct ioasid_set *ioasid_set; > > + struct mutex pasid_lock; > > + struct rb_root pasid_list; > > + struct list_head next; > > + struct vfio_mm_token token; > > +}; > > + > > +static struct mutex vfio_mm_lock; > > +static struct list_head vfio_mm_list; > > + > > +struct vfio_pasid { > > + struct rb_node node; > > + ioasid_t pasid; > > +}; > > + > > +static void vfio_remove_all_pasids(struct vfio_mm *vmm); > > + > > +/* called with vfio.vfio_mm_lock held */ > > > s/vfio.// got it. thanks for spotting it. > > > > +static void vfio_mm_release(struct kref *kref) { > > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref); > > + > > + list_del(&vmm->next); > > + mutex_unlock(&vfio_mm_lock); > > + vfio_remove_all_pasids(vmm); > > + ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set > after allocation? > > > Is the question whether each pasid should hold a reference to the set? no, I was considering whether vfio_pasid needs to hold a reference on the ioasid_set. But after checking ioasid_alloc_set(), the answer is "no" since a successful ioasid_alloc_set() calling will atomically increase the refcnt of the returned set. 
So no need to take another reference in vfio_pasid. > That really seems like a question internal to the ioasid_alloc/free, but > this FIXME needs to be resolved. I should have removed it before sending it out. sorry for the confusion. :-( > > > > + kfree(vmm); > > +} > > + > > +void vfio_mm_put(struct vfio_mm *vmm) { > > + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); } > > + > > +static void vfio_mm_get(struct vfio_mm *vmm) { > > + kref_get(&vmm->kref); > > +} > > + > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) { > > + struct mm_struct *mm = get_task_mm(task); > > + struct vfio_mm *vmm; > > + unsigned long long val = (unsigned long long)mm; > > + int ret; > > + > > + mutex_lock(&vfio_mm_lock); > > + /* Search existing vfio_mm with current mm pointer */ > > + list_for_each_entry(vmm, &vfio_mm_list, next) { > > + if (vmm->token.val == val) { > > + vfio_mm_get(vmm); > > + goto out; > > + } > > + } > > + > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL); > > + if (!vmm) { > > + vmm = ERR_PTR(-ENOMEM); > > + goto out; > > + } > > + > > + /* > > + * IOASID core provides a 'IOASID set' concept to track all > > + * PASIDs associated with a token. Here we use mm_struct as > > + * the token and create a IOASID set per mm_struct. All the > > + * containers of the process share the same IOASID set. > > + */ > > + vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota, > IOASID_SET_TYPE_MM); > > + if (IS_ERR(vmm->ioasid_set)) { > > + ret = PTR_ERR(vmm->ioasid_set); > > + kfree(vmm); > > + vmm = ERR_PTR(ret); > > + goto out; > > This would be a little less convoluted if we had a separate variable to store > ioasid_set so that we could free vmm without stashing the error in a temporary > variable. Or at least make the stash more obvious by defining the stash variable as > something like "tmp" within the scope of this branch. I see. also the "ret" is not necessary as only used only once. 
so it would be like below:

tmp = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM);
if (IS_ERR(tmp)) {
	kfree(vmm);
	vmm = ERR_PTR(PTR_ERR(tmp));
	goto out;
}
vmm->ioasid_set = tmp;

> > + } > > + > > + kref_init(&vmm->kref); > > + vmm->token.val = val; > > + mutex_init(&vmm->pasid_lock); > > + vmm->pasid_list = RB_ROOT; > > + > > + list_add(&vmm->next, &vfio_mm_list); > > +out: > > + mutex_unlock(&vfio_mm_lock); > > + mmput(mm); > > + return vmm; > > +} > > + > > +/* > > + * Find PASID within @min and @max > > + */ > > +static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm, > > + ioasid_t min, ioasid_t max) > > +{ > > + struct rb_node *node = vmm->pasid_list.rb_node; > > + > > + while (node) { > > + struct vfio_pasid *vid = rb_entry(node, > > + struct vfio_pasid, node); > > + > > + if (max < vid->pasid) > > + node = node->rb_left; > > + else if (min > vid->pasid) > > + node = node->rb_right; > > + else > > + return vid; > > + } > > + > > + return NULL; > > +} > > + > > +static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid > > +*new) { > > + struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL; > > + struct vfio_pasid *vid; > > + > > + while (*link) { > > + parent = *link; > > + vid = rb_entry(parent, struct vfio_pasid, node); > > + > > + if (new->pasid <= vid->pasid) > > + link = &(*link)->rb_left; > > + else > > + link = &(*link)->rb_right; > > + } > > + > > + rb_link_node(&new->node, parent, link); > > + rb_insert_color(&new->node, &vmm->pasid_list); } > > + > > +static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid > > +*old) { > > + rb_erase(&old->node, &vmm->pasid_list); } > > + > > +static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid > > +*vid) { > > + vfio_unlink_pasid(vmm, vid); > > + ioasid_free(vmm->ioasid_set, vid->pasid); > > + kfree(vid); > > +} > > + > > +static void vfio_remove_all_pasids(struct vfio_mm *vmm) { > > + struct rb_node *node; > > + > > + mutex_lock(&vmm->pasid_lock); > > + while ((node = 
rb_first(&vmm->pasid_list))) > > + vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node)); > > + mutex_unlock(&vmm->pasid_lock); > > +} > > + > > +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) > > I might have asked before, but why doesn't this return an ioasid_t and require > ioasid_t args? Our free function below uses an ioasid_t range, seems rather > inconsistent. yep, will fix it. > We can use a BUILD_BUG_ON if we need to test that an ioasid_t fits > within our uapi. perhaps not. vfio_pasid_alloc() should return INVALID_IOASID for allocation failure. vfio_iommu_type1 should check the return value of vfio_pasid_alloc() and return a proper result to userspace. > > > +{ > > + ioasid_t pasid; > > + struct vfio_pasid *vid; > > + > > + pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL); > > + if (pasid == INVALID_IOASID) > > + return -ENOSPC; > > + > > + vid = kzalloc(sizeof(*vid), GFP_KERNEL); > > + if (!vid) { > > + ioasid_free(vmm->ioasid_set, pasid); > > + return -ENOMEM; > > + } > > + > > + vid->pasid = pasid; > > + > > + mutex_lock(&vmm->pasid_lock); > > + vfio_link_pasid(vmm, vid); > > + mutex_unlock(&vmm->pasid_lock); > > + > > + return pasid; > > +} > > + > > +void vfio_pasid_free_range(struct vfio_mm *vmm, > > + ioasid_t min, ioasid_t max) > > +{ > > + struct vfio_pasid *vid = NULL; > > + > > + /* > > + * IOASID core will notify PASID users (e.g. IOMMU driver) to > > + * teardown necessary structures depending on the to-be-freed > > + * PASID. > > + */ > > + mutex_lock(&vmm->pasid_lock); > > + while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) > > != NULL is not necessary and isn't consistent with the same time of test in the above > rb_first() loop. got it. thanks for the guiding. 
> > > + vfio_remove_pasid(vmm, vid); > > + mutex_unlock(&vmm->pasid_lock); > > +} > > + > > +static int __init vfio_pasid_init(void) { > > + mutex_init(&vfio_mm_lock); > > + INIT_LIST_HEAD(&vfio_mm_list); > > + return 0; > > +} > > + > > +static void __exit vfio_pasid_exit(void) { > > + /* > > + * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1 > > + * and may be other module. once vfio_pasid_exit() is triggered, > > + * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed. > > + * All the vfio_mm instances should have been released. If not, > > + * means there is vfio_mm leak, should be a bug of user module. > > + * So just warn here. > > + */ > > + WARN_ON(!list_empty(&vfio_mm_list)); > > Do we need to be using try_module_get/module_put to enforce that we cannot be > removed while in use or does that already work correctly via the function references > and this is just paranoia? If we do exit, I'm not sure what good it does to keep the > remaining list entries. Thanks, I did a test before, and it's true that the module dependency is enforced via function references: a module cannot be removed before the modules that reference its functions are removed. BTW, for the WARN_ON, I followed the handling of vfio.group_list in vfio_cleanup(). 
:-) Regards, Yi Liu > Alex > > > +} > > + > > +module_init(vfio_pasid_init); > > +module_exit(vfio_pasid_exit); > > + > > +MODULE_VERSION(DRIVER_VERSION); > > +MODULE_LICENSE("GPL v2"); > > +MODULE_AUTHOR(DRIVER_AUTHOR); > > +MODULE_DESCRIPTION(DRIVER_DESC); > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h index > > 38d3c6a..31472a9 100644 > > --- a/include/linux/vfio.h > > +++ b/include/linux/vfio.h > > @@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct > > vfio_iommu_driver_ops *ops); extern void vfio_unregister_iommu_driver( > > const struct vfio_iommu_driver_ops *ops); > > > > +struct vfio_mm; > > +#if IS_ENABLED(CONFIG_VFIO_PASID) > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct > > +*task); extern void vfio_mm_put(struct vfio_mm *vmm); extern int > > +vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); extern void > > +vfio_pasid_free_range(struct vfio_mm *vmm, > > + ioasid_t min, ioasid_t max); > > +#else > > +static inline struct vfio_mm *vfio_mm_get_from_task(struct > > +task_struct *task) { > > + return ERR_PTR(-ENOTTY); > > +} > > + > > +static inline void vfio_mm_put(struct vfio_mm *vmm) { } > > + > > +static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int > > +max) { > > + return -ENOTTY; > > +} > > + > > +static inline void vfio_pasid_free_range(struct vfio_mm *vmm, > > + ioasid_t min, ioasid_t max) > > +{ > > +} > > +#endif /* CONFIG_VFIO_PASID */ > > + > > /* > > * External user API > > */ _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Mon, Sep 14, 2020 at 03:44:38PM -0700, Raj, Ashok wrote: > Hi Jason, > > I thought we discussed this at LPC, but still seems to be going in > circles :-(. We discussed mdev at LPC, not PASID. PASID applies widely to many devices and needs to be introduced with a wide community agreement so all scenarios will be supportable. > As you had suggested earlier in the mail thread could Jason Wang maybe > build out what it takes to have a full fledged /dev/sva interface for vDPA > and figure out how the interfaces should emerge? otherwise it appears > everyone is talking very high level and with that limited understanding of > how things work at the moment. You want Jason Wang to do the work to get Intel PASID support merged? Seems a bit of a strange request. > This has to move ahead of these email discussions, hoping someone with the > right ideas would help move this forward. Why not try yourself to come up with a proposal? Jason _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Mon, Sep 14, 2020 at 04:33:10PM -0600, Alex Williamson wrote: > Can you explain that further, or spit-ball what you think this /dev/sva > interface looks like and how a user might interact between vfio and > this new interface? When you open it you get some container, inside the container the user can create PASIDs. PASIDs outside that container cannot be reached. Creating a PASID, or the guest PASID range would be the entry point for doing all the operations against a PASID or range that this patch series imagines: - Map process VA mappings to the PASID's DMA virtual address space - Catch faults - Setup any special HW stuff like Intel's two level thing, ARM stuff, etc - Expose resource controls, cgroup, whatever - Migration special stuff (allocate fixed PASIDs) A PASID is a handle for an IOMMU page table, and the tools to manipulate it. Within /dev/sva the page table is just 'floating' and not linked to any PCI functions The open /dev/sva FD holding the allocated PASIDs would be passed to a kernel driver. This is a security authorization that the specified PASID can be assigned to a PCI device by the kernel. At this point the kernel driver would have the IOMMU permit its bus/device/function to use the PASID. The PASID can be passed to multiple drivers of any driver flavour so table re-use is possible. Now the IOMMU page table is linked to a device. The kernel device driver would also do the device specific programming to setup the PASID in the device, attach it to some device object and expose the device for user DMA. For instance IDXD's char dev would map the queue memory and associate the PASID with that queue and setup the HW to be ready for the new enque instruction. The IDXD mdev would link to its emulated PCI BAR and ensure the guest can only use PASID's included in the /dev/sva container. The qemu control plane for vIOMMU related to PASID would run over /dev/sva. 
I think the design could go further where a 'PASID' is just an abstract idea of a page table, then vfio-pci could consume it too as a IOMMU page table handle even though there is no actual PASID. So qemu could end up with one API to universally control the vIOMMU, an API that can be shared between subsystems and is not tied to VFIO. > allocating pasids and associating them with page tables for that > two-stage IOMMU setup, performing cache invalidations based on page > table updates, etc. How does it make more sense for a vIOMMU to > setup some aspects of the IOMMU through vfio and others through a > TBD interface? vfio's IOMMU interface is about RID based full device ownership, and fixed mappings. PASID is about mediation, shared ownership and page faulting. Does PASID overlap with the existing IOMMU RID interface beyond both are using the IOMMU? > The IOMMU needs to allocate PASIDs, so in that sense it enforces a > quota via the architectural limits, but is the IOMMU layer going to > distinguish in-kernel versus user limits? A cgroup limit seems like a > good idea, but that's not really at the IOMMU layer either and I don't > see that a /dev/sva and vfio interface couldn't both support a cgroup > type quota. It is all good questions. PASID is new, this stuff needs to be sketched out more. A lot of in-kernel users of IOMMU PASID are probably going to be triggered by userspace actions. I think a cgroup quota would end up near the IOMMU layer, so vfio, sva, and any other driver char devs would all be restricted by the cgroup as peers. > And it's not clear that they'll have compatible requirements. A > userspace idxd driver might have limited needs versus a vIOMMU backend. > Does a single quota model adequately support both or are we back to the > differences between access to a device and ownership of a device? At the end of the day a PASID is just a number and the drivers only use of it is to program it into HW. 
All these other differences deal with the IOMMU side of the PASID, how pages are mapped into it, how page fault works, etc, etc. Keeping the two concerns separated seems very clean. A device driver shouldn't care how the PASID is set up. > > > This series is a blueprint within the context of the ownership and > > > permission model that VFIO already provides. It doesn't seem like we > > > can pluck that out on its own, nor is it necessarily the case that VFIO > > > wouldn't want to provide PASID services within its own API even if we > > > did have this undefined /dev/sva interface. > > > > I don't see what you do - VFIO does not own PASID, and in this > > vfio-mdev mode it does not own the PCI device/IOMMU either. So why > > would this need to be part of the VFIO ownership and permission model? > > Doesn't the PASID model essentially just augment the requester ID IOMMU > model so as to manage the IOVAs for a subdevice of a RID? I'd say not really.. PASID is very different from RID because PASID must always be mediated by the kernel. vfio-pci doesn't know how to use PASID because it doesn't know how to program the PASID into a specific device. While RID is fully self-contained with vfio-pci. Further, with the SVA models, the mediated devices are highly likely to be shared between a vfio-mdev and a normal driver, as IDXD shows. Userspace will get PASIDs for SVA and share the device equally with vfio-mdev. > What elevates a user to be able to allocate such resources in this > new proposal? AFAIK the target for the current SVA model is no limitation. User processes can open their devices, establish SVA and go ahead with their workload. If you are asking about iommu groups.. For PASID the PCI bus/device/function that is the 'control point' for PASID must be secure and owned by the kernel. ie only the kernel can program the device to use a given PASID. 
P2P access from other devices under non-kernel control must not be allowed, as they could program a device to use a PASID the kernel would not authorize. All of this has to be done regardless of VFIO's involvement.. > Do they need a device at all? It's not clear to me why RID based > IOMMU management fits within vfio's scope, but PASID based does not. In RID mode vfio-pci completely owns the PCI function, so it is more natural that VFIO, as the sole device owner, would own the DMA mapping machinery. Further, the RID IOMMU mode is rarely used outside of VFIO so there is not much reason to try and disaggregate the API. PASID on the other hand, is shared. vfio-mdev drivers will share the device with other kernel drivers. PASID and DMA will be concurrent with VFIO and other kernel drivers/etc. Thus it makes more sense here to have the control plane for PASID also be shared and not tied exclusively to VFIO. Jason _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Tue, Sep 15, 2020 at 08:33:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:44:38PM -0700, Raj, Ashok wrote:
> > Hi Jason,
> >
> > I thought we discussed this at LPC, but still seems to be going in
> > circles :-(.
>
> We discussed mdev at LPC, not PASID.
>
> PASID applies widely to many devices and needs to be introduced with a
> wide community agreement so all scenarios will be supportable.

True, reading some of the earlier replies I was clearly confused, as I
thought you were talking about mdev again. But now that you say it, you
have moved past mdev and it's the PASID interfaces, correct?

> > As you had suggested earlier in the mail thread, could Jason Wang maybe
> > build out what it takes to have a full-fledged /dev/sva interface for
> > vDPA and figure out how the interfaces should emerge? Otherwise it
> > appears everyone is talking very high level, and with that, limited
> > understanding of how things work at the moment.
>
> You want Jason Wang to do the work to get Intel PASID support merged?
> Seems a bit of a strange request.

I was reading mdev in my head. Not PASID, sorry.

For native use, applications have just one PASID per process, so there is
no need for quota management. VFIO, being the one used for guests, where
there are more PASIDs per guest, is where this is enforced today.

IIUC, you are asking that part of the interface move to an API interface
that potentially the new /dev/sva and VFIO could share? I think the APIs
for PASID management themselves are generic (Jean's patchset + Jacob's
ioasid set management). Possibly what you need is already available, but
not in the specific way that you expect, maybe? Let me check with Jacob
and let him/Jean pick that up.

Cheers,
Ashok
On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > PASID applies widely to many devices and needs to be introduced with a
> > wide community agreement so all scenarios will be supportable.
>
> True, reading some of the earlier replies I was clearly confused as I
> thought you were talking about mdev again. But now that you say it, you
> have moved past mdev and it's the PASID interfaces, correct?

Yes, we agreed on mdev for IDXD at LPC; we didn't talk about PASID.

> For native use, applications have just one PASID per
> process. There is no need for quota management.

Yes, there is. There is a limited pool of HW PASIDs. If one user
fork-bombs, it can easily claim an unreasonable number from that pool, as
each process will claim a PASID. That can DOS the rest of the system.

If PASID DOS is a worry then it must be solved at the IOMMU level for all
user applications that might trigger a PASID allocation. VFIO is not
special.

> IIUC, you are asking that part of the interface move to an API interface
> that potentially the new /dev/sva and VFIO could share? I think the APIs
> for PASID management themselves are generic (Jean's patchset + Jacob's
> ioasid set management).

Yes, the in-kernel APIs are pretty generic now, and can be used by many
types of drivers.

As JasonW kicked this off, VDPA will need all this identical stuff too.
We already know this, and I think Intel VDPA HW will need it, so it
should concern you too :)

A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
control char dev (e.g. /dev/sva, or maybe /dev/iommu), seems like a
reasonable starting point for discussion.

Jason
On Tue, Sep 15, 2020 at 03:45:10PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > > PASID applies widely to many devices and needs to be introduced with
> > > a wide community agreement so all scenarios will be supportable.
> >
> > True, reading some of the earlier replies I was clearly confused as I
> > thought you were talking about mdev again. But now that you say it, you
> > have moved past mdev and it's the PASID interfaces, correct?
>
> Yes, we agreed on mdev for IDXD at LPC; we didn't talk about PASID.
>
> > For native use, applications have just one PASID per
> > process. There is no need for quota management.
>
> Yes, there is. There is a limited pool of HW PASIDs. If one user
> fork-bombs it can easily claim an unreasonable number from that pool as
> each process will claim a PASID. That can DOS the rest of the system.

Not sure how you see this playing out. For PASIDs used in ENQCMD today
for our SVM usages, we do *not* automatically propagate or allocate new
PASIDs.

The new process needs to bind to get a PASID for its own use. For threads
of the same process the PASID is inherited. For fork(), we do not
auto-allocate them, since PASID isn't a sharable resource - much like how
you would not pass MMIO mmaps that cannot be shared to forked processes.
Your doorbell space, for example.

> If PASID DOS is a worry then it must be solved at the IOMMU level for
> all user applications that might trigger a PASID allocation. VFIO is
> not special.

Feels like you can simply avoid the PASID DOS rather than permit it to
happen.

> > IIUC, you are asking that part of the interface move to an API
> > interface that potentially the new /dev/sva and VFIO could share? I
> > think the APIs for PASID management themselves are generic (Jean's
> > patchset + Jacob's ioasid set management).
>
> Yes, the in-kernel APIs are pretty generic now, and can be used by
> many types of drivers.
Good, so there are no new requirements here, I suppose.

> As JasonW kicked this off, VDPA will need all this identical stuff
> too. We already know this, and I think Intel VDPA HW will need it, so
> it should concern you too :)

This is one of those things where I would disagree and commit :-).

> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> reasonable starting point for discussion.

Looks like now we are getting closer to what we need. :-)

Given that the PASID APIs are general purpose today, any driver can use
them to take advantage. VFIO, fortunately or unfortunately, has the IOMMU
things abstracted. I suppose that support is also mostly built on top of
the generic iommu* API abstractions in a vendor-neutral way?

I'm still lost on what is missing that vDPA can't build on top of what is
available.

Cheers,
Ashok
Hi Jason,

On Tue, 15 Sep 2020 15:45:10 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > > PASID applies widely to many devices and needs to be introduced with
> > > a wide community agreement so all scenarios will be supportable.
> >
> > True, reading some of the earlier replies I was clearly confused as I
> > thought you were talking about mdev again. But now that you say it, you
> > have moved past mdev and it's the PASID interfaces, correct?
>
> Yes, we agreed on mdev for IDXD at LPC; we didn't talk about PASID.
>
> > For native use, applications have just one PASID per
> > process. There is no need for quota management.
>
> Yes, there is. There is a limited pool of HW PASIDs. If one user fork
> bombs it can easily claim an unreasonable number from that pool as
> each process will claim a PASID. That can DOS the rest of the system.
>
> If PASID DOS is a worry then it must be solved at the IOMMU level for
> all user applications that might trigger a PASID allocation. VFIO is
> not special.
>
> > IIUC, you are asking that part of the interface to move to an API
> > interface that potentially the new /dev/sva and VFIO could share? I
> > think the APIs for PASID management themselves are generic (Jean's
> > patchset + Jacob's ioasid set management).
>
> Yes, the in-kernel APIs are pretty generic now, and can be used by
> many types of drivers.

Right, the IOMMU UAPIs are not VFIO specific; we pass a user pointer to
the IOMMU layer to process. Similarly for PASID management, the IOASID
extensions we are proposing will handle ioasid_set (groups/pools), quota,
permissions, and notifications in the IOASID core. There is nothing VFIO
specific.
https://lkml.org/lkml/2020/8/22/12

> As JasonW kicked this off, VDPA will need all this identical stuff
> too.
> We already know this, and I think Intel VDPA HW will need it, so
> it should concern you too :)
>
> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> reasonable starting point for discussion.

I am not sure what can really be consolidated in /dev/sva. VFIO and VDPA
will have their own kernel-user interfaces anyway for their usage models.
They are just providing the specific transport while sharing the generic
IOMMU UAPIs and IOASID management.

As I mentioned, PASID management is already consolidated in the IOASID
layer, so for VDPA or other users it is just a matter of creating its own
ioasid_set and doing allocation. IOASID is also available to in-kernel
users, which do not need /dev/sva AFAICT. For bare metal SVA, I don't see
a need to create this 'floating' state of the PASID when created by
/dev/sva. PASID allocation could happen behind the scenes when users need
to bind page tables to a device DMA stream. Security authorization of the
PASID is natively enforced when the user tries to bind a page table;
there is no need to pass the FD handle of the PASID back to the kernel as
you suggested earlier.

Thanks,
Jacob
On Tue, Sep 15, 2020 at 12:26:32PM -0700, Raj, Ashok wrote:
> > Yes, there is. There is a limited pool of HW PASIDs. If one user fork
> > bombs it can easily claim an unreasonable number from that pool as
> > each process will claim a PASID. That can DOS the rest of the system.
>
> Not sure how you see this playing out. For PASIDs used in ENQCMD today
> for our SVM usages, we do *not* automatically propagate or allocate new
> PASIDs.
>
> The new process needs to bind to get a PASID for its own use. For
> threads of the same process the PASID is inherited. For fork(), we do
> not auto-allocate them.

Auto-allocation doesn't matter; the PASID is tied to the mm_struct. After
fork the program will get a new mm_struct, and it can manually re-trigger
PASID allocation for that mm_struct from any SVA kernel driver.

64k processes, each with their own mm_struct, all triggering SVA, will
allocate 64k PASIDs and use up the whole 16-bit space.

> Given that the PASID APIs are general purpose today, any driver can use
> them to take advantage. VFIO fortunately or unfortunately has the IOMMU
> things abstracted. I suppose that support is also mostly built on top of
> the generic iommu* API abstractions in a vendor-neutral way?
>
> I'm still lost on what is missing that vDPA can't build on top of what
> is available?

I think it is basically everything in this patch. Why duplicate all this
uAPI?

Jason
On Tue, Sep 15, 2020 at 03:08:51PM -0700, Jacob Pan wrote:
> > A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> > control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> > reasonable starting point for discussion.
>
> I am not sure what can really be consolidated in /dev/sva.

More or less, everything in this patch: all the manipulations of PASID
that are required for the vIOMMU use case, etc. Basically all PASID
control that is not just a 1:1 mapping of the mm_struct.

> will have their own kernel-user interfaces anyway for their usage
> models. They are just providing the specific transport while sharing
> generic IOMMU UAPIs and IOASID management.
>
> As I mentioned, PASID management is already consolidated in the IOASID
> layer, so for VDPA or other users it is just a matter of creating its
> own ioasid_set and doing allocation.

Creating the PASID is not the problem; managing what the PASID maps to is
the issue. That is all uAPI that we don't really have today.

> IOASID is also available to in-kernel users, which do not
> need /dev/sva AFAICT. For bare metal SVA, I don't see a need to create
> this 'floating' state of the PASID when created by /dev/sva. PASID
> allocation could happen behind the scenes when users need to bind page
> tables to a device DMA stream.

My point is I would like to see one set of uAPI ioctls to bind page
tables. I don't want to have VFIO, VDPA, etc., etc. uAPIs that do the
exact same things only slightly differently.

If user space wants to bind page tables, create the PASID with /dev/sva,
use ioctls there to set up the page table the way it wants, then pass the
now-configured PASID to a driver that can use it. The driver does not do
page table binding. Do not duplicate all the control plane uAPI in every
driver.

PASID management and binding are separated from the driver(s) that are
using the PASID.
Jason
Hi Jason,

On Tue, 15 Sep 2020 20:51:26 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, Sep 15, 2020 at 03:08:51PM -0700, Jacob Pan wrote:
> > > A PASID vIOMMU solution sharable with VDPA and VFIO, based on a
> > > PASID control char dev (eg /dev/sva, or maybe /dev/iommu) seems
> > > like a reasonable starting point for discussion.
> >
> > I am not sure what can really be consolidated in /dev/sva.
>
> More or less, everything in this patch. All the manipulations of PASID
> that are required for the vIOMMU use case/etc. Basically all PASID
> control that is not just a 1:1 mapping of the mm_struct.
>
> > will have their own kernel-user interfaces anyway for their usage
> > models. They are just providing the specific transport while
> > sharing generic IOMMU UAPIs and IOASID management.
> >
> > As I mentioned, PASID management is already consolidated in the
> > IOASID layer, so for VDPA or other users it is just a matter of
> > creating its own ioasid_set and doing allocation.
>
> Creating the PASID is not the problem, managing what the PASID maps to
> is the issue. That is all uAPI that we don't really have today.
>
> > IOASID is also available to in-kernel users, which do not
> > need /dev/sva AFAICT. For bare metal SVA, I don't see a need to
> > create this 'floating' state of the PASID when created by /dev/sva.
> > PASID allocation could happen behind the scenes when users need to
> > bind page tables to a device DMA stream.
>
> My point is I would like to see one set of uAPI ioctls to bind page
> tables. I don't want to have VFIO, VDPA, etc, etc uAPIs to do the
> exact same things only slightly differently.

Got your point. I am not familiar with VDPA, but the VFIO UAPI here is
very thin, mostly passing through the IOMMU UAPI structs as opaque data.

> If user space wants to bind page tables, create the PASID with
> /dev/sva, use ioctls there to setup the page table the way it wants,
> then pass the now configured PASID to a driver that can use it.
Are we talking about bare metal SVA? If so, I don't see the need for
userspace to know there is a PASID. All user space needs is that its
current mm is bound to a device by the driver, so it can be a one-step
process for the user instead of two.

> Driver does not do page table binding. Do not duplicate all the
> control plane uAPI in every driver.
>
> PASID management and binding are separated from the driver(s) that are
> using the PASID.

Why separate? Drivers need to be involved in PASID life cycle management.
For example, when tearing down a PASID, the driver needs to stop DMA, the
IOMMU driver needs to unbind, etc. If the driver is the control point,
then things are just in order. I am referring to bare metal SVA.

For guest SVA, I agree that binding is separate from PASID allocation.
Could you review this doc in terms of life cycle?
https://lkml.org/lkml/2020/8/22/13

My point is that /dev/sva has no value for bare metal SVA; we are just
talking about whether the guest SVA UAPIs can be consolidated. Or am I
missing something?

> Jason

Thanks,
Jacob
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 15, 2020 10:29 PM
>
> > Do they need a device at all? It's not clear to me why RID based
> > IOMMU management fits within vfio's scope, but PASID based does not.
>
> In RID mode vfio-pci completely owns the PCI function, so it is more
> natural that VFIO, as the sole device owner, would own the DMA mapping
> machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> so there is not much reason to try and disaggregate the API.

It is also used by vDPA.

> PASID on the other hand, is shared. vfio-mdev drivers will share the
> device with other kernel drivers. PASID and DMA will be concurrent
> with VFIO and other kernel drivers/etc.

It looks like you are equating PASID with host-side sharing, while
ignoring another valid usage, in which a PASID-capable device is passed
through to the guest via vfio-pci and PASID is then used by the guest for
guest-side sharing. In that case it is an exclusive usage on the host
side, so what is the problem with VFIO managing PASID, given that
vfio-pci completely owns the function?

Thanks,
Kevin
On 9/16/20 8:22 AM, Jacob Pan (Jun) wrote:
>> If user space wants to bind page tables, create the PASID with
>> /dev/sva, use ioctls there to setup the page table the way it wants,
>> then pass the now configured PASID to a driver that can use it.
>>
> Are we talking about bare metal SVA? If so, I don't see the need for
> userspace to know there is a PASID. All user space needs is that its
> current mm is bound to a device by the driver. So it can be a one-step
> process for the user instead of two.
>
>> Driver does not do page table binding. Do not duplicate all the
>> control plane uAPI in every driver.
>>
>> PASID management and binding are separated from the driver(s) that are
>> using the PASID.
>>
> Why separate? Drivers need to be involved in PASID life cycle
> management. For example, when tearing down a PASID, the driver needs to
> stop DMA, the IOMMU driver needs to unbind, etc. If the driver is the
> control point, then things are just in order. I am referring to bare
> metal SVA.
>
> For guest SVA, I agree that binding is separate from PASID allocation.
> Could you review this doc in terms of life cycle?
> https://lkml.org/lkml/2020/8/22/13
>
> My point is that /dev/sva has no value for bare metal SVA, we are just
> talking about whether guest SVA UAPIs can be consolidated. Or am I
> missing something?

Not only bare metal SVA: subdevice passthrough (Intel Scalable IOV and
Arm SubStream ID) also consumes PASIDs with no user-space involvement at
all, hence /dev/sva is unsuited.

Best regards,
baolu
On 2020/9/14 下午9:31, Jean-Philippe Brucker wrote:
>> If it's possible, I would suggest a generic uAPI instead of a VFIO
>> specific one.
> A large part of this work is already generic uAPI, in
> include/uapi/linux/iommu.h.

This is not what I read from this series; all of the following uAPI is
VFIO specific:

struct vfio_iommu_type1_nesting_op;
struct vfio_iommu_type1_pasid_request;

And include/uapi/linux/iommu.h is not included in
include/uapi/linux/vfio.h at all.

> This patchset connects that generic interface
> to the pre-existing VFIO uAPI that deals with IOMMU mappings of an
> assigned device. But the bulk of the work is done by the IOMMU
> subsystem, and is available to all device drivers.

So is there any reason not to introduce the uAPI in the IOMMU drivers
directly?

>> Jason suggested something like /dev/sva. There will be a lot of other
>> subsystems that could benefit from this (e.g. vDPA).
> Do you have a more precise idea of the interface /dev/sva would provide,
> how it would interact with VFIO and others?

Can we replace the container fd with an sva fd, like:

sva = open("/dev/sva", O_RDWR);
group = open("/dev/vfio/26", O_RDWR);
ioctl(group, VFIO_GROUP_SET_SVA, &sva);

Then we can do all the SVA stuff through the sva fd, and other subsystems
(like vDPA) only need to implement the function that is equivalent to
VFIO_GROUP_SET_SVA.

> vDPA could transport the
> generic iommu.h structures via its own uAPI, and call the IOMMU API
> directly without going through an intermediate /dev/sva handle.

Is there any value in that transporting? I think we have agreed that VFIO
is not the only user of vSVA. It's not hard to forecast that there will
be more subsystems that want to benefit from vSVA, and we don't want to
duplicate similar uAPIs in all of them.

Thanks

> Thanks,
> Jean
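[Editor's note] Sketching Jason Wang's snippet above a little further,
the proposed split could look like the following C-style pseudocode. To
be clear, none of these interfaces exist: /dev/sva, VFIO_GROUP_SET_SVA
and the SVA_* ioctls are all hypothetical names for the proposal being
debated in this thread, not real uAPI.

```
/* All names here are hypothetical -- the shape of the proposal, not an
 * existing interface. */

/* 1. PASID management lives behind its own fd: allocation (where a
 *    quota would be enforced) and page-table binding happen once, here. */
sva = open("/dev/sva", O_RDWR);
ioctl(sva, SVA_ALLOC_PASID, &pasid);
ioctl(sva, SVA_BIND_PGTABLE, &bind_info);   /* guest page table for pasid */

/* 2. Each subsystem only attaches the already-configured context. */
group = open("/dev/vfio/26", O_RDWR);
ioctl(group, VFIO_GROUP_SET_SVA, &sva);

/* A vDPA device fd would implement only an equivalent of
 * VFIO_GROUP_SET_SVA, reusing the same sva object instead of
 * duplicating the bind/invalidate uAPI. */
```

The contention in the rest of the thread is over step 1: whether
allocation and binding belong behind a shared fd like this, or inside
each subsystem's existing uAPI.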
On 2020/9/16 上午3:26, Raj, Ashok wrote:
>>> IIUC, you are asking that part of the interface to move to an API
>>> interface that potentially the new /dev/sva and VFIO could share? I
>>> think the APIs for PASID management themselves are generic (Jean's
>>> patchset + Jacob's ioasid set management).
>> Yes, the in-kernel APIs are pretty generic now, and can be used by
>> many types of drivers.
> Good, so there are no new requirements here I suppose.

The requirement is not for in-kernel APIs but for a generic uAPI.

>> As JasonW kicked this off, VDPA will need all this identical stuff
>> too. We already know this, and I think Intel VDPA HW will need it, so
>> it should concern you too :)
> This is one of those things where I would disagree and commit :-).
>
>> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
>> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
>> reasonable starting point for discussion.
> Looks like now we are getting closer to what we need. :-)
>
> Given that the PASID APIs are general purpose today, any driver can use
> them to take advantage. VFIO fortunately or unfortunately has the IOMMU
> things abstracted. I suppose that support is also mostly built on top
> of the generic iommu* API abstractions in a vendor-neutral way?
>
> I'm still lost on what is missing that vDPA can't build on top of what
> is available?

For sure it can, but we may end up with duplicated (or similar) uAPIs,
which is bad.

Thanks

> Cheers,
> Ashok
On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 15, 2020 10:29 PM
> >
> > > Do they need a device at all? It's not clear to me why RID based
> > > IOMMU management fits within vfio's scope, but PASID based does not.
> >
> > In RID mode vfio-pci completely owns the PCI function, so it is more
> > natural that VFIO, as the sole device owner, would own the DMA mapping
> > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> > so there is not much reason to try and disaggregate the API.
>
> It is also used by vDPA.
>
> > PASID on the other hand, is shared. vfio-mdev drivers will share the
> > device with other kernel drivers. PASID and DMA will be concurrent
> > with VFIO and other kernel drivers/etc.
>
> Looks you are equating PASID to host-side sharing, while ignoring
> another valid usage that a PASID-capable device is passed through
> to the guest through vfio-pci and then PASID is used by the guest
> for guest-side sharing. In such case, it is an exclusive usage in host
> side and then what is the problem for VFIO to manage PASID given
> that vfio-pci completely owns the function?

And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
the PASID space of a PCI function cannot be shared between host and
guest, so we assign the whole PASID table along with the RID. Since we
need the BIND, INVALIDATE, and report APIs introduced here to support
nested translation, a /dev/sva interface would need to support this mode
as well.

Thanks,
Jean
On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 15, 2020 10:29 PM
> >
> > > Do they need a device at all? It's not clear to me why RID based
> > > IOMMU management fits within vfio's scope, but PASID based does not.
> >
> > In RID mode vfio-pci completely owns the PCI function, so it is more
> > natural that VFIO, as the sole device owner, would own the DMA mapping
> > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> > so there is not much reason to try and disaggregate the API.
>
> It is also used by vDPA.

A driver in VDPA, not VDPA itself.

> > PASID on the other hand, is shared. vfio-mdev drivers will share the
> > device with other kernel drivers. PASID and DMA will be concurrent
> > with VFIO and other kernel drivers/etc.
>
> Looks you are equating PASID to host-side sharing, while ignoring
> another valid usage that a PASID-capable device is passed through
> to the guest through vfio-pci and then PASID is used by the guest
> for guest-side sharing. In such case, it is an exclusive usage in host
> side and then what is the problem for VFIO to manage PASID given
> that vfio-pci completely owns the function?

This is no different than vfio-pci being yet another client of /dev/sva.

Jason
On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> And this is the only PASID model for Arm SMMU (and AMD IOMMU, I
> believe): the PASID space of a PCI function cannot be shared between
> host and guest, so we assign the whole PASID table along with the RID.
> Since we need the BIND, INVALIDATE, and report APIs introduced here to
> support nested translation, a /dev/sva interface would need to support
> this mode as well.

Well, that means this HW cannot support PASID-capable 'SIOV' style
devices in guests.

I admit whole-function PASID delegation might be something vfio-pci
should handle - but only if it really doesn't fit in some /dev/sva after
we cover the other PASID cases.

Jason
On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:
> > If user space wants to bind page tables, create the PASID with
> > /dev/sva, use ioctls there to setup the page table the way it wants,
> > then pass the now configured PASID to a driver that can use it.
>
> Are we talking about bare metal SVA?

What a weird term.

> If so, I don't see the need for userspace to know there is a
> PASID. All user space needs is that my current mm is bound to a
> device by the driver. So it can be a one-step process for user
> instead of two.

You've missed the entire point of the conversation: VDPA already needs
more than "my current mm is bound to a device".

> > PASID management and binding are separated from the driver(s) that
> > are using the PASID.
>
> Why separate? Drivers need to be involved in PASID life cycle
> management. For example, when tearing down a PASID, the driver needs to
> stop DMA, the IOMMU driver needs to unbind, etc. If the driver is the
> control point, then things are just in order. I am referring to bare
> metal SVA.

Drivers can be involved and still have the uAPIs separate. It isn't hard.

Jason
On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> > And this is the only PASID model for Arm SMMU (and AMD IOMMU, I
> > believe): the PASID space of a PCI function cannot be shared between
> > host and guest, so we assign the whole PASID table along with the
> > RID. Since we need the BIND, INVALIDATE, and report APIs introduced
> > here to support nested translation, a /dev/sva interface would need
> > to support this mode as well.
>
> Well, that means this HW cannot support PASID capable 'SIOV' style
> devices in guests.

It does not yet support Intel SIOV, no. It does support the standards,
though: PCI SR-IOV to partition a device, and PASIDs in a guest.

> I admit whole function PASID delegation might be something vfio-pci
> should handle - but only if it really doesn't fit in some /dev/sva
> after we cover the other PASID cases.

Wouldn't that be the duplication you're trying to avoid? A second channel
for bind, invalidate, capability and fault reporting mechanisms? If we
extract the SVA parts of vfio_iommu_type1 into a separate chardev, PASID
table pass-through [1] will have to use that.

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/20200320161911.27494-1-eric.auger@redhat.com/
On Wed, Sep 16, 2020 at 06:20:52PM +0200, Jean-Philippe Brucker wrote:
> On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> > > And this is the only PASID model for Arm SMMU (and AMD IOMMU, I
> > > believe): the PASID space of a PCI function cannot be shared
> > > between host and guest, so we assign the whole PASID table along
> > > with the RID. Since we need the BIND, INVALIDATE, and report APIs
> > > introduced here to support nested translation, a /dev/sva interface
> > > would need to support this mode as well.
> >
> > Well, that means this HW cannot support PASID capable 'SIOV' style
> > devices in guests.
>
> It does not yet support Intel SIOV, no. It does support the standards,
> though: PCI SR-IOV to partition a device and PASIDs in a guest.

SIOV is basically standards based; it is better thought of as a cookbook
on how to use PASID and the IOMMU together.

> > I admit whole function PASID delegation might be something vfio-pci
> > should handle - but only if it really doesn't fit in some /dev/sva
> > after we cover the other PASID cases.
>
> Wouldn't that be the duplication you're trying to avoid? A second
> channel for bind, invalidate, capability and fault reporting
> mechanisms?

Yes, which is why it seems like it would be nicer to avoid it. Why I said
"might" :)

> If we extract SVA parts of vfio_iommu_type1 into a separate chardev,
> PASID table pass-through [1] will have to use that.

Yes, '/dev/sva' (which is a terrible name) would want to be the uAPI
entry point for controlling the vIOMMU related to PASID.

Does anything in the [1] series have tight coupling to VFIO other than
needing to know a bus/device/function? It looks like it is mostly
exposing iommu_* functions as uAPI.

Jason
On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote: > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote: > > > If user space wants to bind page tables, create the PASID with > > > /dev/sva, use ioctls there to setup the page table the way it wants, > > > then pass the now configured PASID to a driver that can use it. > > > > Are we talking about bare metal SVA? > > What a weird term. Glad you noticed it at v7 :-) Any suggestions on something less weird than Shared Virtual Addressing? There is a reason why we moved from SVM to SVA. > > > If so, I don't see the need for userspace to know there is a > > PASID. All user space need is that my current mm is bound to a > > device by the driver. So it can be a one-step process for user > > instead of two. > > You've missed the entire point of the conversation, VDPA already needs > more than "my current mm is bound to a device" You mean the current version of vDPA, or a potential future version of vDPA? Cheers, Ashok
Hi, On 9/16/20 6:32 PM, Jason Gunthorpe wrote: > On Wed, Sep 16, 2020 at 06:20:52PM +0200, Jean-Philippe Brucker wrote: >> On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote: >>> On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote: >>>> And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe): >>>> the PASID space of a PCI function cannot be shared between host and guest, >>>> so we assign the whole PASID table along with the RID. Since we need the >>>> BIND, INVALIDATE, and report APIs introduced here to support nested >>>> translation, a /dev/sva interface would need to support this mode as well. >>> >>> Well, that means this HW cannot support PASID capable 'SIOV' style >>> devices in guests. >> >> It does not yet support Intel SIOV, no. It does support the standards, >> though: PCI SR-IOV to partition a device and PASIDs in a guest. > > SIOV is basically standards based, it is better thought of as a > cookbook on how to use PASID and IOMMU together. > >>> I admit whole function PASID delegation might be something vfio-pci >>> should handle - but only if it really doesn't fit in some /dev/sva >>> after we cover the other PASID cases. >> >> Wouldn't that be the duplication you're trying to avoid? A second >> channel for bind, invalidate, capability and fault reporting >> mechanisms? > > Yes, which is why it seems like it would be nicer to avoid it. Why I > said "might" :) > >> If we extract SVA parts of vfio_iommu_type1 into a separate chardev, >> PASID table pass-through [1] will have to use that. > > Yes, '/dev/sva' (which is a terrible name) would want to be the uAPI > entry point for controlling the vIOMMU related to PASID. > > Does anything in the [1] series have tight coupling to VFIO other than > needing to know a bus/device/function? It looks like it is mostly > exposing iommu_* functions as uAPI? this series does not use any PASID so it fits quite nicely into the VFIO framework I think. 
Besides cache invalidation, which takes the struct device, the other operations (MSI binding and PASID table passing) operate on the iommu domain. Also we use the VFIO memory region and interrupt/eventfd registration mechanism to return faults. Thanks Eric > > Jason
On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote: > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote: > > > > If user space wants to bind page tables, create the PASID with > > > > /dev/sva, use ioctls there to setup the page table the way it wants, > > > > then pass the now configured PASID to a driver that can use it. > > > > > > Are we talking about bare metal SVA? > > > > What a weird term. > > Glad you noticed it at v7 :-) > > Any suggestions on something less weird than > Shared Virtual Addressing? There is a reason why we moved from SVM > to SVA. SVA is fine, what is "bare metal" supposed to mean? PASID is about constructing an arbitrary DMA IOVA map for PCI-E devices, being able to intercept device DMA faults, etc. SVA is doing DMA IOVA 1:1 with the mm_struct CPU VA. DMA faults trigger the same thing as CPU page faults. If it is not 1:1 then there is no "shared". When SVA is done using PCI-E PASID it is "PASID for SVA". Lots of existing devices already have SVA without PASID or IOMMU, so let's not muddy the terminology. vPASID/vIOMMU is allowing a guest to control the DMA IOVA map and manipulate the PASIDs. vSVA is when a guest uses a vPASID to provide SVA; not sure this is an informative term. This particular patch series seems to be about vPASID/vIOMMU for vfio-mdev vs the other vPASID/vIOMMU patch which was about vPASID for vfio-pci. > > > If so, I don't see the need for userspace to know there is a > > > PASID. All user space need is that my current mm is bound to a > > > device by the driver. So it can be a one-step process for user > > > instead of two. > > > > You've missed the entire point of the conversation, VDPA already needs > > more than "my current mm is bound to a device" > > You mean current version of vDPA? or a potential future version of vDPA?
Future VDPA drivers; it was made clear this was important to Intel during the argument about VDPA as a mdev. Jason
Hi Jason, On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote: > > > > > If user space wants to bind page tables, create the PASID with > > > > > /dev/sva, use ioctls there to setup the page table the way it > > > > > wants, then pass the now configured PASID to a driver that > > > > > can use it. > > > > > > > > Are we talking about bare metal SVA? > > > > > > What a weird term. > > > > Glad you noticed it at v7 :-) > > > > Any suggestions on something less weird than > > Shared Virtual Addressing? There is a reason why we moved from SVM > > to SVA. > > SVA is fine, what is "bare metal" supposed to mean? > What I meant here is sharing virtual address between DMA and host process. This requires devices perform DMA request with PASID and use IOMMU first level/stage 1 page tables. This can be further divided into 1) user SVA 2) supervisor SVA (sharing init_mm) My point is that /dev/sva is not useful here since the driver can perform PASID allocation while doing SVA bind. > PASID is about constructing an arbitary DMA IOVA map for PCI-E > devices, being able to intercept device DMA faults, etc. > An arbitrary IOVA map does not need PASID. In IOVA, you do map/unmap explicitly, why you need to handle IO page fault? To me, PASID identifies an address space that is associated with a mm_struct. > SVA is doing DMA IOVA 1:1 with the mm_struct CPU VA. DMA faults > trigger the same thing as CPU page faults. If is it not 1:1 then there > is no "shared". When SVA is done using PCI-E PASID it is "PASID for > SVA". Lots of existing devices already have SVA without PASID or > IOMMU, so lets not muddy the terminology. > I agree. 
This conversation is about "PASID for SVA" not "SVA without PASID" > vPASID/vIOMMU is allowing a guest to control the DMA IOVA map and > manipulate the PASIDs. > > vSVA is when a guest uses a vPASID to provide SVA, not sure this is > an informative term. > I agree. > This particular patch series seems to be about vPASID/vIOMMU for > vfio-mdev vs the other vPASID/vIOMMU patch which was about vPASID for > vfio-pci. > Yi can correct me but this set is about VFIO-PCI; VFIO-mdev will be introduced later. > > > > If so, I don't see the need for userspace to know there is a > > > > PASID. All user space need is that my current mm is bound to a > > > > device by the driver. So it can be a one-step process for user > > > > instead of two. > > > > > > You've missed the entire point of the conversation, VDPA already > > > needs more than "my current mm is bound to a device" > > > > You mean current version of vDPA? or a potential future version of > > vDPA? > > Future VDPA drivers, it was made clear this was important to Intel > during the argument about VDPA as a mdev. > > Jason Thanks, Jacob
On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote: > Hi Jason, > On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com> > wrote: > > > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: > > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote: > > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote: > > > > > > If user space wants to bind page tables, create the PASID with > > > > > > /dev/sva, use ioctls there to setup the page table the way it > > > > > > wants, then pass the now configured PASID to a driver that > > > > > > can use it. > > > > > > > > > > Are we talking about bare metal SVA? > > > > > > > > What a weird term. > > > > > > Glad you noticed it at v7 :-) > > > > > > Any suggestions on something less weird than > > > Shared Virtual Addressing? There is a reason why we moved from SVM > > > to SVA. > > > > SVA is fine, what is "bare metal" supposed to mean? > > > What I meant here is sharing virtual address between DMA and host > process. This requires devices perform DMA request with PASID and use > IOMMU first level/stage 1 page tables. > This can be further divided into 1) user SVA 2) supervisor SVA (sharing > init_mm) > > My point is that /dev/sva is not useful here since the driver can > perform PASID allocation while doing SVA bind. No, you are thinking too small. Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA. When VDPA is used by DPDK it makes sense that the PASID will be SVA and 1:1 with the mm_struct. When VDPA is used by qemu it makes sense that the PASID will be an arbitary IOVA map constructed to be 1:1 with the guest vCPU physical map. /dev/sva allows a single uAPI to do this kind of setup, and qemu can support it while supporting a range of SVA kernel drivers. VDPA and vfio-mdev are obvious initial targets. *BOTH* are needed. In general any uAPI for PASID should have the option to use either the mm_struct SVA PASID *OR* a PASID from /dev/sva. 
It costs virtually nothing to implement this in the driver as PASID is just a number, and gives so much more flexibility. > Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will be > introduced later. Last patch is: vfio/type1: Add vSVA support for IOMMU-backed mdevs So pretty hard to see how this is not about vfio-mdev, at least a little.. Jason
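Jason's point that "PASID is just a number" can be illustrated with a small model. This is a hypothetical sketch, not any real kernel or /dev/sva API; the names sva_bind_mm, dev_sva_alloc_pasid, and Driver are invented purely for illustration, under the assumption that a driver's bind path only needs the numeric PASID and not its provenance.

```python
# Toy model of "PASID is just a number": a driver uAPI that accepts a
# PASID works the same whether the PASID came from an SVA bind of the
# current mm or from an explicit /dev/sva-style allocation.

import itertools

_pasid_ids = itertools.count(1)  # a single system-wide PASID space

def sva_bind_mm(mm):
    """Model of mm-based SVA bind: the kernel picks the PASID for this mm."""
    return next(_pasid_ids)

def dev_sva_alloc_pasid():
    """Model of an explicit /dev/sva-style allocation."""
    return next(_pasid_ids)

class Driver:
    def __init__(self):
        self.bound = {}

    def bind_queue(self, queue, pasid):
        # The driver does not care where the PASID came from.
        self.bound[queue] = pasid

drv = Driver()
drv.bind_queue("q0", sva_bind_mm(mm="current"))  # SVA path
drv.bind_queue("q1", dev_sva_alloc_pasid())      # /dev/sva path
assert drv.bound["q0"] != drv.bound["q1"]
```

The only cost to the driver is plumbing an extra integer through its bind call, which is the "virtually nothing" being claimed above.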
Hi Jason, On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote: > > Hi Jason, > > On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com> > > wrote: > > > > > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: > > > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe > > > > wrote: > > > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) > > > > > wrote: > > > > > > > If user space wants to bind page tables, create the PASID > > > > > > > with /dev/sva, use ioctls there to setup the page table > > > > > > > the way it wants, then pass the now configured PASID to a > > > > > > > driver that can use it. > > > > > > > > > > > > Are we talking about bare metal SVA? > > > > > > > > > > What a weird term. > > > > > > > > Glad you noticed it at v7 :-) > > > > > > > > Any suggestions on something less weird than > > > > Shared Virtual Addressing? There is a reason why we moved from > > > > SVM to SVA. > > > > > > SVA is fine, what is "bare metal" supposed to mean? > > > > > What I meant here is sharing virtual address between DMA and host > > process. This requires devices perform DMA request with PASID and > > use IOMMU first level/stage 1 page tables. > > This can be further divided into 1) user SVA 2) supervisor SVA > > (sharing init_mm) > > > > My point is that /dev/sva is not useful here since the driver can > > perform PASID allocation while doing SVA bind. > > No, you are thinking too small. > > Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA. > Could you point to me the SVA UAPI? I couldn't find it in the mainline. Seems VDPA uses VHOST interface? > When VDPA is used by DPDK it makes sense that the PASID will be SVA > and 1:1 with the mm_struct. > I still don't see why bare metal DPDK needs to get a handle of the PASID. Perhaps the SVA patch would explain. 
Or are you talking about the vDPA DPDK process that is used to support virtio-net-pmd in the guest? > When VDPA is used by qemu it makes sense that the PASID will be an > arbitary IOVA map constructed to be 1:1 with the guest vCPU physical > map. /dev/sva allows a single uAPI to do this kind of setup, and qemu > can support it while supporting a range of SVA kernel drivers. VDPA > and vfio-mdev are obvious initial targets. > > *BOTH* are needed. > > In general any uAPI for PASID should have the option to use either the > mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually > nothing to implement this in the driver as PASID is just a number, and > gives so much more flexability. Not really nothing in terms of PASID life cycles. For example, if the user uses the uacce interface to open an accelerator, it gets an FD_acc. Then it opens /dev/sva to allocate a PASID and gets another FD_pasid. Then we pass FD_pasid to the driver to bind page tables, perhaps multiple drivers. Now we have to worry about FD_pasid getting closed before the FD_acc(s) are closed, and all these race conditions. If we do not expose FD_pasid to the user, the teardown is much simpler and streamlined. Following each FD_acc close, PASID unbind is performed. > > Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will be > introduced later. Last patch is: vfio/type1: Add vSVA support for IOMMU-backed mdevs So pretty hard to see how this is not about vfio-mdev, at least a little.. Jason Thanks, Jacob
On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote: > Hi Jason, > On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com> > wrote: > >> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote: >>> Hi Jason, >>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com> >>> wrote: >>> >>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: >>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe >>>>> wrote: >>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) >>>>>> wrote: >>>>>>>> If user space wants to bind page tables, create the PASID >>>>>>>> with /dev/sva, use ioctls there to setup the page table >>>>>>>> the way it wants, then pass the now configured PASID to a >>>>>>>> driver that can use it. >>>>>>> Are we talking about bare metal SVA? >>>>>> What a weird term. >>>>> Glad you noticed it at v7 :-) >>>>> >>>>> Any suggestions on something less weird than >>>>> Shared Virtual Addressing? There is a reason why we moved from >>>>> SVM to SVA. >>>> SVA is fine, what is "bare metal" supposed to mean? >>>> >>> What I meant here is sharing virtual address between DMA and host >>> process. This requires devices perform DMA request with PASID and >>> use IOMMU first level/stage 1 page tables. >>> This can be further divided into 1) user SVA 2) supervisor SVA >>> (sharing init_mm) >>> >>> My point is that /dev/sva is not useful here since the driver can >>> perform PASID allocation while doing SVA bind. >> No, you are thinking too small. >> >> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA. >> > Could you point to me the SVA UAPI? I couldn't find it in the mainline. > Seems VDPA uses VHOST interface? It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h. > >> When VDPA is used by DPDK it makes sense that the PASID will be SVA >> and 1:1 with the mm_struct. >> > I still don't see why bare metal DPDK needs to get a handle of the > PASID. 
My understanding is that it may: - have a unified uAPI with vSVA: alloc, bind, unbind, free - leave the binding policy to userspace instead of using an implied one in the kernel > Perhaps the SVA patch would explain. Or are you talking about > vDPA DPDK process that is used to support virtio-net-pmd in the guest? >> When VDPA is used by qemu it makes sense that the PASID will be an >> arbitary IOVA map constructed to be 1:1 with the guest vCPU physical >> map. /dev/sva allows a single uAPI to do this kind of setup, and qemu >> can support it while supporting a range of SVA kernel drivers. VDPA >> and vfio-mdev are obvious initial targets. >> >> *BOTH* are needed. >> >> In general any uAPI for PASID should have the option to use either the >> mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually >> nothing to implement this in the driver as PASID is just a number, and >> gives so much more flexability. >> > Not really nothing in terms of PASID life cycles. For example, if user > uses uacce interface to open an accelerator, it gets an FD_acc. Then it > opens /dev/sva to allocate PASID then get another FD_pasid. Then we > pass FD_pasid to the driver to bind page tables, perhaps multiple > drivers. Now we have to worry about If FD_pasid gets closed before > FD_acc(s) closed and all these race conditions. I'm not sure I understand this. But this demonstrates the flexibility of a unified uAPI. E.g. it allows a vDPA and a VFIO device to use the same PASID, which can be shared with a process in the guest. For the race condition, it could probably be solved with refcnt. Thanks > If we do not expose FD_pasid to the user, the teardown is much simpler > and streamlined. Following each FD_acc close, PASID unbind is performed. >>> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will >>> be introduced later.
>> Last patch is: >> >> vfio/type1: Add vSVA support for IOMMU-backed mdevs >> >> So pretty hard to see how this is not about vfio-mdev, at least a >> little.. >> >> Jason > > Thanks, > > Jacob >
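For reference, the vhost message structure Jason Wang points to above looks roughly like this in the v5.9-era uapi header (abbreviated here, with the kernel's __u64/__u8 types replaced by stdint equivalents so the snippet stands alone). Note that it carries no PASID/ASID field at this point, which is exactly the gap the rest of the thread debates.

```c
/* Abbreviated sketch of struct vhost_iotlb_msg from
 * include/uapi/linux/vhost_types.h (circa v5.9); types swapped to
 * stdint for a self-contained example. */
#include <stdint.h>

struct vhost_iotlb_msg {
	uint64_t iova;   /* guest IOVA the message refers to */
	uint64_t size;   /* length of the range */
	uint64_t uaddr;  /* userspace VA backing the range */
#define VHOST_ACCESS_RO      0x1
#define VHOST_ACCESS_WO      0x2
#define VHOST_ACCESS_RW      0x3
	uint8_t perm;
#define VHOST_IOTLB_MISS        1
#define VHOST_IOTLB_UPDATE      2
#define VHOST_IOTLB_INVALIDATE  3
#define VHOST_IOTLB_ACCESS_FAIL 4
	uint8_t type;
};

/* Example: an UPDATE message mapping one 4 KiB guest IOVA page to a
 * user virtual address, read-write. */
static const struct vhost_iotlb_msg example_update = {
	.iova = 0x1000, .size = 0x1000, .uaddr = 0x7f0000001000,
	.perm = VHOST_ACCESS_RW, .type = VHOST_IOTLB_UPDATE,
};
```

An UPDATE/INVALIDATE pair covers the "#1 partially" point made below: it can flush and repopulate IOTLB state, but says nothing about which PASID the range belongs to.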
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, September 16, 2020 10:45 PM > > On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, September 15, 2020 10:29 PM > > > > > > > Do they need a device at all? It's not clear to me why RID based > > > > IOMMU management fits within vfio's scope, but PASID based does not. > > > > > > In RID mode vfio-pci completely owns the PCI function, so it is more > > > natural that VFIO, as the sole device owner, would own the DMA > mapping > > > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO > > > so there is not much reason to try and disaggregate the API. > > > > It is also used by vDPA. > > A driver in VDPA, not VDPA itself. what is the difference? It is still the example of using RID IOMMU mode outside of VFIO (and just implies that vDPA even doesn't do a good abstraction internally). > > > > PASID on the other hand, is shared. vfio-mdev drivers will share the > > > device with other kernel drivers. PASID and DMA will be concurrent > > > with VFIO and other kernel drivers/etc. > > > > Looks you are equating PASID to host-side sharing, while ignoring > > another valid usage that a PASID-capable device is passed through > > to the guest through vfio-pci and then PASID is used by the guest > > for guest-side sharing. In such case, it is an exclusive usage in host > > side and then what is the problem for VFIO to manage PASID given > > that vfio-pci completely owns the function? > > This is no different than vfio-pci being yet another client to > /dev/sva > My comment was to echo Alex's question about "why RID based IOMMU management fits within vfio's scope, but PASID based does not". and when talking about generalization we should look bigger beyond sva. What really matters here is the iommu_domain which is about everything related to DMA mapping. 
The domain associated with a passthru device is marked as "unmanaged" in the kernel and allows userspace to manage DMA mapping of this device through a set of iommu_ops: - alloc/free domain; - attach/detach device/subdevice; - map/unmap a memory region; - bind/unbind page table and invalidate iommu cache; - ... (and lots of other callbacks) map/unmap or bind/unbind are just different ways of managing DMAs in an iommu domain. The passthrough framework (VFIO or VDPA) has been providing its uAPI to manage every aspect of iommu_domain so far, and sva is just a natural extension following this design. If we really want to generalize something, it needs to be /dev/iommu as a unified interface for managing every aspect of iommu_domain. Asking for SVA abstraction alone just causes an unnecessary mess for both the kernel (syncing domain/device association between /dev/vfio and /dev/sva) and userspace (talking to two interfaces even for the same vfio-pci device). Then it sounds more like a bandaid for saving development effort in VDPA (which should instead have proposed /dev/iommu when it was invented, instead of reinventing its own bits until such effort is unaffordable and then asking for partial abstraction to fix its gap). Thanks Kevin
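Kevin's /dev/iommu idea can be sketched as a single operation set covering the whole iommu_domain lifecycle. Nothing like this interface existed at the time of the thread; the enum below is purely illustrative, with operation names invented to mirror the iommu_ops list above.

```c
/* Purely illustrative sketch of a "/dev/iommu" operation surface: one
 * char-dev uAPI covering every aspect of an iommu_domain, as argued
 * above.  None of these operations are real; names are invented. */

enum iommu_dev_op {
	IOMMU_DOMAIN_ALLOC,     /* alloc/free a domain */
	IOMMU_DOMAIN_FREE,
	IOMMU_ATTACH_DEVICE,    /* attach/detach device or subdevice */
	IOMMU_DETACH_DEVICE,
	IOMMU_MAP,              /* map/unmap a memory region (RID mode) */
	IOMMU_UNMAP,
	IOMMU_BIND_PGTBL,       /* bind/unbind guest page table (nested) */
	IOMMU_UNBIND_PGTBL,
	IOMMU_CACHE_INVALIDATE, /* invalidate iommu cache */
};

/* map/unmap and bind/unbind are just two ways of managing DMA within
 * one domain, which is why a single interface can carry both. */
```

The point of the sketch is structural: map/unmap (RID mode) and bind/unbind (nested mode) sit side by side in one namespace, so neither VFIO nor vDPA would need a second channel for the SVA subset.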
On Thu, Sep 17, 2020 at 11:53:49AM +0800, Jason Wang wrote: > > > When VDPA is used by qemu it makes sense that the PASID will be an > > > arbitary IOVA map constructed to be 1:1 with the guest vCPU physical > > > map. /dev/sva allows a single uAPI to do this kind of setup, and qemu > > > can support it while supporting a range of SVA kernel drivers. VDPA > > > and vfio-mdev are obvious initial targets. > > > > > > *BOTH* are needed. > > > > > > In general any uAPI for PASID should have the option to use either the > > > mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually > > > nothing to implement this in the driver as PASID is just a number, and > > > gives so much more flexability. > > > > > Not really nothing in terms of PASID life cycles. For example, if user > > uses uacce interface to open an accelerator, it gets an FD_acc. Then it > > opens /dev/sva to allocate PASID then get another FD_pasid. Then we > > pass FD_pasid to the driver to bind page tables, perhaps multiple > > drivers. Now we have to worry about If FD_pasid gets closed before > > FD_acc(s) closed and all these race conditions. > > > I'm not sure I understand this. But this demonstrates the flexibility of an > unified uAPI. E.g it allows vDPA and VFIO device to use the same PAISD which > can be shared with a process in the guest. > > For the race condition, it could be probably solved with refcnt. Yep Jason
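The refcounting idea both Jasons agree on can be sketched in a few lines: the PASID is freed only when the last holder (the FD_pasid itself or any driver binding made through an FD_acc) drops its reference, so the close ordering Jacob worries about stops mattering. The Pasid class below is a toy model, not kernel code.

```python
# Toy model of refcounted PASID lifetime: actual free happens only at
# the last put, regardless of whether FD_pasid or the FD_acc bindings
# close first.

class Pasid:
    def __init__(self, value):
        self.value = value
        self.refs = 1          # reference held by FD_pasid itself
        self.freed = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # free only when nobody holds it

pasid = Pasid(42)
pasid.get()          # driver A binds page tables (via FD_acc A)
pasid.get()          # driver B binds too
pasid.put()          # user closes FD_pasid first -- no use-after-free
assert not pasid.freed
pasid.put()          # FD_acc A closes, driver A unbinds
pasid.put()          # FD_acc B closes: last reference frees the PASID
assert pasid.freed
```

This is the standard kernel pattern for shared object lifetime; the open question in the thread is not whether it works but whether exposing FD_pasid to userspace is worth the extra step.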
Hi Jason, On Thu, 17 Sep 2020 11:53:49 +0800, Jason Wang <jasowang@redhat.com> wrote: > On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote: > > Hi Jason, > > On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com> > > wrote: > > > >> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote: > >>> Hi Jason, > >>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe > >>> <jgg@nvidia.com> wrote: > >>> > >>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: > >>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe > >>>>> wrote: > >>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) > >>>>>> wrote: > >>>>>>>> If user space wants to bind page tables, create the PASID > >>>>>>>> with /dev/sva, use ioctls there to setup the page table > >>>>>>>> the way it wants, then pass the now configured PASID to a > >>>>>>>> driver that can use it. > >>>>>>> Are we talking about bare metal SVA? > >>>>>> What a weird term. > >>>>> Glad you noticed it at v7 :-) > >>>>> > >>>>> Any suggestions on something less weird than > >>>>> Shared Virtual Addressing? There is a reason why we moved from > >>>>> SVM to SVA. > >>>> SVA is fine, what is "bare metal" supposed to mean? > >>>> > >>> What I meant here is sharing virtual address between DMA and host > >>> process. This requires devices perform DMA request with PASID and > >>> use IOMMU first level/stage 1 page tables. > >>> This can be further divided into 1) user SVA 2) supervisor SVA > >>> (sharing init_mm) > >>> > >>> My point is that /dev/sva is not useful here since the driver can > >>> perform PASID allocation while doing SVA bind. > >> No, you are thinking too small. > >> > >> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the > >> SVA. > > Could you point to me the SVA UAPI? I couldn't find it in the > > mainline. Seems VDPA uses VHOST interface? > > > It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h. 
> Thanks for the pointer, for complete vSVA functionality we would need 1) TLB flush (IOTLB and PASID cache etc.), 2) PASID alloc/free, 3) bind/unbind page tables or PASID tables, 4) Page request service. Seems vhost_iotlb_msg can be used for #1 partially. And the proposal is to pluck out the rest into /dev/sva? Seems awkward, as Alex pointed out earlier for a similar situation in VFIO. > > > > >> When VDPA is used by DPDK it makes sense that the PASID will be SVA > >> and 1:1 with the mm_struct. > >> > > I still don't see why bare metal DPDK needs to get a handle of the > > PASID. > > > My understanding is that it may: > > - have a unified uAPI with vSVA: alloc, bind, unbind, free > Got your point, but vSVA needs more than these > > - leave the binding policy to userspace instead of the using a > > implied one in the kenrel > > > Only if necessary. > > > Perhaps the SVA patch would explain. Or are you talking about > > vDPA DPDK process that is used to support virtio-net-pmd in the > > guest? > >> When VDPA is used by qemu it makes sense that the PASID will be an > >> arbitary IOVA map constructed to be 1:1 with the guest vCPU > >> physical map.
Now we have to worry about If FD_pasid > > gets closed before FD_acc(s) closed and all these race conditions. > > > I'm not sure I understand this. But this demonstrates the flexibility > of an unified uAPI. E.g it allows vDPA and VFIO device to use the > same PAISD which can be shared with a process in the guest. > This is for user DMA, not for vSVA. I was contending that /dev/sva creates unnecessary steps for such usage. For vSVA, I think vDPA and VFIO can potentially share, but I am not seeing convincing benefits. If a guest process wants to do SVA with a VFIO assigned device and a vDPA-backed virtio-net at the same time, it might be a limitation if PASID is not managed via a common interface. But I am not sure what vDPA SVA support will look like: does it support gIOVA? Does it need virtio IOMMU? > For the race condition, it could be probably solved with refcnt. > Agreed, but the best solution might be not to have the problem in the first place :) > Thanks > > > > If we do not expose FD_pasid to the user, the teardown is much > > simpler and streamlined. Following each FD_acc close, PASID unbind > > is performed. > >>> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev > >>> will be introduced later. > >> Last patch is: > >> > >> vfio/type1: Add vSVA support for IOMMU-backed mdevs > >> > >> So pretty hard to see how this is not about vfio-mdev, at least a > >> little.. > >> > >> Jason > > > > Thanks, > > > > Jacob > > > Thanks, Jacob
On 2020/9/18 上午2:17, Jacob Pan (Jun) wrote: > Hi Jason, > On Thu, 17 Sep 2020 11:53:49 +0800, Jason Wang <jasowang@redhat.com> > wrote: > >> On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote: >>> Hi Jason, >>> On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com> >>> wrote: >>> >>>> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote: >>>>> Hi Jason, >>>>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe >>>>> <jgg@nvidia.com> wrote: >>>>> >>>>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote: >>>>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe >>>>>>> wrote: >>>>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) >>>>>>>> wrote: >>>>>>>>>> If user space wants to bind page tables, create the PASID >>>>>>>>>> with /dev/sva, use ioctls there to setup the page table >>>>>>>>>> the way it wants, then pass the now configured PASID to a >>>>>>>>>> driver that can use it. >>>>>>>>> Are we talking about bare metal SVA? >>>>>>>> What a weird term. >>>>>>> Glad you noticed it at v7 :-) >>>>>>> >>>>>>> Any suggestions on something less weird than >>>>>>> Shared Virtual Addressing? There is a reason why we moved from >>>>>>> SVM to SVA. >>>>>> SVA is fine, what is "bare metal" supposed to mean? >>>>>> >>>>> What I meant here is sharing virtual address between DMA and host >>>>> process. This requires devices perform DMA request with PASID and >>>>> use IOMMU first level/stage 1 page tables. >>>>> This can be further divided into 1) user SVA 2) supervisor SVA >>>>> (sharing init_mm) >>>>> >>>>> My point is that /dev/sva is not useful here since the driver can >>>>> perform PASID allocation while doing SVA bind. >>>> No, you are thinking too small. >>>> >>>> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the >>>> SVA. >>> Could you point to me the SVA UAPI? I couldn't find it in the >>> mainline. Seems VDPA uses VHOST interface? >> >> It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h. 
>> > Thanks for the pointer, for complete vSVA functionality we would need > 1 TLB flush (IOTLB and PASID cache etc.) > 2 PASID alloc/free > 3 bind/unbind page tables or PASID tables > 4 Page request service > > Seems vhost_iotlb_msg can be used for #1 partially. And the > proposal is to pluck out the rest into /dev/sda? Seems awkward as Alex > pointed out earlier for similar situation in VFIO. Considering it doesn't have any PASID support yet, my understanding is that if we go with /dev/sva: - the vhost uAPI will still keep the uAPI for associating an ASID to a specific virtqueue - except for this, we can use /dev/sva for all the rest of the (P)ASID operations > >>> >>>> When VDPA is used by DPDK it makes sense that the PASID will be SVA >>>> and 1:1 with the mm_struct. >>>> >>> I still don't see why bare metal DPDK needs to get a handle of the >>> PASID. >> >> My understanding is that it may: >> >> - have a unified uAPI with vSVA: alloc, bind, unbind, free > Got your point, but vSVA needs more than these Yes, it's just a subset of what vSVA requires. > >> - leave the binding policy to userspace instead of the using a >> implied one in the kenrel >> > Only if necessary. Yes, I think it's all about visibility (flexibility) and manageability. Consider a device that has queues A, B, and C. We dedicate queues A and B to one PASID (for vSVA) and C to another PASID (for SVA). It looks to me like the current sva_bind() API doesn't support this. We still need an API for allocating a PASID for SVA and assigning it to the (mediated) device. This case is pretty common for implementing a shadow queue for a guest. > >>> Perhaps the SVA patch would explain. Or are you talking about >>> vDPA DPDK process that is used to support virtio-net-pmd in the >>> guest? >>>> When VDPA is used by qemu it makes sense that the PASID will be an >>>> arbitary IOVA map constructed to be 1:1 with the guest vCPU >>>> physical map.
/dev/sva allows a single uAPI to do this kind of >>>> setup, and qemu can support it while supporting a range of SVA >>>> kernel drivers. VDPA and vfio-mdev are obvious initial targets. >>>> >>>> *BOTH* are needed. >>>> >>>> In general any uAPI for PASID should have the option to use either >>>> the mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs >>>> virtually nothing to implement this in the driver as PASID is just >>>> a number, and gives so much more flexibility. >>>> >>> Not really nothing in terms of PASID life cycles. For example, if a >>> user uses the uacce interface to open an accelerator, it gets an >>> FD_acc. Then it opens /dev/sva to allocate a PASID, then gets another >>> FD_pasid. Then we pass FD_pasid to the driver to bind page tables, >>> perhaps multiple drivers. Now we have to worry about if FD_pasid >>> gets closed before the FD_acc(s) are closed, and all these race conditions. >> >> I'm not sure I understand this. But this demonstrates the flexibility >> of a unified uAPI. E.g. it allows a vDPA and a VFIO device to use the >> same PASID, which can be shared with a process in the guest. >> > This is for user DMA, not for vSVA. I was contending that /dev/sva > creates unnecessary steps for such usage. A question here is where the PASID management is expected to be done. I'm not quite sure the silent 1:1 binding done in intel_svm_bind_mm() can satisfy the requirements of a management layer. > > For vSVA, I think vDPA and VFIO can potentially share but I am not > seeing convincing benefits. > > If a guest process wants to do SVA with a VFIO assigned device and a > vDPA-backed virtio-net at the same time, it might be a limitation if > PASID is not managed via a common interface. Yes. > But I am not sure what vDPA > SVA support will look like, does it support gIOVA? need virtio IOMMU? Yes, it supports gIOVA and it should work with any type of vIOMMU. I think vDPA will start from Intel vIOMMU support in QEMU. 
For virtio IOMMU, we will probably support it in the future, considering it doesn't have any SVA capability and it doesn't use a page table that can be nested via a hardware IOMMU. >> For the race condition, it could probably be solved with refcnt. >> > Agreed but the best solution might be not to have the problem in the > first place :) I agree, it's only worth bothering with if it has real benefits. Thanks > >> Thanks >> >> >>> If we do not expose FD_pasid to the user, the teardown is much >>> simpler and streamlined. Following each FD_acc close, PASID unbind >>> is performed. >>>>> Yi can correct me but this set is about VFIO-PCI, VFIO-mdev >>>>> will be introduced later. >>>> Last patch is: >>>> >>>> vfio/type1: Add vSVA support for IOMMU-backed mdevs >>>> >>>> So pretty hard to see how this is not about vfio-mdev, at least a >>>> little.. >>>> >>>> Jason >>> Thanks, >>> >>> Jacob >>> > > Thanks, > > Jacob > _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
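For reference, the vhost message discussed above is small. The sketch below mirrors struct vhost_iotlb_msg as defined in mainline uapi/linux/vhost_types.h around the time of this thread (the field layout and message-type values are reproduced from that header as a userspace stand-in, not a new definition), and illustrates why it only covers the IOTLB-flush part of the four vSVA operations Jacob lists: the message carries no PASID.

```c
#include <stdint.h>

/* Userspace mirror of struct vhost_iotlb_msg (uapi/linux/vhost_types.h). */
struct vhost_iotlb_msg {
	uint64_t iova;   /* guest IOVA the message refers to */
	uint64_t size;   /* length of the mapping */
	uint64_t uaddr;  /* backing userspace virtual address */
#define VHOST_ACCESS_RO 0x1
#define VHOST_ACCESS_WO 0x2
#define VHOST_ACCESS_RW 0x3
	uint8_t perm;
#define VHOST_IOTLB_MISS       1
#define VHOST_IOTLB_UPDATE     2
#define VHOST_IOTLB_INVALIDATE 3
	uint8_t type;
};

/* Build an invalidation covering [iova, iova + size). Of the four vSVA
 * operations (TLB flush, PASID alloc/free, bind/unbind page tables, page
 * request service), this is the only one the message can express: there
 * is no PASID field, so PASID-scoped flushes and the other three
 * operations need a different interface. */
static struct vhost_iotlb_msg iotlb_invalidate(uint64_t iova, uint64_t size)
{
	struct vhost_iotlb_msg msg = {
		.iova = iova,
		.size = size,
		.type = VHOST_IOTLB_INVALIDATE,
	};
	return msg;
}
```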
Hi Yi, On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote: > > This patch is added as instead of returning a boolean for DOMAIN_ATTR_NESTING, > iommu_domain_get_attr() should return an iommu_nesting_info handle. For > now, return an empty nesting info struct for now as true nesting is not > yet supported by the SMMUs. > > Cc: Will Deacon <will@kernel.org> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Eric Auger <eric.auger@redhat.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > Reviewed-by: Eric Auger <eric.auger@redhat.com> > --- > v5 -> v6: > *) add review-by from Eric Auger. > > v4 -> v5: > *) address comments from Eric Auger. > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++++++++++++++++++++++++++-- > drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 +++++++++++++++++++++++++++-- > 2 files changed, 54 insertions(+), 4 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 7196207..016e2e5 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) > return group; > } > > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain, > + void *data) > +{ > + struct iommu_nesting_info *info = (struct iommu_nesting_info *)data; > + unsigned int size; > + > + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) > + return -ENODEV; > + > + size = sizeof(struct iommu_nesting_info); > + > + /* > + * if provided buffer size is smaller than expected, should > + * return 0 and also the expected buffer size to caller. 
> + */ > + if (info->argsz < size) { > + info->argsz = size; > + return 0; > + } > + > + /* report an empty iommu_nesting_info for now */ > + memset(info, 0x0, size); > + info->argsz = size; > + return 0; > +} > + > static int arm_smmu_domain_get_attr(struct iommu_domain *domain, > enum iommu_attr attr, void *data) > { > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain, > case IOMMU_DOMAIN_UNMANAGED: > switch (attr) { > case DOMAIN_ATTR_NESTING: > - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); > - return 0; > + return arm_smmu_domain_nesting_info(smmu_domain, data); Thanks for the patch. This would unnecessarily overflow 'data' for any caller that's expecting only an int. Dump from one such issue that I was seeing when testing this change along with local kvmtool changes is pasted below [1]. I could get around the issue by adding another (iommu_attr) - DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info). Thanks & regards Vivek [1]-------------- [ 811.756516] vfio-pci 0000:08:00.1: vfio_ecap_init: hiding ecap 0x1b@0x108 [ 811.756516] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 [ 811.756516] CPU: 0 PID: 175 Comm: lkvm-cleanup-ne Not tainted 5.10.0-rc5-00096-gf015061e14cf #43 [ 811.756516] Call trace: [ 811.756516] dump_backtrace+0x0/0x1b0 [ 811.756516] show_stack+0x18/0x68 [ 811.756516] dump_stack+0xd8/0x134 [ 811.756516] panic+0x174/0x33c [ 811.756516] __stack_chk_fail+0x3c/0x40 [ 811.756516] vfio_pci_open+0x644/0x648 [ 811.756516] vfio_group_fops_unl_ioctl+0x4bc/0x648 [ 811.756516] 0x0 [ 811.756516] SMP: stopping secondary CPUs [ 811.756597] Kernel Offset: disabled [ 811.756597] CPU features: 0x0040006,6a00aa38 [ 811.756602] Memory Limit: none [ 811.768497] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 ] -------------
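The failure mode Vivek reports is a plain buffer overrun: the old contract passed a pointer to an int, while the new callee writes a whole struct through it. A minimal userspace sketch of both sides, using a simplified stand-in for struct iommu_nesting_info (not the exact uapi layout, only its property of being larger than an int matters here):

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct iommu_nesting_info: the only property
 * that matters for this demonstration is that it is larger than an int. */
struct nesting_info {
	uint32_t argsz;
	uint32_t flags;
	uint64_t features;
	uint8_t  data[56];
};

/* What the patched get_attr path does: it assumes 'data' points at a
 * buffer big enough for the whole struct and writes all of it. */
static void get_attr_nesting(void *data)
{
	struct nesting_info *info = data;

	memset(info, 0, sizeof(*info));
	info->argsz = sizeof(*info);
}

/* Correct caller after the uAPI change: pass the full struct, with argsz
 * pre-set so the callee can report the expected size back. A caller that
 * still passed an int here would be overrun by sizeof(info) - sizeof(int)
 * bytes, which is exactly the stack-protector panic in the dump. */
static uint64_t query_nesting_features(void)
{
	struct nesting_info info = { .argsz = sizeof(info) };

	get_attr_nesting(&info);
	return info.features;
}
```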
Hi Vivek, > From: Vivek Gautam <vivek.gautam@arm.com> > Sent: Tuesday, January 12, 2021 2:50 PM > > Hi Yi, > > > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote: > > > > This patch is added as instead of returning a boolean for > DOMAIN_ATTR_NESTING, > > iommu_domain_get_attr() should return an iommu_nesting_info handle. > For > > now, return an empty nesting info struct for now as true nesting is not > > yet supported by the SMMUs. > > > > Cc: Will Deacon <will@kernel.org> > > Cc: Robin Murphy <robin.murphy@arm.com> > > Cc: Eric Auger <eric.auger@redhat.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > > Reviewed-by: Eric Auger <eric.auger@redhat.com> > > --- > > v5 -> v6: > > *) add review-by from Eric Auger. > > > > v4 -> v5: > > *) address comments from Eric Auger. > > --- > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 > +++++++++++++++++++++++++++-- > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 > +++++++++++++++++++++++++++-- > > 2 files changed, 54 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > index 7196207..016e2e5 100644 > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > @@ -3019,6 +3019,32 @@ static struct iommu_group > *arm_smmu_device_group(struct device *dev) > > return group; > > } > > > > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain > *smmu_domain, > > + void *data) > > +{ > > + struct iommu_nesting_info *info = (struct iommu_nesting_info > *)data; > > + unsigned int size; > > + > > + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) > > + return -ENODEV; > > + > > + size = sizeof(struct iommu_nesting_info); > > + > > + /* > > + * if provided buffer 
size is smaller than expected, should > > * return 0 and also the expected buffer size to caller. > > + */ > > + if (info->argsz < size) { > > + info->argsz = size; > > + return 0; > > + } > > + > > + /* report an empty iommu_nesting_info for now */ > > + memset(info, 0x0, size); > > + info->argsz = size; > > + return 0; > > +} > > + > > static int arm_smmu_domain_get_attr(struct iommu_domain *domain, > > enum iommu_attr attr, void *data) > > { > > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct > iommu_domain *domain, > > case IOMMU_DOMAIN_UNMANAGED: > > switch (attr) { > > case DOMAIN_ATTR_NESTING: > > - *(int *)data = (smmu_domain->stage == > ARM_SMMU_DOMAIN_NESTED); > > - return 0; > > + return arm_smmu_domain_nesting_info(smmu_domain, > data); > > Thanks for the patch. > This would unnecessarily overflow 'data' for any caller that's expecting only > an int data. Dump from one such issue that I was seeing when testing > this change along with local kvmtool changes is pasted below [1]. > > I could get around with the issue by adding another (iommu_attr) - > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info). Nice to hear from you. At first, we planned to have a separate iommu_attr for getting nesting_info. However, we considered that there is no existing user which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu nesting info. Could you share the code base you are using? If the error you encountered is due to this change, there should be a place which gets DOMAIN_ATTR_NESTING. 
Regards, Yi Liu > Thanks & regards > Vivek > > [1]-------------- > [ 811.756516] vfio-pci 0000:08:00.1: vfio_ecap_init: hiding ecap > 0x1b@0x108 > [ 811.756516] Kernel panic - not syncing: stack-protector: Kernel > stack is corrupted in: vfio_pci_open+0x644/0x648 > [ 811.756516] CPU: 0 PID: 175 Comm: lkvm-cleanup-ne Not tainted > 5.10.0-rc5-00096-gf015061e14cf #43 > [ 811.756516] Call trace: > [ 811.756516] dump_backtrace+0x0/0x1b0 > [ 811.756516] show_stack+0x18/0x68 > [ 811.756516] dump_stack+0xd8/0x134 > [ 811.756516] panic+0x174/0x33c > [ 811.756516] __stack_chk_fail+0x3c/0x40 > [ 811.756516] vfio_pci_open+0x644/0x648 > [ 811.756516] vfio_group_fops_unl_ioctl+0x4bc/0x648 > [ 811.756516] 0x0 > [ 811.756516] SMP: stopping secondary CPUs > [ 811.756597] Kernel Offset: disabled > [ 811.756597] CPU features: 0x0040006,6a00aa38 > [ 811.756602] Memory Limit: none > [ 811.768497] ---[ end Kernel panic - not syncing: stack-protector: > Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 ] > -------------
Hi Yi, On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote: > > Hi Vivek, > > > From: Vivek Gautam <vivek.gautam@arm.com> > > Sent: Tuesday, January 12, 2021 2:50 PM > > > > Hi Yi, > > > > > > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote: > > > > > > This patch is added as instead of returning a boolean for > > DOMAIN_ATTR_NESTING, > > > iommu_domain_get_attr() should return an iommu_nesting_info handle. > > For > > > now, return an empty nesting info struct for now as true nesting is not > > > yet supported by the SMMUs. > > > > > > Cc: Will Deacon <will@kernel.org> > > > Cc: Robin Murphy <robin.murphy@arm.com> > > > Cc: Eric Auger <eric.auger@redhat.com> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > > > Reviewed-by: Eric Auger <eric.auger@redhat.com> > > > --- > > > v5 -> v6: > > > *) add review-by from Eric Auger. > > > > > > v4 -> v5: > > > *) address comments from Eric Auger. 
> > > --- > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 > > +++++++++++++++++++++++++++-- > > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 > > +++++++++++++++++++++++++++-- > > > 2 files changed, 54 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > index 7196207..016e2e5 100644 > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > @@ -3019,6 +3019,32 @@ static struct iommu_group > > *arm_smmu_device_group(struct device *dev) > > > return group; > > > } > > > > > > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain > > *smmu_domain, > > > + void *data) > > > +{ > > > + struct iommu_nesting_info *info = (struct iommu_nesting_info > > *)data; > > > + unsigned int size; > > > + > > > + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) > > > + return -ENODEV; > > > + > > > + size = sizeof(struct iommu_nesting_info); > > > + > > > + /* > > > + * if provided buffer size is smaller than expected, should > > > + * return 0 and also the expected buffer size to caller. > > > + */ > > > + if (info->argsz < size) { > > > + info->argsz = size; > > > + return 0; > > > + } > > > + > > > + /* report an empty iommu_nesting_info for now */ > > > + memset(info, 0x0, size); > > > + info->argsz = size; > > > + return 0; > > > +} > > > + > > > static int arm_smmu_domain_get_attr(struct iommu_domain *domain, > > > enum iommu_attr attr, void *data) > > > { > > > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct > > iommu_domain *domain, > > > case IOMMU_DOMAIN_UNMANAGED: > > > switch (attr) { > > > case DOMAIN_ATTR_NESTING: > > > - *(int *)data = (smmu_domain->stage == > > ARM_SMMU_DOMAIN_NESTED); > > > - return 0; > > > + return arm_smmu_domain_nesting_info(smmu_domain, > > data); > > > > Thanks for the patch. 
> > This would unnecessarily overflow 'data' for any caller that's expecting only > > an int data. Dump from one such issue that I was seeing when testing > > this change along with local kvmtool changes is pasted below [1]. > > > > I could get around with the issue by adding another (iommu_attr) - > > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info). > > nice to hear from you. At first, we planned to have a separate iommu_attr > for getting nesting_info. However, we considered there is no existing user > which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu nesting > info. Could you share me the code base you are using? If the error you > encountered is due to this change, so there should be a place which gets > DOMAIN_ATTR_NESTING. I am currently working on top of Eric's tree for nested stage support [1]. My best guess was that the vfio_pci_dma_fault_init() method [2] that is requesting DOMAIN_ATTR_NESTING causes the stack overflow and corruption. That's when I added a new attribute. I will soon publish my patches to the list for review. Let me know your thoughts. [1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13 [2] https://github.com/eauger/linux/blob/5.10-rc4-2stage-v13/drivers/vfio/pci/vfio_pci.c#L494 Thanks Vivek > > Regards, > Yi Liu [snip]
Hi Vivek, > From: Vivek Gautam <vivek.gautam@arm.com> > Sent: Tuesday, January 12, 2021 7:06 PM > > Hi Yi, > > > On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote: > > > > Hi Vivek, > > > > > From: Vivek Gautam <vivek.gautam@arm.com> > > > Sent: Tuesday, January 12, 2021 2:50 PM > > > > > > Hi Yi, > > > > > > > > > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote: > > > > > > > > This patch is added as instead of returning a boolean for > > > DOMAIN_ATTR_NESTING, > > > > iommu_domain_get_attr() should return an iommu_nesting_info > handle. > > > For > > > > now, return an empty nesting info struct for now as true nesting is not > > > > yet supported by the SMMUs. > > > > > > > > Cc: Will Deacon <will@kernel.org> > > > > Cc: Robin Murphy <robin.murphy@arm.com> > > > > Cc: Eric Auger <eric.auger@redhat.com> > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> > > > > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com> > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > > > > Reviewed-by: Eric Auger <eric.auger@redhat.com> > > > > --- > > > > v5 -> v6: > > > > *) add review-by from Eric Auger. > > > > > > > > v4 -> v5: > > > > *) address comments from Eric Auger. 
> > > > --- > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 > > > +++++++++++++++++++++++++++-- > > > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 > > > +++++++++++++++++++++++++++-- > > > > 2 files changed, 54 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > > index 7196207..016e2e5 100644 > > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > > @@ -3019,6 +3019,32 @@ static struct iommu_group > > > *arm_smmu_device_group(struct device *dev) > > > > return group; > > > > } > > > > > > > > +static int arm_smmu_domain_nesting_info(struct > arm_smmu_domain > > > *smmu_domain, > > > > + void *data) > > > > +{ > > > > + struct iommu_nesting_info *info = (struct iommu_nesting_info > > > *)data; > > > > + unsigned int size; > > > > + > > > > + if (!info || smmu_domain->stage != > ARM_SMMU_DOMAIN_NESTED) > > > > + return -ENODEV; > > > > + > > > > + size = sizeof(struct iommu_nesting_info); > > > > + > > > > + /* > > > > + * if provided buffer size is smaller than expected, should > > > > + * return 0 and also the expected buffer size to caller. 
> > > > + */ > > > > + if (info->argsz < size) { > > > > + info->argsz = size; > > > > + return 0; > > > > + } > > > > + > > > > + /* report an empty iommu_nesting_info for now */ > > > > + memset(info, 0x0, size); > > > > + info->argsz = size; > > > > + return 0; > > > > +} > > > > + > > > > static int arm_smmu_domain_get_attr(struct iommu_domain > *domain, > > > > enum iommu_attr attr, void *data) > > > > { > > > > @@ -3028,8 +3054,7 @@ static int > arm_smmu_domain_get_attr(struct > > > iommu_domain *domain, > > > > case IOMMU_DOMAIN_UNMANAGED: > > > > switch (attr) { > > > > case DOMAIN_ATTR_NESTING: > > > > - *(int *)data = (smmu_domain->stage == > > > ARM_SMMU_DOMAIN_NESTED); > > > > - return 0; > > > > + return > arm_smmu_domain_nesting_info(smmu_domain, > > > data); > > > > > > Thanks for the patch. > > > This would unnecessarily overflow 'data' for any caller that's expecting > only > > > an int data. Dump from one such issue that I was seeing when testing > > > this change along with local kvmtool changes is pasted below [1]. > > > > > > I could get around with the issue by adding another (iommu_attr) - > > > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info). > > > > nice to hear from you. At first, we planned to have a separate iommu_attr > > for getting nesting_info. However, we considered there is no existing user > > which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu > nesting > > info. Could you share me the code base you are using? If the error you > > encountered is due to this change, so there should be a place which gets > > DOMAIN_ATTR_NESTING. > > I am currently working on top of Eric's tree for nested stage support [1]. > My best guess was that the vfio_pci_dma_fault_init() method [2] that is > requesting DOMAIN_ATTR_NESTING causes stack overflow, and corruption. > That's when I added a new attribute. I see. I think there needs a change in the code there. 
It should expect a nesting_info to be returned instead of an int now. @Eric, what's your opinion? domain = iommu_get_domain_for_dev(&vdev->pdev->dev); ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info); if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { /* * No need to go further as there is no page request service support. */ return 0; } https://github.com/luxis1999/linux-vsva/blob/vsva-linux-5.9-rc6-v8%2BPRQ/drivers/vfio/pci/vfio_pci.c Regards, Yi Liu > I will soon publish my patches to the list for review. Let me know > your thoughts. > > [1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13 > [2] https://github.com/eauger/linux/blob/5.10-rc4-2stage- > v13/drivers/vfio/pci/vfio_pci.c#L494 > > Thanks > Vivek > > > > > Regards, > > Yi Liu > > [snip]
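The updated check Yi sketches relies on the argsz handshake this series uses for DOMAIN_ATTR_NESTING: the caller seeds argsz, and a too-small buffer just gets the expected size reported back instead of an error. A small userspace sketch of that handshake and the early return on missing page request support (the struct layout and the feature-bit value are simplified stand-ins for the proposed uapi, not the final definitions):

```c
#include <stdint.h>
#include <string.h>

/* Stand-in value for IOMMU_NESTING_FEAT_PAGE_RESP (assumed, not the
 * actual uapi bit). */
#define NESTING_FEAT_PAGE_RESP (1ULL << 2)

/* Simplified stand-in for struct iommu_nesting_info. */
struct nesting_info {
	uint32_t argsz;
	uint64_t features;
};

/* Mimics arm_smmu_domain_nesting_info() from the patch: if the supplied
 * buffer is too small, report the expected size and return 0 rather
 * than failing; otherwise fill in an (empty) info struct. */
static int get_nesting_info(struct nesting_info *info)
{
	if (info->argsz < sizeof(*info)) {
		info->argsz = sizeof(*info);
		return 0;
	}
	memset(info, 0, sizeof(*info));
	info->argsz = sizeof(*info);
	return 0;
}

/* Caller pattern from the snippet above: seed argsz, then bail out
 * early when page request service is not reported. */
static int dma_fault_init(void)
{
	struct nesting_info info = { .argsz = sizeof(info) };
	int ret = get_nesting_info(&info);

	if (ret || !(info.features & NESTING_FEAT_PAGE_RESP))
		return 0;  /* no PRS support, nothing to set up */
	/* ... fault-region setup would go here ... */
	return 1;
}
```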
Hi Yi, Vivek, On 1/13/21 6:56 AM, Liu, Yi L wrote: > Hi Vivek, > >> From: Vivek Gautam <vivek.gautam@arm.com> >> Sent: Tuesday, January 12, 2021 7:06 PM >> >> Hi Yi, >> >> >> On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote: >>> >>> Hi Vivek, >>> >>>> From: Vivek Gautam <vivek.gautam@arm.com> >>>> Sent: Tuesday, January 12, 2021 2:50 PM >>>> >>>> Hi Yi, >>>> >>>> >>>> On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote: >>>>> >>>>> This patch is added as instead of returning a boolean for >>>> DOMAIN_ATTR_NESTING, >>>>> iommu_domain_get_attr() should return an iommu_nesting_info >> handle. >>>> For >>>>> now, return an empty nesting info struct for now as true nesting is not >>>>> yet supported by the SMMUs. >>>>> >>>>> Cc: Will Deacon <will@kernel.org> >>>>> Cc: Robin Murphy <robin.murphy@arm.com> >>>>> Cc: Eric Auger <eric.auger@redhat.com> >>>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> >>>>> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> >>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com> >>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> >>>>> Reviewed-by: Eric Auger <eric.auger@redhat.com> >>>>> --- >>>>> v5 -> v6: >>>>> *) add review-by from Eric Auger. >>>>> >>>>> v4 -> v5: >>>>> *) address comments from Eric Auger. 
>>>>> --- >>>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 >>>> +++++++++++++++++++++++++++-- >>>>> drivers/iommu/arm/arm-smmu/arm-smmu.c | 29 >>>> +++++++++++++++++++++++++++-- >>>>> 2 files changed, 54 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>>> index 7196207..016e2e5 100644 >>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>>> @@ -3019,6 +3019,32 @@ static struct iommu_group >>>> *arm_smmu_device_group(struct device *dev) >>>>> return group; >>>>> } >>>>> >>>>> +static int arm_smmu_domain_nesting_info(struct >> arm_smmu_domain >>>> *smmu_domain, >>>>> + void *data) >>>>> +{ >>>>> + struct iommu_nesting_info *info = (struct iommu_nesting_info >>>> *)data; >>>>> + unsigned int size; >>>>> + >>>>> + if (!info || smmu_domain->stage != >> ARM_SMMU_DOMAIN_NESTED) >>>>> + return -ENODEV; >>>>> + >>>>> + size = sizeof(struct iommu_nesting_info); >>>>> + >>>>> + /* >>>>> + * if provided buffer size is smaller than expected, should >>>>> + * return 0 and also the expected buffer size to caller. >>>>> + */ >>>>> + if (info->argsz < size) { >>>>> + info->argsz = size; >>>>> + return 0; >>>>> + } >>>>> + >>>>> + /* report an empty iommu_nesting_info for now */ >>>>> + memset(info, 0x0, size); >>>>> + info->argsz = size; >>>>> + return 0; >>>>> +} >>>>> + >>>>> static int arm_smmu_domain_get_attr(struct iommu_domain >> *domain, >>>>> enum iommu_attr attr, void *data) >>>>> { >>>>> @@ -3028,8 +3054,7 @@ static int >> arm_smmu_domain_get_attr(struct >>>> iommu_domain *domain, >>>>> case IOMMU_DOMAIN_UNMANAGED: >>>>> switch (attr) { >>>>> case DOMAIN_ATTR_NESTING: >>>>> - *(int *)data = (smmu_domain->stage == >>>> ARM_SMMU_DOMAIN_NESTED); >>>>> - return 0; >>>>> + return >> arm_smmu_domain_nesting_info(smmu_domain, >>>> data); >>>> >>>> Thanks for the patch. 
>>>> This would unnecessarily overflow 'data' for any caller that's expecting >> only >>>> an int data. Dump from one such issue that I was seeing when testing >>>> this change along with local kvmtool changes is pasted below [1]. >>>> >>>> I could get around with the issue by adding another (iommu_attr) - >>>> DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info). >>> >>> nice to hear from you. At first, we planned to have a separate iommu_attr >>> for getting nesting_info. However, we considered there is no existing user >>> which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu >> nesting >>> info. Could you share me the code base you are using? If the error you >>> encountered is due to this change, so there should be a place which gets >>> DOMAIN_ATTR_NESTING. >> >> I am currently working on top of Eric's tree for nested stage support [1]. >> My best guess was that the vfio_pci_dma_fault_init() method [2] that is >> requesting DOMAIN_ATTR_NESTING causes stack overflow, and corruption. >> That's when I added a new attribute. > > I see. I think there needs a change in the code there. Should also expect > a nesting_info returned instead of an int anymore. @Eric, how about your > opinion? > > domain = iommu_get_domain_for_dev(&vdev->pdev->dev); > ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info); > if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { > /* > * No need go futher as no page request service support. > */ > return 0; > } Sure I think it is "just" a matter of synchro between the 2 series. Yi, do you have plans to respin part of [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs or would you allow me to embed this patch in my series. Thanks Eric > > https://github.com/luxis1999/linux-vsva/blob/vsva-linux-5.9-rc6-v8%2BPRQ/drivers/vfio/pci/vfio_pci.c > > Regards, > Yi Liu > >> I will soon publish my patches to the list for review. Let me know >> your thoughts. 
>> >> [1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13 >> [2] https://github.com/eauger/linux/blob/5.10-rc4-2stage- >> v13/drivers/vfio/pci/vfio_pci.c#L494 >> >> Thanks >> Vivek >> >>> >>> Regards, >>> Yi Liu >> >> [snip]
Hi Eric, > From: Auger Eric <eric.auger@redhat.com> > Sent: Tuesday, January 19, 2021 6:03 PM > > Hi Yi, Vivek, > [...] > > I see. I think there needs a change in the code there. Should also expect > > a nesting_info returned instead of an int anymore. @Eric, how about your > > opinion? > > > > domain = iommu_get_domain_for_dev(&vdev->pdev->dev); > > ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, > &info); > > if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { > > /* > > * No need go futher as no page request service support. > > */ > > return 0; > > } > Sure I think it is "just" a matter of synchro between the 2 series. Yi, exactly. > do you have plans to respin part of > [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs > or would you allow me to embed this patch in my series. My v7 hasn't touched the PRQ change yet. So I think it's better for you to embed it in your series. ^_^ Regards, Yi Liu
Hi Yi, On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote: > > Hi Eric, > > > From: Auger Eric <eric.auger@redhat.com> > > Sent: Tuesday, January 19, 2021 6:03 PM > > > > Hi Yi, Vivek, > > > [...] > > > I see. I think there needs a change in the code there. Should also expect > > > a nesting_info returned instead of an int anymore. @Eric, how about your > > > opinion? > > > > > > domain = iommu_get_domain_for_dev(&vdev->pdev->dev); > > > ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, > > &info); > > > if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { > > > /* > > > * No need go futher as no page request service support. > > > */ > > > return 0; > > > } > > Sure I think it is "just" a matter of synchro between the 2 series. Yi, > > exactly. > > > do you have plans to respin part of > > [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs > > or would you allow me to embed this patch in my series. > > My v7 hasn’t touch the prq change yet. So I think it's better for you to > embed it to your series. ^_^ > Can you please let me know if you have an updated series of these patches? It will help me to work with virtio-iommu/arm side changes. Thanks & regards Vivek
Hi Vivek, Yi, On 2/12/21 8:14 AM, Vivek Gautam wrote: > Hi Yi, > > > On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote: >> >> Hi Eric, >> >>> From: Auger Eric <eric.auger@redhat.com> >>> Sent: Tuesday, January 19, 2021 6:03 PM >>> >>> Hi Yi, Vivek, >>> >> [...] >>>> I see. I think there needs a change in the code there. Should also expect >>>> a nesting_info returned instead of an int anymore. @Eric, how about your >>>> opinion? >>>> >>>> domain = iommu_get_domain_for_dev(&vdev->pdev->dev); >>>> ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, >>> &info); >>>> if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { >>>> /* >>>> * No need go futher as no page request service support. >>>> */ >>>> return 0; >>>> } >>> Sure I think it is "just" a matter of synchro between the 2 series. Yi, >> >> exactly. >> >>> do you have plans to respin part of >>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs >>> or would you allow me to embed this patch in my series. >> >> My v7 hasn’t touch the prq change yet. So I think it's better for you to >> embed it to your series. ^_^>> > > Can you please let me know if you have an updated series of these > patches? It will help me to work with virtio-iommu/arm side changes. As per the previous discussion, I plan to take those 2 patches in my SMMUv3 nested stage series: [PATCH v7 01/16] iommu: Report domain nesting info [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info we need to upgrade both since we do not want to report an empty nesting info anymore, for arm. Thanks Eric > > Thanks & regards > Vivek >
Hi Eric, On 2/12/21 3:27 PM, Auger Eric wrote: > Hi Vivek, Yi, > > On 2/12/21 8:14 AM, Vivek Gautam wrote: >> Hi Yi, >> >> >> On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote: >>> >>> Hi Eric, >>> >>>> From: Auger Eric <eric.auger@redhat.com> >>>> Sent: Tuesday, January 19, 2021 6:03 PM >>>> >>>> Hi Yi, Vivek, >>>> >>> [...] >>>>> I see. I think there needs a change in the code there. Should also expect >>>>> a nesting_info returned instead of an int anymore. @Eric, how about your >>>>> opinion? >>>>> >>>>> domain = iommu_get_domain_for_dev(&vdev->pdev->dev); >>>>> ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, >>>> &info); >>>>> if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) { >>>>> /* >>>>> * No need go futher as no page request service support. >>>>> */ >>>>> return 0; >>>>> } >>>> Sure I think it is "just" a matter of synchro between the 2 series. Yi, >>> >>> exactly. >>> >>>> do you have plans to respin part of >>>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs >>>> or would you allow me to embed this patch in my series. >>> >>> My v7 hasn’t touch the prq change yet. So I think it's better for you to >>> embed it to your series. ^_^>> >> >> Can you please let me know if you have an updated series of these >> patches? It will help me to work with virtio-iommu/arm side changes. > > As per the previous discussion, I plan to take those 2 patches in my > SMMUv3 nested stage series: > > [PATCH v7 01/16] iommu: Report domain nesting info > [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info > > we need to upgrade both since we do not want to report an empty nesting > info anymore, for arm. Absolutely. Let me send the couple of patches that I have been using, that add arm configuration. Best regards Vivek > > Thanks > > Eric >> >> Thanks & regards >> Vivek >>
Hi Eric,

On 2/12/21 3:48 PM, Vivek Kumar Gautam wrote:
> Hi Eric,
>
> On 2/12/21 3:27 PM, Auger Eric wrote:
>> Hi Vivek, Yi,
>>
>> On 2/12/21 8:14 AM, Vivek Gautam wrote:
>>> Hi Yi,
>>>
>>> On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>>> From: Auger Eric <eric.auger@redhat.com>
>>>>> Sent: Tuesday, January 19, 2021 6:03 PM
>>>>>
>>>>> Hi Yi, Vivek,
>>>>>
>>>> [...]
>>>>>> I see. I think there needs to be a change in the code there. We
>>>>>> should also expect a nesting_info to be returned instead of an int
>>>>>> now. @Eric, what is your opinion?
>>>>>>
>>>>>> domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
>>>>>> ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info);
>>>>>> if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
>>>>>>         /*
>>>>>>          * No need to go further as there is no page request
>>>>>>          * service support.
>>>>>>          */
>>>>>>         return 0;
>>>>>> }
>>>>> Sure, I think it is "just" a matter of synchro between the 2 series.
>>>>> Yi,
>>>>
>>>> exactly.
>>>>
>>>>> do you have plans to respin part of
>>>>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
>>>>> or would you allow me to embed this patch in my series?
>>>>
>>>> My v7 hasn't touched the prq change yet. So I think it's better for
>>>> you to embed it in your series. ^_^
>>>
>>> Can you please let me know if you have an updated series of these
>>> patches? It will help me to work with the virtio-iommu/arm side changes.
>>
>> As per the previous discussion, I plan to take those 2 patches in my
>> SMMUv3 nested stage series:
>>
>> [PATCH v7 01/16] iommu: Report domain nesting info
>> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
>>
>> We need to upgrade both since we do not want to report an empty nesting
>> info anymore, for arm.
>
> Absolutely. Let me send the couple of patches that I have been using,
> which add the arm configuration.

Posted the couple of patches that I have been using:
https://lore.kernel.org/linux-iommu/20210212105859.8445-1-vivek.gautam@arm.com/T/#t

Thanks & regards
Vivek

> Best regards
> Vivek
>
>> Thanks
>>
>> Eric
>>> Thanks & regards
>>> Vivek
Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Friday, February 12, 2021 5:58 PM
>
> Hi Vivek, Yi,
>
> On 2/12/21 8:14 AM, Vivek Gautam wrote:
> > Hi Yi,
> >
> > On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
> >>
> >> Hi Eric,
> >>
> >>> From: Auger Eric <eric.auger@redhat.com>
> >>> Sent: Tuesday, January 19, 2021 6:03 PM
> >>>
> >>> Hi Yi, Vivek,
> >>>
> >> [...]
> >>>> I see. I think there needs to be a change in the code there. We
> >>>> should also expect a nesting_info to be returned instead of an int
> >>>> now. @Eric, what is your opinion?
> >>>>
> >>>> domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> >>>> ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info);
> >>>> if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> >>>>         /*
> >>>>          * No need to go further as there is no page request
> >>>>          * service support.
> >>>>          */
> >>>>         return 0;
> >>>> }
> >>> Sure, I think it is "just" a matter of synchro between the 2 series. Yi,
> >>
> >> exactly.
> >>
> >>> do you have plans to respin part of
> >>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> >>> or would you allow me to embed this patch in my series?
> >>
> >> My v7 hasn't touched the prq change yet. So I think it's better for you to
> >> embed it in your series. ^_^
> >
> > Can you please let me know if you have an updated series of these
> > patches? It will help me to work with the virtio-iommu/arm side changes.
>
> As per the previous discussion, I plan to take those 2 patches in my
> SMMUv3 nested stage series:
>
> [PATCH v7 01/16] iommu: Report domain nesting info
> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
>
> We need to upgrade both since we do not want to report an empty nesting
> info anymore, for arm.

Sorry for the late response. I've sent out the updated version. Also,
yes, please feel free to take the patch into your series.

https://lore.kernel.org/linux-iommu/20210302203545.436623-2-yi.l.liu@intel.com/

Regards,
Yi Liu

> Thanks
>
> Eric
> > Thanks & regards
> > Vivek