* [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Shared Virtual Address (SVA), a.k.a. Shared Virtual Memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.
This series is intended to enable SVA virtualization, i.e. the use of SVA
within a guest user application.

This is the remaining portion of the original patchset, based on
Joerg's x86/vt-d branch. The preparatory and cleanup patches have already
been merged there.
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)

Only the IOMMU portion of the changes is included in this series. Additional
support is needed in VFIO and QEMU (to be submitted separately) to complete
this functionality.

To keep changes incremental and reduce the size of each patchset, this series
does not include support for page request services.

In the VT-d implementation, the PASID table is per device and maintained in
the host. The guest PASID table is shadowed in the VMM, where the virtual
IOMMU is emulated.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
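
For illustration, below is a minimal sketch of the VMM-side bind request
that results from a guest PASID cache flush. This is not part of this
series: field names follow include/uapi/linux/iommu.h as extended here,
iommu_sva_bind_gpasid() is assumed from the already-merged common code,
and gcr3, host_pasid and guest_pasid stand in for state the VMM tracks.

    static int shadow_guest_pasid_entry(struct iommu_domain *domain,
                                        struct device *dev, u64 gcr3,
                                        u32 host_pasid, u32 guest_pasid)
    {
            struct iommu_gpasid_bind_data data = {
                    .version    = IOMMU_GPASID_BIND_VERSION_1,
                    .format     = IOMMU_PASID_FORMAT_INTEL_VTD,
                    .flags      = IOMMU_SVA_GPASID_VAL, /* guest PASID valid */
                    .gpgd       = gcr3,        /* guest FL PGD (CR3), in GPA */
                    .hpasid     = host_pasid,  /* host PASID backing the bind */
                    .gpasid     = guest_pasid, /* guest view of the PASID */
                    .addr_width = 48,          /* 4-level guest paging */
            };

            return iommu_sva_bind_gpasid(domain, dev, &data);
    }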

This is the remaining VT-d-only portion of v5, since the uAPIs and IOASID
common code have been applied to Joerg's IOMMU core branch.
(https://lkml.org/lkml/2019/10/2/833)

The complete set with VFIO patches is available here:
https://github.com/jacobpan/linux.git:siov_sva

The complete nested SVA upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. Page Request Services (PRS) support
    3. Mediated device assignment

With this set and the accompanying VFIO code, we will achieve phase #1.

Thanks,

Jacob

ChangeLog:
	- v10
	  - Addressed Eric's reviews of v7 and v9. Most fixes are in 3/10 and
	    6/10: extra condition checks and consolidation of duplicated code.

	- v9
	  - Addressed Baolu's comments for v8 for IOTLB flush consolidation,
	    bug fixes
	  - Removed IOASID notifier code which will be submitted separately
	    to address PASID life cycle management with multiple users.

	- v8
	  - Extracted cleanup patches from V7 and accepted into maintainer's
	    tree (https://lkml.org/lkml/2019/12/2/514).
	  - Added IOASID notifier and VT-d handler for termination of PASID
	    IOMMU context upon free. This will ensure success of VFIO IOASID
	    free API regardless of whether the PASID is in use.
	    (https://lore.kernel.org/linux-iommu/1571919983-3231-1-git-send-email-yi.l.liu@intel.com/)

	- V7
	  - Respect vIOMMU PASID range in virtual command PASID/IOASID allocator
	  - Cache virtual command capabilities to avoid runtime checks that
	    could cause vmexits.

	- V6
	  - Rebased on top of Joerg's core branch
	  (git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
	  - Adapt to new uAPIs and IOASID allocators

	- V5
	  - Rebased on v5.3-rc4, which has some of the IOMMU fault APIs merged.
	  - Addressed v4 review comments from Eric Auger, Baolu Lu, and
	    Jonathan Cameron. Specific changes are as follows:
	  - Refined the custom IOASID allocator to support multiple vIOMMUs
	    and hotplug cases.
	  - Extracted vendor data from the IOMMU guest PASID bind data; for
	    VT-d this supports all necessary guest PASID entry fields for
	    PASID bind.
	  - Support non-identity host-guest PASID mapping
	  - Exception handling in various cases

	- V4
	  - Redesigned IOASID allocator such that it can support custom
	  allocators with shared helper functions. Use separate XArray
	  to store IOASIDs per allocator. Took advice from Eric Auger to
	  have default allocator use the generic allocator structure.
	  Combined into one patch since the default allocator is just
	  "another" allocator now. It can be built as a module in case of
	  driver use without an IOMMU.
	  - Extended bind guest PASID data to support SMMU and non-identity
	  guest to host PASID mapping https://lkml.org/lkml/2019/5/21/802
	  - Rebased on Jean's sva/api common tree; new patches start with
	   [PATCH v4 10/22]

	- V3
	  - Addressed thorough review comments from Eric Auger (Thank you!)
	  - Moved IOASID allocator from driver core to IOMMU code per
	    suggestion by Christoph Hellwig
	    (https://lkml.org/lkml/2019/4/26/462)
	  - Rebased on top of Jean's SVA API branch and Eric's v7[1]
	    (git://linux-arm.org/linux-jpb.git sva/api)
	  - All IOMMU APIs are unmodified (except the new bind guest PASID
	    call in patch 9/16)

	- V2
	  - Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
	  - Integrated with Eric Auger's new v7 series for common APIs
	  (https://github.com/eauger/linux/tree/v5.1-rc3-2stage-v7)
	  - Addressed review comments from Andy Shevchenko and Alex Williamson on
	    IOASID custom allocator.
	  - Support multiple custom IOASID allocators (vIOMMUs) and dynamic
	    registration.


Jacob Pan (10):
  iommu/vt-d: Move domain helper to header
  iommu/uapi: Define a mask for bind data
  iommu/vt-d: Add a helper function to skip agaw
  iommu/vt-d: Use helper function to skip agaw for SL
  iommu/vt-d: Add nested translation helper function
  iommu/vt-d: Add bind guest PASID support
  iommu/vt-d: Support flushing more translation cache types
  iommu/vt-d: Add svm/sva invalidate function
  iommu/vt-d: Cache virtual command capability register
  iommu/vt-d: Add custom allocator for IOASID

Lu Baolu (1):
  iommu/vt-d: Enlightened PASID allocation

 drivers/iommu/dmar.c        |  37 +++++
 drivers/iommu/intel-iommu.c | 276 +++++++++++++++++++++++++++++++++++-
 drivers/iommu/intel-pasid.c | 336 ++++++++++++++++++++++++++++++++++++++++++--
 drivers/iommu/intel-pasid.h |  25 +++-
 drivers/iommu/intel-svm.c   | 224 +++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  45 +++++-
 include/linux/intel-svm.h   |  17 +++
 include/uapi/linux/iommu.h  |   5 +-
 8 files changed, 938 insertions(+), 27 deletions(-)

-- 
2.7.4


* [PATCH V10 01/11] iommu/vt-d: Move domain helper to header
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Move domain helper to header to be used by SVA code.
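
The helper is a one-line container_of() conversion; a typical use from
the SVA code later in this series:

    struct dmar_domain *dmar_domain = to_dmar_domain(domain);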

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/iommu/intel-iommu.c | 6 ------
 include/linux/intel-iommu.h | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4be549478691..e599b2537b1c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -446,12 +446,6 @@ static void init_translation_status(struct intel_iommu *iommu)
 		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
 }
 
-/* Convert generic 'struct iommu_domain to private struct dmar_domain */
-static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
-{
-	return container_of(dom, struct dmar_domain, domain);
-}
-
 static int __init intel_iommu_setup(char *str)
 {
 	if (!str)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 980234ae0312..ed7171d2ae1f 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -595,6 +595,12 @@ static inline void __iommu_flush_cache(
 		clflush_cache_range(addr, size);
 }
 
+/* Convert generic struct iommu_domain to private struct dmar_domain */
+static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
+{
+	return container_of(dom, struct dmar_domain, domain);
+}
+
 /*
  * 0: readable
  * 1: writable
-- 
2.7.4


* [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Memory type related flags can be grouped together for one simple check.
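
For example, patch 05/11 uses the mask to reject guest bind data that
requests memory type controls when the IOMMU lacks MTS support:

    if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
            /* guest asked for memory type fields, but !ecap_mts() */
            return -EINVAL;
    }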

---
v9: renamed EMT to MTS since these are memory type support flags.
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/uapi/linux/iommu.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4ad3496e5c43..d7bcbc5f79b0 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
 	__u32 pat;
 	__u32 emt;
 };
-
+#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
 /**
  * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
  * @version:	Version of this data structure
-- 
2.7.4


* [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Add a helper to walk down the second level page table and skip its top
levels when the IOMMU supports a smaller adjusted guest address width
(agaw) than the one the domain was built with. Unnecessary for
pass-through (PT) mode.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 22b30f10b396..191508c7c03e 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 }
 
 /*
+ * Skip top levels of page tables for iommu which has less agaw
+ * than default. Unnecessary for PT mode.
+ */
+static inline int iommu_skip_agaw(struct dmar_domain *domain,
+				  struct intel_iommu *iommu,
+				  struct dma_pte **pgd)
+{
+	int agaw;
+
+	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
+		*pgd = phys_to_virt(dma_pte_addr(*pgd));
+		if (!dma_pte_present(*pgd)) {
+			return -EINVAL;
+		}
+	}
+	pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n", __func__, (u64)*pgd,
+		iommu->agaw, domain->agaw);
+
+	return agaw;
+}
+
+/*
  * Set up the scalable mode pasid entry for second only translation type.
  */
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
-- 
2.7.4


* [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Use the iommu_skip_agaw() helper introduced in the previous patch to
replace the open-coded agaw walk in intel_pasid_setup_second_level().

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 191508c7c03e..9bdb7ee228b6 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -544,17 +544,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 		return -EINVAL;
 	}
 
-	/*
-	 * Skip top levels of page tables for iommu which has less agaw
-	 * than default. Unnecessary for PT mode.
-	 */
 	pgd = domain->pgd;
-	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
-		pgd = phys_to_virt(dma_pte_addr(pgd));
-		if (!dma_pte_present(pgd)) {
-			dev_err(dev, "Invalid domain page table\n");
-			return -EINVAL;
-		}
+	agaw = iommu_skip_agaw(domain, iommu, &pgd);
+	if (agaw < 0) {
+		dev_err(dev, "Invalid domain page table\n");
+		return -EINVAL;
 	}
 
 	pgd_val = virt_to_phys(pgd);
-- 
2.7.4


* [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Tian, Kevin, Raj Ashok, Liu, Jonathan Cameron

Nested translation mode is specified in the VT-d 3.0 spec, chapter 3.8.
With the PASID granular translation type set to 011b (nested), the
translation result from the first level (FL) is also subject to a
second level (SL) page table translation. This mode is used for SVA
virtualization, where the FL performs guest virtual to guest physical
translation and the SL performs guest physical to host physical
translation.

This patch adds a helper function for setting up nested translation,
where the second level comes from a domain and the first level comes
from a guest PGD.
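
For reference, the scalable mode PASID table entry is 512 bits wide,
stored as eight u64 words (pe->val[0..7]); a spec bit position N lands
in val[N / 64] at bit N % 64. A few of the helpers added below, spelled
out as a sanity check:

    /*
     * CD,   bit  89      -> val[1], bit  25  (89 - 64)
     * EMT,  bits 91:93   -> val[1], bits 27:29
     * PAT,  bits 96:127  -> val[1], bits 32:63
     * FLPM, bits 130:131 -> val[2], bits 2:3
     * EAFE, bit  135     -> val[2], bit  7   (135 - 128)
     */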

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 240 +++++++++++++++++++++++++++++++++++++++++++-
 drivers/iommu/intel-pasid.h |  12 +++
 include/linux/intel-iommu.h |   3 +
 3 files changed, 252 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 9bdb7ee228b6..10c7856afc6b 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
 	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
 }
 
+/*
+ * Setup the Extended Memory Type (EMT) field (Bits 91-93)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emt(struct pasid_entry *pe, u64 value)
+{
+	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
+}
+
+/*
+ * Setup the Page Attribute Table (PAT) field (Bits 96-127)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pat(struct pasid_entry *pe, u64 value)
+{
+	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
+}
+
+/*
+ * Setup the Cache Disable (CD) field (Bit 89)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_cd(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
+}
+
+/*
+ * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emte(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
+}
+
+/*
+ * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_eafe(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+}
+
+/*
+ * Setup the Page-level Cache Disable (PCD) field (Bit 95)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pcd(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
+}
+
+/*
+ * Setup the Page-level Write-Through (PWT) field (Bit 94)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pwt(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
+}
+
 static void
 pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 				    u16 did, int pasid)
@@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
 	/* Setup Present and PASID Granular Transfer Type: */
-	pasid_set_translation_type(pte, 1);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
 	pasid_set_present(pte);
 	pasid_flush_caches(iommu, pte, pasid, did);
 
@@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pasid_set_domain_id(pte, did);
 	pasid_set_slptr(pte, pgd_val);
 	pasid_set_address_width(pte, agaw);
-	pasid_set_translation_type(pte, 2);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
@@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_address_width(pte, iommu->agaw);
-	pasid_set_translation_type(pte, 4);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
@@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 
 	return 0;
 }
+
+static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
+				struct pasid_entry *pte,
+				struct iommu_gpasid_bind_data_vtd *pasid_data)
+{
+	/*
+	 * Not all guest PASID table entry fields are passed down during bind;
+	 * here we only set up the ones that are dependent on guest settings.
+	 * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
+	 * therefore not set. Other fields, such as snoop related, are set based
+	 * on host needs regardless of guest settings.
+	 */
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
+		if (!ecap_srs(iommu->ecap)) {
+			pr_err("No supervisor request support on %s\n",
+			       iommu->name);
+			return -EINVAL;
+		}
+		pasid_set_sre(pte);
+	}
+
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
+		if (!ecap_eafs(iommu->ecap)) {
+			pr_err("No extended access flag support on %s\n",
+				iommu->name);
+			return -EINVAL;
+		}
+		pasid_set_eafe(pte);
+	}
+
+	/*
+	 * Memory type is only applicable to devices inside the processor
+	 * coherent domain. PCIe devices are not included. We can skip the
+	 * rest of the flags if the IOMMU does not support MTS.
+	 */
+	if (ecap_mts(iommu->ecap)) {
+		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
+			pasid_set_emte(pte);
+			pasid_set_emt(pte, pasid_data->emt);
+		}
+		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
+			pasid_set_pcd(pte);
+		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
+			pasid_set_pwt(pte);
+		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
+			pasid_set_cd(pte);
+		pasid_set_pat(pte, pasid_data->pat);
+	} else if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
+		pr_err("No memory type support for bind guest PASID on %s\n",
+			iommu->name);
+		return -EINVAL;
+	}
+
+	return 0;
+
+}
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
+ * This could be used for guest shared virtual address. In this case, the
+ * first level page tables are used for GVA-GPA translation in the guest,
+ * while second level page tables are used for GPA-HPA translation.
+ *
+ * @iommu:      IOMMU which the device belongs to
+ * @dev:        Device to be set up for translation
+ * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
+ * @pasid:      PASID to be programmed in the device PASID table
+ * @pasid_data: Additional PASID info from the guest bind request
+ * @domain:     Domain info for setting up second level page tables
+ * @addr_width: Address width of the first level (guest)
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *gpgd,
+			int pasid, struct iommu_gpasid_bind_data_vtd *pasid_data,
+			struct dmar_domain *domain,
+			int addr_width)
+{
+	struct pasid_entry *pte;
+	struct dma_pte *pgd;
+	int ret = 0;
+	u64 pgd_val;
+	int agaw;
+	u16 did;
+
+	if (!ecap_nest(iommu->ecap)) {
+		pr_err("IOMMU: %s: No nested translation support\n",
+		       iommu->name);
+		return -EINVAL;
+	}
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	/*
+	 * Caller must ensure the PASID entry is not in use, i.e. do not
+	 * bind the same PASID to the same device twice.
+	 */
+	if (pasid_pte_is_present(pte))
+		return -EBUSY;
+
+	pasid_clear_entry(pte);
+
+	/* Sanity checking performed by caller to make sure address
+	 * width matching in two dimensions:
+	 * 1. CPU vs. IOMMU
+	 * 2. Guest vs. Host.
+	 */
+	switch (addr_width) {
+	case ADDR_WIDTH_5LEVEL:
+		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+			cap_5lp_support(iommu->cap)) {
+			pasid_set_flpm(pte, 1);
+		} else {
+			dev_err(dev, "5-level paging not supported\n");
+			return -EINVAL;
+		}
+		break;
+	case ADDR_WIDTH_4LEVEL:
+		pasid_set_flpm(pte, 0);
+		break;
+	default:
+		dev_err(dev, "Invalid guest address width %d\n", addr_width);
+		return -EINVAL;
+	}
+
+	/* First level PGD is in GPA, must be supported by the second level */
+	if ((u64)gpgd > domain->max_addr) {
+		dev_err(dev, "Guest PGD %llx not supported, max %llx\n",
+			(u64)gpgd, domain->max_addr);
+		return -EINVAL;
+	}
+	pasid_set_flptr(pte, (u64)gpgd);
+
+	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
+	if (ret) {
+		dev_err(dev, "Guest PASID bind data not supported\n");
+		return ret;
+	}
+
+	/* Setup the second level based on the given domain */
+	pgd = domain->pgd;
+
+	agaw = iommu_skip_agaw(domain, iommu, &pgd);
+	if (agaw < 0) {
+		dev_err(dev, "Invalid domain page table\n");
+		return -EINVAL;
+	}
+	pgd_val = virt_to_phys(pgd);
+	pasid_set_slptr(pte, pgd_val);
+	pasid_set_fault_enable(pte);
+
+	did = domain->iommu_did[iommu->seq_id];
+	pasid_set_domain_id(pte, did);
+
+	pasid_set_address_width(pte, agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+	pasid_flush_caches(iommu, pte, pasid, did);
+
+	return ret;
+}
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 92de6df24ccb..698015ee3f04 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -36,6 +36,7 @@
  * to vmalloc or even module mappings.
  */
 #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
+#define PASID_FLAG_NESTED		BIT(1)
 
 /*
  * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
@@ -51,6 +52,11 @@ struct pasid_entry {
 	u64 val[8];
 };
 
+#define PASID_ENTRY_PGTT_FL_ONLY	(1)
+#define PASID_ENTRY_PGTT_SL_ONLY	(2)
+#define PASID_ENTRY_PGTT_NESTED		(3)
+#define PASID_ENTRY_PGTT_PT		(4)
+
 /* The representative of a PASID table */
 struct pasid_table {
 	void			*table;		/* pasid table pointer */
@@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, int pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *pgd,
+			int pasid,
+			struct iommu_gpasid_bind_data_vtd *pasid_data,
+			struct dmar_domain *domain,
+			int addr_width);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ed7171d2ae1f..eda1d6687144 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -42,6 +42,9 @@
 #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define ADDR_WIDTH_5LEVEL	(57)
+#define ADDR_WIDTH_4LEVEL	(48)
+
 #define CONTEXT_TT_MULTI_LEVEL	0
 #define CONTEXT_TT_DEV_IOTLB	1
 #define CONTEXT_TT_PASS_THROUGH 2
-- 
2.7.4


* [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Tian, Kevin, Raj Ashok, Liu, Jonathan Cameron

When supporting guest SVA with an emulated IOMMU, the guest PASID
table is shadowed in the VMM. Updates to the guest vIOMMU PASID table
result in PASID cache flushes, which are passed down to the host as
bind guest PASID calls.

The SL page tables are harvested from the device's default domain
(requests without PASID), or from the aux domain in the case of a
mediated device.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
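
The resulting call flow, end to end (the QEMU/VFIO pieces are posted
separately and shown here only for orientation):

    guest PASID cache flush, trapped by QEMU/vIOMMU
      -> VFIO ioctl carrying struct iommu_gpasid_bind_data
        -> iommu_sva_bind_gpasid(domain, dev, data)
          -> intel_svm_bind_gpasid()        [this patch]
            -> intel_pasid_setup_nested()   [patch 05/11]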

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c |   4 +
 drivers/iommu/intel-svm.c   | 224 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |   8 +-
 include/linux/intel-svm.h   |  17 ++++
 4 files changed, 252 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e599b2537b1c..b1477cd423dd 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.sva_bind_gpasid	= intel_svm_bind_gpasid,
+	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
+#endif
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d7f2a5358900..47c0deb5ae56 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
 	list_for_each_entry((sdev), &(svm)->devs, list)	\
 		if ((d) != (sdev)->dev) {} else
 
+int intel_svm_bind_gpasid(struct iommu_domain *domain,
+			struct device *dev,
+			struct iommu_gpasid_bind_data *data)
+{
+	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+	struct dmar_domain *ddomain;
+	struct intel_svm_dev *sdev;
+	struct intel_svm *svm;
+	int ret = 0;
+
+	if (WARN_ON(!iommu) || !data)
+		return -EINVAL;
+
+	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
+	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
+		return -EINVAL;
+
+	if (dev_is_pci(dev)) {
+		/* VT-d supports devices with full 20 bit PASIDs only */
+		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
+			return -EINVAL;
+	} else {
+		return -ENOTSUPP;
+	}
+
+	/*
+	 * We only check host PASID range, we have no knowledge to check
+	 * guest PASID range nor do we use the guest PASID.
+	 */
+	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
+		return -EINVAL;
+
+	ddomain = to_dmar_domain(domain);
+
+	/* Sanity check paging mode support match between host and guest */
+	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
+	    !cap_5lp_support(iommu->cap)) {
+		pr_err("Cannot support 5 level paging requested by guest!\n");
+		return -EINVAL;
+	}
+
+	mutex_lock(&pasid_mutex);
+	svm = ioasid_find(NULL, data->hpasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
+	if (svm) {
+		/*
+		 * If we found svm for the PASID, there must be at
+		 * least one device bound; otherwise svm should have been freed.
+		 */
+		if (WARN_ON(list_empty(&svm->devs))) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (svm->mm == get_task_mm(current) &&
+		    data->hpasid == svm->pasid &&
+		    data->gpasid == svm->gpasid) {
+			pr_warn("Cannot bind the same guest-host PASID for the same process\n");
+			mmput(svm->mm);
+			ret = -EINVAL;
+			goto out;
+		}
+		mmput(current->mm);
+
+		for_each_svm_dev(sdev, svm, dev) {
+			/* In case of multiple sub-devices of the same pdev
+			 * assigned, we should allow multiple bind calls with
+			 * the same PASID and pdev.
+			 */
+			sdev->users++;
+			goto out;
+		}
+	} else {
+		/* We come here when the PASID has never been bound to a device. */
+		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+		if (!svm) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		/* REVISIT: the upper layer/VFIO can track the host process that binds the PASID.
+		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
+		 * ownership.
+		 */
+		svm->mm = get_task_mm(current);
+		svm->pasid = data->hpasid;
+		if (data->flags & IOMMU_SVA_GPASID_VAL) {
+			svm->gpasid = data->gpasid;
+			svm->flags |= SVM_FLAG_GUEST_PASID;
+		}
+		ioasid_set_data(data->hpasid, svm);
+		INIT_LIST_HEAD_RCU(&svm->devs);
+		mmput(svm->mm);
+	}
+	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+	if (!sdev) {
+		if (list_empty(&svm->devs)) {
+			ioasid_set_data(data->hpasid, NULL);
+			kfree(svm);
+		}
+		ret = -ENOMEM;
+		goto out;
+	}
+	sdev->dev = dev;
+	sdev->users = 1;
+
+	/* Set up device context entry for PASID if not enabled already */
+	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+	if (ret) {
+		dev_err(dev, "Failed to enable PASID capability\n");
+		kfree(sdev);
+		/*
+		 * If this is a new PASID that was never bound to a device, then
+		 * the device list must be empty, which indicates struct svm
+		 * was allocated in this function.
+		 */
+		if (list_empty(&svm->devs)) {
+			ioasid_set_data(data->hpasid, NULL);
+			kfree(svm);
+		}
+		goto out;
+	}
+
+	/*
+	 * For guest bind, we need to set up PASID table entry as follows:
+	 * - FLPM matches guest paging mode
+	 * - turn on nested mode
+	 * - SL guest address width matching
+	 */
+	ret = intel_pasid_setup_nested(iommu,
+				       dev,
+				       (pgd_t *)data->gpgd,
+				       data->hpasid,
+				       &data->vtd,
+				       ddomain,
+				       data->addr_width);
+	if (ret) {
+		dev_err(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
+			data->hpasid, ret);
+		/*
+		 * PASID entry should be in cleared state if nested mode
+		 * set up failed. So we only need to clear IOASID tracking
+		 * data such that free call will succeed.
+		 */
+		kfree(sdev);
+		if (list_empty(&svm->devs)) {
+			ioasid_set_data(data->hpasid, NULL);
+			kfree(svm);
+		}
+		goto out;
+	}
+	svm->flags |= SVM_FLAG_GUEST_MODE;
+
+	init_rcu_head(&sdev->rcu);
+	list_add_rcu(&sdev->list, &svm->devs);
+ out:
+	mutex_unlock(&pasid_mutex);
+	return ret;
+}
+
+int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+{
+	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+	struct intel_svm_dev *sdev;
+	struct intel_svm *svm;
+	int ret = -EINVAL;
+
+	if (WARN_ON(!iommu))
+		return -EINVAL;
+
+	mutex_lock(&pasid_mutex);
+	svm = ioasid_find(NULL, pasid, NULL);
+	if (!svm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
+	for_each_svm_dev(sdev, svm, dev) {
+		ret = 0;
+		sdev->users--;
+		if (!sdev->users) {
+			list_del_rcu(&sdev->list);
+			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+			/* TODO: Drain in flight PRQ for the PASID since it
+			 * may get reused soon; we don't want to confuse it
+			 * with its previous life.
+			 * intel_svm_drain_prq(dev, pasid);
+			 */
+			kfree_rcu(sdev, rcu);
+
+			if (list_empty(&svm->devs)) {
+				/*
+				 * We do not free PASID here until explicit call
+				 * from VFIO to free. The PASID life cycle
+				 * management is largely tied to VFIO management
+				 * of assigned device life cycles. In case of
+				 * guest exit without an explicit free PASID call,
+				 * the responsibility lies in VFIO layer to free
+				 * the PASIDs allocated for the guest.
+				 * For security reasons, VFIO has to track the
+				 * PASID ownership per guest anyway to ensure
+				 * that PASID allocated by one guest cannot be
+				 * used by another.
+				 */
+				ioasid_set_data(pasid, NULL);
+				kfree(svm);
+			}
+		}
+		break;
+	}
+out:
+	mutex_unlock(&pasid_mutex);
+
+	return ret;
+}
+
 int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index eda1d6687144..85b05120940e 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device *dev);
 extern void intel_svm_check(struct intel_iommu *iommu);
 extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
-
+extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
+		struct device *dev, struct iommu_gpasid_bind_data *data);
+extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
 struct svm_dev_ops;
 
 struct intel_svm_dev {
@@ -698,9 +700,13 @@ struct intel_svm_dev {
 struct intel_svm {
 	struct mmu_notifier notifier;
 	struct mm_struct *mm;
+
 	struct intel_iommu *iommu;
 	int flags;
 	int pasid;
+	int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
+		     * to guest PASID mapping.
+		     */
 	struct list_head devs;
 	struct list_head list;
 };
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index d7c403d0dd27..c19690937540 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -44,6 +44,23 @@ struct svm_dev_ops {
  * do such IOTLB flushes automatically.
  */
 #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
+/*
+ * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds to a device.
+ * In this case the mm_struct is in the guest kernel or userspace; its life
+ * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
+ * means to bind/unbind guest CR3 with PASIDs allocated for a device.
+ */
+#define SVM_FLAG_GUEST_MODE	(1<<2)
+/*
+ * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
+ * which requires guest and host PASID translation in both directions. We keep
+ * track of guest PASID in order to provide lookup service to device drivers.
+ * One such example is a physical function (PF) driver that supports mediated
+ * device (mdev) assignment. Guest programming of mdev configuration space can
+ * only be done with guest PASID, therefore PF driver needs to find the matching
+ * host PASID to program the real hardware.
+ */
+#define SVM_FLAG_GUEST_PASID	(1<<3)
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 
-- 
2.7.4


* [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

When Shared Virtual Memory is exposed to a guest via a vIOMMU, scalable
mode IOTLB invalidations may be passed down from outside the IOMMU
subsystem. This patch adds invalidation functions that can be used for
the additional translation cache types.
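
As an informational aside, the VT-d spec 6.5.2.6 address/size encoding
used by the new qi_flush_dev_iotlb_pasid() conveys the range via the
least significant zero bit (LSZB) of the address field when the S bit
is set:

    /*
     * size_order = 0 -> S = 0, single 4KB page
     * size_order = 1 -> address bit 12 = 0 is the LSZB -> 2 pages (8KB)
     * size_order = 2 -> address bit 12 = 1, bit 13 = 0 -> 4 pages (16KB)
     */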

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v9 -> v10:
- Fixed an off-by-one in the PASID-based device IOTLB flush
- Addressed a review comment from Eric on v7 that was previously missed

---
 drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.c |  3 ++-
 include/linux/intel-iommu.h | 20 ++++++++++++++++----
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f77dae7ba7d4..4d6b7b5b37ee 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
+/* PASID-based device IOTLB Invalidate */
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+		u32 pasid,  u16 qdep, u64 addr, unsigned size_order, u64 granu)
+{
+	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
+	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
+
+	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
+		QI_DEV_IOTLB_PFSID(pfsid);
+	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
+
+	/*
+	 * If S bit is 0, we only flush a single page. If S bit is set,
+	 * the least significant zero bit indicates the invalidation address
+	 * range. VT-d spec 6.5.2.6.
+	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
+	 * size order = 0 is PAGE_SIZE 4KB
+	 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
+	 * ECAP.
+	 */
+	desc.qw1 |= addr & ~mask;
+	if (size_order)
+		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
+
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
+
+	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
+	qi_submit_sync(&desc, iommu);
+}
+
 /*
  * Disable Queued Invalidation interface.
  */
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 10c7856afc6b..9f6d07410722 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 {
 	struct qi_desc desc;
 
-	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
+	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
+		QI_PC_PASID(pasid) | QI_PC_TYPE;
 	desc.qw1 = 0;
 	desc.qw2 = 0;
 	desc.qw3 = 0;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 85b05120940e..43539713b3b3 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -334,7 +334,7 @@ enum {
 #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
 #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
 #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
-#define QI_IOTLB_AM(am)		(((u8)am))
+#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
 
 #define QI_CC_FM(fm)		(((u64)fm) << 48)
 #define QI_CC_SID(sid)		(((u64)sid) << 32)
@@ -353,16 +353,21 @@ enum {
 #define QI_PC_DID(did)		(((u64)did) << 16)
 #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
 
-#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS	0
+#define QI_PC_PASID_SEL		1
 
 #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
-#define QI_EIOTLB_AM(am)	(((u64)am))
+#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
 #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
 #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
 #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
 
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
+
 #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
 #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
 #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
@@ -662,8 +667,15 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
 extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask);
+
 void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
 		     unsigned long npages, bool ih);
+
+extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
+
+extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
+
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
 extern int dmar_ir_support(void);
-- 
2.7.4


* [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Liu, Jonathan Cameron

When Shared Virtual Address (SVA) is enabled for a guest OS via a
vIOMMU, we need to provide invalidation support at the IOMMU API and
driver level. This patch adds an Intel VT-d specific function that
implements the IOMMU passdown invalidate API for shared virtual
address.

The use case is supporting caching structure invalidation for assigned
SVM capable devices. The emulated IOMMU exposes queued invalidation
capability and passes down all descriptors from the guest to the
physical IOMMU.

The assumption is that the guest to host device ID mapping is resolved
prior to calling the IOMMU driver. Based on the device handle, the
host IOMMU driver can replace certain fields before submitting the
descriptor to the invalidation queue.
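
For example, a page selective invalidation within a PASID resolves
through the two lookup tables added below (indexing convention as in
the table comments) as:

    /* X = IOMMU_CACHE_INV_TYPE_IOTLB, Y = IOMMU_INV_GRANU_ADDR */
    inv_type_granu_map[X][Y]   == 1                  /* valid combination */
    inv_type_granu_table[X][Y] == QI_GRAN_PSI_PASID  /* VT-d encoding */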

---
v7 review comments addressed in v10
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b1477cd423dd..a76afb0fd51a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
 	aux_domain_remove_dev(to_dmar_domain(domain), dev);
 }
 
+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for assigned devices the guest
+ * owns the first level page tables. Invalidations of translation caches in the
+ * guest are trapped and passed down to the host.
+ *
+ * vIOMMU in the guest will only expose first level page tables, therefore
+ * we do not include IOTLB granularity for requests without PASID (second level).
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by iommu cache type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
+ *
+ * Granu_map array indicates validity of the table. 1: valid, 0: invalid
+ *
+ */
+const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
+	/*
+	 * PASID based IOTLB invalidation: PASID selective (per PASID),
+	 * page selective (address granularity)
+	 */
+	{0, 1, 1},
+	/* PASID based dev TLBs, only support all PASIDs or single PASID */
+	{1, 1, 0},
+	/* PASID cache */
+	{1, 1, 0}
+};
+
+const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
+	/* PASID based IOTLB */
+	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+	/* PASID based dev TLBs */
+	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
+	/* PASID cache */
+	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
+};
+
+static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
+{
+	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
+		!inv_type_granu_map[type][granu])
+		return -EINVAL;
+
+	*vtd_granu = inv_type_granu_table[type][granu];
+
+	return 0;
+}
+
+static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
+{
+	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
+
+	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+	 * The IOMMU cache invalidate API passes granu_size in bytes, and the
+	 * number of granules of contiguous memory.
+	 */
+	return order_base_2(nr_pages);
+}
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct intel_iommu *iommu;
+	unsigned long flags;
+	int cache_type;
+	u8 bus, devfn;
+	u16 did, sid;
+	int ret = 0;
+	u64 size = 0;
+
+	if (!inv_info || !dmar_domain ||
+		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+		return -EINVAL;
+
+	if (!dev || !dev_is_pci(dev))
+		return -ENODEV;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	spin_lock(&iommu->lock);
+	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+	if (!info) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	sid = PCI_DEVID(bus, devfn);
+
+	/* Size is only valid in non-PASID selective invalidation */
+	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
+		size = to_vtd_size(inv_info->addr_info.granule_size,
+				   inv_info->addr_info.nb_granules);
+
+	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
+		int granu = 0;
+		u64 pasid = 0;
+
+		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
+		if (ret) {
+			pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
+				inv_info->granularity);
+			break;
+		}
+
+		/* PASID is stored in different locations based on granularity */
+		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
+			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
+			pasid = inv_info->pasid_info.pasid;
+		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
+			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
+			pasid = inv_info->addr_info.pasid;
+		else {
+			pr_err("Cannot find PASID for given cache type and granularity\n");
+			break;
+		}
+
+		switch (BIT(cache_type)) {
+		case IOMMU_CACHE_INV_TYPE_IOTLB:
+			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&
+				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
+				pr_err("Address out of range, 0x%llx, size order %llu\n",
+					inv_info->addr_info.addr, size);
+				ret = -ERANGE;
+				goto out_unlock;
+			}
+
+			qi_flush_piotlb(iommu, did,
+					pasid,
+					mm_to_dma_pfn(inv_info->addr_info.addr),
+					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
+					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
+
+			/*
+			 * Always flush the device IOTLB if ATS is enabled. Since the guest
+			 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
+			 * down.
+			 */
+			if (info->ats_enabled) {
+				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+						pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			}
+			break;
+		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
+			if (info->ats_enabled) {
+				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+						inv_info->addr_info.pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			} else
+				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
+
+			break;
+		case IOMMU_CACHE_INV_TYPE_PASID:
+			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
+
+			break;
+		default:
+			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
+				cache_type);
+			ret = -EINVAL;
+		}
+	}
+out_unlock:
+	spin_unlock(&iommu->lock);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+#endif
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot, gfp_t gfp)
@@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
 #ifdef CONFIG_INTEL_IOMMU_SVM
+	.cache_invalidate	= intel_iommu_sva_invalidate,
 	.sva_bind_gpasid	= intel_svm_bind_gpasid,
 	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
 #endif
-- 
2.7.4


* [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register
From: Jacob Pan @ 2020-03-20 23:27 UTC
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Virtual command registers are used in the guest only. To avoid vmexit
costs, we cache the virtual command capability register during
initialization.
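
A sketch of how a consumer might test the cached field (the real user
is the IOASID custom allocator added later in this series; this helper
is illustrative only):

    static bool vcmd_pasid_supported(struct intel_iommu *iommu)
    {
            /* Read the cached copy; no MMIO access, hence no vmexit
             * when running as a guest. */
            return !!vccap_pasid(iommu->vccap);
    }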

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

---
v7: Reviewed by Eric & Baolu
---
 drivers/iommu/dmar.c        | 1 +
 include/linux/intel-iommu.h | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 4d6b7b5b37ee..3b36491c8bbb 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
 		warn_invalid_dmar(phys_addr, " returns all ones");
 		goto unmap;
 	}
+	iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
 
 	/* the registers might be more than one page */
 	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 43539713b3b3..ccbf164fb711 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -194,6 +194,9 @@
 #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
 #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
 
+/* Virtual command interface capabilities */
+#define vccap_pasid(v)		((v & DMA_VCS_PAS)) /* PASID allocation */
+
 /* IOTLB_REG */
 #define DMA_TLB_FLUSH_GRANU_OFFSET  60
 #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
@@ -287,6 +290,7 @@
 
 /* PRS_REG */
 #define DMA_PRS_PPR	((u32)1)
+#define DMA_VCS_PAS	((u64)1)
 
 #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)			\
 do {									\
@@ -537,6 +541,7 @@ struct intel_iommu {
 	u64		reg_size; /* size of hw register set */
 	u64		cap;
 	u64		ecap;
+	u64		vccap;
 	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
 	raw_spinlock_t	register_lock; /* protect register handling */
 	int		seq_id;	/* sequence id of the iommu */
-- 
2.7.4


* [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation
  2020-03-20 23:27 [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (8 preceding siblings ...)
  2020-03-20 23:27 ` [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register Jacob Pan
@ 2020-03-20 23:27 ` Jacob Pan
  2020-03-28 10:08   ` Tian, Kevin
  2020-03-20 23:27 ` [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
  10 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-20 23:27 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

From: Lu Baolu <baolu.lu@linux.intel.com>

Enabling an IOMMU in a guest requires communication with the host
driver for certain aspects. Using PASIDs to enable Shared Virtual
Addressing (SVA) requires managing PASIDs in the host. The VT-d 3.0
spec provides a Virtual Command Register (VCMD) to facilitate this.
Writes to this register in the guest are trapped by QEMU, which
proxies the call to the host driver.

This virtual command interface consists of a capability register,
a virtual command register, and a virtual response register. Refer
to sections 10.4.42, 10.4.43, and 10.4.44 for more information.

This patch adds the enlightened PASID allocation/free interfaces
via the virtual command interface.
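
A minimal usage sketch of the two interfaces added below (illustrative
only; error handling trimmed):

	unsigned int pasid;

	/* The register writes are trapped by QEMU, which proxies the
	 * allocation to the host driver.
	 */
	if (!vcmd_alloc_pasid(iommu, &pasid)) {
		/* ... use the host-allocated PASID ... */
		vcmd_free_pasid(iommu, pasid);
	}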

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/iommu/intel-pasid.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.h | 13 ++++++++++-
 include/linux/intel-iommu.h |  1 +
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 9f6d07410722..e87ad67aad36 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -27,6 +27,63 @@
 static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
 
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
+{
+	unsigned long flags;
+	u8 status_code;
+	int ret = 0;
+	u64 res;
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	status_code = VCMD_VRSP_SC(res);
+	switch (status_code) {
+	case VCMD_VRSP_SC_SUCCESS:
+		*pasid = VCMD_VRSP_RESULT_PASID(res);
+		break;
+	case VCMD_VRSP_SC_NO_PASID_AVAIL:
+		pr_info("IOMMU: %s: No PASID available\n", iommu->name);
+		ret = -ENOSPC;
+		break;
+	default:
+		ret = -ENODEV;
+		pr_warn("IOMMU: %s: Unexpected error code %d\n",
+			iommu->name, status_code);
+	}
+
+	return ret;
+}
+
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
+{
+	unsigned long flags;
+	u8 status_code;
+	u64 res;
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG,
+		    VCMD_CMD_OPERAND(pasid) | VCMD_CMD_FREE);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	status_code = VCMD_VRSP_SC(res);
+	switch (status_code) {
+	case VCMD_VRSP_SC_SUCCESS:
+		break;
+	case VCMD_VRSP_SC_INVALID_PASID:
+		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
+		break;
+	default:
+		pr_warn("IOMMU: %s: Unexpected error code %d\n",
+			iommu->name, status_code);
+	}
+}
+
 /*
  * Per device pasid table management:
  */
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 698015ee3f04..cd3d63f3e936 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -23,6 +23,16 @@
 #define is_pasid_enabled(entry)		(((entry)->lo >> 3) & 0x1)
 #define get_pasid_dir_size(entry)	(1 << ((((entry)->lo >> 9) & 0x7) + 7))
 
+/* Virtual command interface for enlightened pasid management. */
+#define VCMD_CMD_ALLOC			0x1
+#define VCMD_CMD_FREE			0x2
+#define VCMD_VRSP_IP			0x1
+#define VCMD_VRSP_SC(e)			(((e) >> 1) & 0x3)
+#define VCMD_VRSP_SC_SUCCESS		0
+#define VCMD_VRSP_SC_NO_PASID_AVAIL	1
+#define VCMD_VRSP_SC_INVALID_PASID	1
+#define VCMD_VRSP_RESULT_PASID(e)	(((e) >> 8) & 0xfffff)
+#define VCMD_CMD_OPERAND(e)		((e) << 8)
 /*
  * Domain ID reserved for pasid entries programmed for first-level
  * only and pass-through transfer modes.
@@ -113,5 +123,6 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu,
 			int addr_width);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
-
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
 #endif /* __INTEL_PASID_H */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ccbf164fb711..9cbf5357138b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -169,6 +169,7 @@
 #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
 #define ecap_flts(e)		(((e) >> 47) & 0x1)
 #define ecap_slts(e)		(((e) >> 46) & 0x1)
+#define ecap_vcs(e)		(((e) >> 44) & 0x1)
 #define ecap_smts(e)		(((e) >> 43) & 0x1)
 #define ecap_dit(e)		((e >> 41) & 0x1)
 #define ecap_pasid(e)		((e >> 40) & 0x1)
-- 
2.7.4


* [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID
  2020-03-20 23:27 [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (9 preceding siblings ...)
  2020-03-20 23:27 ` [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation Jacob Pan
@ 2020-03-20 23:27 ` Jacob Pan
  2020-03-28 10:22   ` Tian, Kevin
  10 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-20 23:27 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Liu, Jonathan Cameron

When the VT-d driver runs in the guest, PASID allocation must be
performed via the virtual command interface. This patch registers a
custom IOASID allocator which takes precedence over the default
XArray-based allocator. The resulting IOASID allocation will always
come from the host. This ensures that the PASID namespace is
system-wide.
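
Once the allocator is registered, callers keep using the generic IOASID
API unchanged; only the backend differs. A sketch, assuming the
ioasid_alloc() signature from the already-merged IOASID core, with priv
standing for a caller-chosen private pointer:

	ioasid_t pasid;

	/* In a guest this routes to intel_ioasid_alloc(), which calls
	 * vcmd_alloc_pasid() so the PASID comes from the host.
	 */
	pasid = ioasid_alloc(NULL, PASID_MIN, intel_pasid_max_id - 1, priv);
	if (pasid == INVALID_IOASID)
		return -ENOSPC;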

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 84 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  2 ++
 2 files changed, 86 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a76afb0fd51a..c1c0b0fb93c3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
 		if (ecap_prs(iommu->ecap))
 			intel_svm_finish_prq(iommu);
 	}
+	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
+		ioasid_unregister_allocator(&iommu->pasid_allocator);
+
 #endif
 }
 
@@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 	return ret;
 }
 
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
+{
+	struct intel_iommu *iommu = data;
+	ioasid_t ioasid;
+
+	if (!iommu)
+		return INVALID_IOASID;
+	/*
+	 * VT-d virtual command interface always uses the full 20 bit
+	 * PASID range. Host can partition guest PASID range based on
+	 * policies but it is out of guest's control.
+	 */
+	if (min < PASID_MIN || max > intel_pasid_max_id)
+		return INVALID_IOASID;
+
+	if (vcmd_alloc_pasid(iommu, &ioasid))
+		return INVALID_IOASID;
+
+	return ioasid;
+}
+
+static void intel_ioasid_free(ioasid_t ioasid, void *data)
+{
+	struct intel_iommu *iommu = data;
+
+	if (!iommu)
+		return;
+	/*
+	 * Sanity checking of the ioasid owner is done at the upper layer,
+	 * e.g. VFIO. We can only free the PASID when all devices are unbound.
+	 */
+	if (ioasid_find(NULL, ioasid, NULL)) {
+		pr_alert("Cannot free active IOASID %d\n", ioasid);
+		return;
+	}
+	vcmd_free_pasid(iommu, ioasid);
+}
+
+static void register_pasid_allocator(struct intel_iommu *iommu)
+{
+	/*
+	 * If we are running in the host, there is no need for a custom
+	 * allocator; PASIDs are allocated from the host system-wide pool.
+	 */
+	if (!cap_caching_mode(iommu->cap))
+		return;
+
+	if (!sm_supported(iommu)) {
+		pr_warn("VT-d Scalable Mode not enabled, no PASID allocation\n");
+		return;
+	}
+
+	/*
+	 * Register a custom PASID allocator if we are running in a guest;
+	 * guest PASIDs must be obtained via the virtual command interface.
+	 * There can be multiple vIOMMUs in each guest but only one allocator
+	 * is active. All vIOMMU allocators eventually call the same host
+	 * allocator.
+	 */
+	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
+		pr_info("Register custom PASID allocator\n");
+		iommu->pasid_allocator.alloc = intel_ioasid_alloc;
+		iommu->pasid_allocator.free = intel_ioasid_free;
+		iommu->pasid_allocator.pdata = (void *)iommu;
+		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
+			pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
+			/*
+			 * Disable scalable mode on this IOMMU if there
+			 * is no custom allocator. Mixing SM-capable and
+			 * non-SM vIOMMUs is not supported.
+			 */
+			intel_iommu_sm = 0;
+		}
+	}
+}
+#endif
+
 static int __init init_dmars(void)
 {
 	struct dmar_drhd_unit *drhd;
@@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
 	 */
 	for_each_active_iommu(iommu, drhd) {
 		iommu_flush_write_buffer(iommu);
+#ifdef CONFIG_INTEL_IOMMU_SVM
+		register_pasid_allocator(iommu);
+#endif
 		iommu_set_root_entry(iommu);
 		iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
 		iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 9cbf5357138b..9c357a325c72 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/dmar.h>
+#include <linux/ioasid.h>
 
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
@@ -563,6 +564,7 @@ struct intel_iommu {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	struct page_req_dsc *prq;
 	unsigned char prq_name[16];    /* Name for PRQ interrupt */
+	struct ioasid_allocator_ops pasid_allocator; /* Custom allocator for PASIDs */
 #endif
 	struct q_inval  *qi;            /* Queued invalidation info */
 	u32 *iommu_state; /* Store iommu states between suspend and resume.*/
-- 
2.7.4


* Re: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
  2020-03-20 23:27 ` [PATCH V10 02/11] iommu/uapi: Define a mask for bind data Jacob Pan
@ 2020-03-22  1:29   ` Lu Baolu
  2020-03-23 19:37     ` Jacob Pan
  2020-03-27 11:50   ` Tian, Kevin
  2020-03-27 14:13   ` Auger Eric
  2 siblings, 1 reply; 67+ messages in thread
From: Lu Baolu @ 2020-03-22  1:29 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

On 2020/3/21 7:27, Jacob Pan wrote:
> Memory type related flags can be grouped together for one simple check.
> 
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>   include/uapi/linux/iommu.h | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>   	__u32 pat;
>   	__u32 emt;
>   };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> +					 IOMMU_SVA_VTD_GPASID_PWT)

As name implies, can this move to intel-iommu.h?

Best regards,
baolu

>   /**
>    * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
>    * @version:	Version of this data structure
> 

* Re: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
  2020-03-22  1:29   ` Lu Baolu
@ 2020-03-23 19:37     ` Jacob Pan
  2020-03-24  1:50       ` Lu Baolu
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-23 19:37 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Raj Ashok, David Woodhouse, iommu, LKML,
	Alex Williamson, Jean-Philippe Brucker, Jonathan Cameron

On Sun, 22 Mar 2020 09:29:32 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> On 2020/3/21 7:27, Jacob Pan wrote:
> > Memory type related flags can be grouped together for one simple
> > check.
> > 
> > ---
> > v9 renamed from EMT to MTS since these are memory type support
> > flags. ---
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >   include/uapi/linux/iommu.h | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 4ad3496e5c43..d7bcbc5f79b0 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
> >   	__u32 pat;
> >   	__u32 emt;
> >   };
> > -
> > +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> > +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> > +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> > +					 IOMMU_SVA_VTD_GPASID_PWT)
> 
> As name implies, can this move to intel-iommu.h?
> 
I also thought about this but the masks are in vendor specific part of
the UAPI.

> Best regards,
> baolu
> 
> >   /**
> >    * struct iommu_gpasid_bind_data - Information about device and
> > guest PASID binding
> >    * @version:	Version of this data structure
> >   

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
  2020-03-23 19:37     ` Jacob Pan
@ 2020-03-24  1:50       ` Lu Baolu
  0 siblings, 0 replies; 67+ messages in thread
From: Lu Baolu @ 2020-03-24  1:50 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On 2020/3/24 3:37, Jacob Pan wrote:
> On Sun, 22 Mar 2020 09:29:32 +0800, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> On 2020/3/21 7:27, Jacob Pan wrote:
>>> Memory type related flags can be grouped together for one simple
>>> check.
>>>
>>> ---
>>> v9 renamed from EMT to MTS since these are memory type support
>>> flags. ---
>>>
>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>> ---
>>>    include/uapi/linux/iommu.h | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 4ad3496e5c43..d7bcbc5f79b0 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>>>    	__u32 pat;
>>>    	__u32 emt;
>>>    };
>>> -
>>> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
>>> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
>>> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
>>> +					 IOMMU_SVA_VTD_GPASID_PWT)
>> As name implies, can this move to intel-iommu.h?
>>
> I also thought about this but the masks are in vendor specific part of
> the UAPI.
> 

I looked through this patch series. It looks good to me. I will do some
code style cleanup and take it to v5.7. I am not the right person to
decide whether include/uapi/linux/iommu.h is the right place for this,
so I will move it to Intel IOMMU driver for now.

Best regards,
baolu
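
For reference, the single check this mask enables appears later in the
series (quoted from patch 05/11):

	} else if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
		pr_err("No memory type support for bind guest PASID on %s\n",
			iommu->name);
		return -EINVAL;
	}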

* RE: [PATCH V10 01/11] iommu/vt-d: Move domain helper to header
  2020-03-20 23:27 ` [PATCH V10 01/11] iommu/vt-d: Move domain helper to header Jacob Pan
@ 2020-03-27 11:48   ` Tian, Kevin
  0 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2020-03-27 11:48 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Move domain helper to header to be used by SVA code.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>  drivers/iommu/intel-iommu.c | 6 ------
>  include/linux/intel-iommu.h | 6 ++++++
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 4be549478691..e599b2537b1c 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -446,12 +446,6 @@ static void init_translation_status(struct intel_iommu *iommu)
>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>  }
> 
> -/* Convert generic 'struct iommu_domain to private struct dmar_domain */
> -static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
> -{
> -	return container_of(dom, struct dmar_domain, domain);
> -}
> -
>  static int __init intel_iommu_setup(char *str)
>  {
>  	if (!str)
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 980234ae0312..ed7171d2ae1f 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -595,6 +595,12 @@ static inline void __iommu_flush_cache(
>  		clflush_cache_range(addr, size);
>  }
> 
> +/* Convert generic struct iommu_domain to private struct dmar_domain */
> +static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
> +{
> +	return container_of(dom, struct dmar_domain, domain);
> +}
> +
>  /*
>   * 0: readable
>   * 1: writable
> --
> 2.7.4

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

* RE: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
  2020-03-20 23:27 ` [PATCH V10 02/11] iommu/uapi: Define a mask for bind data Jacob Pan
  2020-03-22  1:29   ` Lu Baolu
@ 2020-03-27 11:50   ` Tian, Kevin
  2020-03-27 14:13   ` Auger Eric
  2 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2020-03-27 11:50 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Memory type related flags can be grouped together for one simple check.
> 
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/uapi/linux/iommu.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>  	__u32 pat;
>  	__u32 emt;
>  };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> +					 IOMMU_SVA_VTD_GPASID_PWT)
>  /**
>   * struct iommu_gpasid_bind_data - Information about device and guest
> PASID binding
>   * @version:	Version of this data structure
> --
> 2.7.4

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

* RE: [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw
  2020-03-20 23:27 ` [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw Jacob Pan
@ 2020-03-27 11:53   ` Tian, Kevin
  2020-03-29  7:20     ` Lu Baolu
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-27 11:53 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

could you elaborate in which scenario this helper function is required?
 
> ---
>  drivers/iommu/intel-pasid.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 22b30f10b396..191508c7c03e 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  }
> 
>  /*
> + * Skip top levels of page tables for iommu which has less agaw
> + * than default. Unnecessary for PT mode.
> + */
> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> +				  struct intel_iommu *iommu,
> +				  struct dma_pte **pgd)
> +{
> +	int agaw;
> +
> +	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> +		*pgd = phys_to_virt(dma_pte_addr(*pgd));
> +		if (!dma_pte_present(*pgd)) {
> +			return -EINVAL;
> +		}
> +	}
> +	pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n", __func__, (u64)*pgd,
> +		iommu->agaw, domain->agaw);
> +
> +	return agaw;
> +}
> +
> +/*
>   * Set up the scalable mode pasid entry for second only translation type.
>   */
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> --
> 2.7.4
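
To make the scenario concrete (an illustration distilled from the next
patch in the series, not code from this one): a domain may have been
built with more page-table levels than a given IOMMU supports, e.g. a
4-level domain->pgd attached to an IOMMU whose agaw covers only 3
levels. The helper walks down the extra top levels and returns the
adjusted agaw:

	struct dma_pte *pgd = domain->pgd;
	int agaw = iommu_skip_agaw(domain, iommu, &pgd);

	if (agaw < 0)
		return -EINVAL;	/* a skipped top-level entry was not present */
	/* pgd now points at the level matching iommu->agaw */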


* RE: [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL
  2020-03-20 23:27 ` [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL Jacob Pan
@ 2020-03-27 11:55   ` Tian, Kevin
  2020-03-27 16:05     ` Auger Eric
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-27 11:55 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-pasid.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 191508c7c03e..9bdb7ee228b6 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -544,17 +544,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  		return -EINVAL;
>  	}
> 
> -	/*
> -	 * Skip top levels of page tables for iommu which has less agaw
> -	 * than default. Unnecessary for PT mode.
> -	 */
>  	pgd = domain->pgd;
> -	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> -		pgd = phys_to_virt(dma_pte_addr(pgd));
> -		if (!dma_pte_present(pgd)) {
> -			dev_err(dev, "Invalid domain page table\n");
> -			return -EINVAL;
> -		}
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err(dev, "Invalid domain page table\n");
> +		return -EINVAL;
>  	}

ok, I see how it is used. possibly combine last and this one together since
it's mostly moving code...

> 
>  	pgd_val = virt_to_phys(pgd);
> --
> 2.7.4


* RE: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-20 23:27 ` [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function Jacob Pan
@ 2020-03-27 12:21   ` Tian, Kevin
  2020-03-29  8:03     ` Lu Baolu
  2020-03-29 11:35   ` Auger Eric
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-27 12:21 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.

now the spec is already at rev3.1 😊

> With PASID granular translation type set to 0x11b, translation
> result from the first level(FL) also subject to a second level(SL)
> page table translation. This mode is used for SVA virtualization,
> where FL performs guest virtual to guest physical translation and
> SL performs guest physical to host physical translation.
> 
> This patch adds a helper function for setting up nested translation
> where second level comes from a domain and first level comes from
> a guest PGD.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
> >  drivers/iommu/intel-pasid.c | 240 +++++++++++++++++++++++++++++++++++++++++++-
>  drivers/iommu/intel-pasid.h |  12 +++
>  include/linux/intel-iommu.h |   3 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 9bdb7ee228b6..10c7856afc6b 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
>  	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
>  }
> 
> +/*
> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> +}
> +
> +/*
> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> +}
> +
> +/*
> + * Setup the Cache Disable (CD) field (Bit 89)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_cd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> +}
> +
> +/*
> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emte(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> +}
> +
> +/*
> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_eafe(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> +}
> +
> +/*
> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pcd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> +}
> +
> +/*
> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pwt(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> +}
> +
>  static void
>  pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>  				    u16 did, int pasid)
> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
>  	/* Setup Present and PASID Granular Transfer Type: */
> -	pasid_set_translation_type(pte, 1);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
>  	pasid_set_present(pte);
>  	pasid_flush_caches(iommu, pte, pasid, did);
> 
> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_slptr(pte, pgd_val);
>  	pasid_set_address_width(pte, agaw);
> -	pasid_set_translation_type(pte, 2);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  	pasid_clear_entry(pte);
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_address_width(pte, iommu->agaw);
> -	pasid_set_translation_type(pte, 4);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> 
>  	return 0;
>  }
> +
> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> +				struct pasid_entry *pte,
> +				struct iommu_gpasid_bind_data_vtd *pasid_data)
> +{
> +	/*
> +	 * Not all guest PASID table entry fields are passed down during bind,
> +	 * here we only set up the ones that are dependent on guest settings.
> +	 * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
> +	 * therefore not set. Other fields, such as snoop related, are set based
> +	 * on host needs regardless of guest settings.
> +	 */
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> +		if (!ecap_srs(iommu->ecap)) {
> +			pr_err("No supervisor request support on %s\n",
> +			       iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_sre(pte);
> +	}
> +
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> +		if (!ecap_eafs(iommu->ecap)) {
> +			pr_err("No extended access flag support on %s\n",
> +				iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_eafe(pte);
> +	}
> +
> +	/*
> +	 * Memory type is only applicable to devices inside processor coherent
> +	 * domain. PCIe devices are not included. We can skip the rest of the
> +	 * flags if IOMMU does not support MTS.

when you say that PCIe devices are not included, is it simply for information,
or should we impose some check to make sure the path below is not applied to
them?

> +	 */
> +	if (ecap_mts(iommu->ecap)) {
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
> +			pasid_set_emte(pte);
> +			pasid_set_emt(pte, pasid_data->emt);
> +		}
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> +			pasid_set_pcd(pte);
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> +			pasid_set_pwt(pte);
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> +			pasid_set_cd(pte);
> +		pasid_set_pat(pte, pasid_data->pat);
> +	} else if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
> +		pr_err("No memory type support for bind guest PASID on %s\n",
> +			iommu->name);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +
> +}
> +
> +/**
> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
> + * This could be used for guest shared virtual address. In this case, the
> + * first level page tables are used for GVA-GPA translation in the guest,
> + * second level page tables are used for GPA-HPA translation.

GVA->GPA is just one example. It could be gIOVA->GPA too. Here the
point is that the first level is the translation table managed by the guest.

> + *
> + * @iommu:      IOMMU which the device belong to
> + * @dev:        Device to be set up for translation
> + * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
> + * @pasid:      PASID to be programmed in the device PASID table
> + * @pasid_data: Additional PASID info from the guest bind request
> + * @domain:     Domain info for setting up second level page tables
> + * @addr_width: Address width of the first level (guest)
> + */
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *gpgd,
> +			int pasid, struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width)
> +{
> +	struct pasid_entry *pte;
> +	struct dma_pte *pgd;
> +	int ret = 0;
> +	u64 pgd_val;
> +	int agaw;
> +	u16 did;
> +
> +	if (!ecap_nest(iommu->ecap)) {
> +		pr_err("IOMMU: %s: No nested translation support\n",
> +		       iommu->name);
> +		return -EINVAL;
> +	}
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;

should we have intel_pasid_get_entry() return an error which is then carried
here? Looking at that function, there could be error conditions for both an
invalid parameter and no memory...

> +
> +	/*
> +	 * Caller must ensure PASID entry is not in use, i.e. not bind the
> +	 * same PASID to the same device twice.
> +	 */
> +	if (pasid_pte_is_present(pte))
> +		return -EBUSY;

is any lock held outside of this function? curious whether any race
condition may happen in between.

> +
> +	pasid_clear_entry(pte);
> +
> +	/* Sanity checking performed by caller to make sure address
> +	 * width matching in two dimensions:
> +	 * 1. CPU vs. IOMMU
> +	 * 2. Guest vs. Host.
> +	 */
> +	switch (addr_width) {
> +	case ADDR_WIDTH_5LEVEL:
> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> +			cap_5lp_support(iommu->cap)) {
> +			pasid_set_flpm(pte, 1);

define a macro for 4lvl and 5lvl

> +		} else {
> +			dev_err(dev, "5-level paging not supported\n");
> +			return -EINVAL;
> +		}
> +		break;
> +	case ADDR_WIDTH_4LEVEL:
> +		pasid_set_flpm(pte, 0);
> +		break;
> +	default:
> +		dev_err(dev, "Invalid guest address width %d\n", addr_width);
> +		return -EINVAL;
> +	}
> +
> +	/* First level PGD is in GPA, must be supported by the second level */
> +	if ((u64)gpgd > domain->max_addr) {
> +		dev_err(dev, "Guest PGD %llx not supported, max %llx\n",
> +			(u64)gpgd, domain->max_addr);
> +		return -EINVAL;
> +	}
> +	pasid_set_flptr(pte, (u64)gpgd);
> +
> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> +	if (ret) {
> +		dev_err(dev, "Guest PASID bind data not supported\n");
> +		return ret;
> +	}
> +
> +	/* Setup the second level based on the given domain */
> +	pgd = domain->pgd;
> +
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err(dev, "Invalid domain page table\n");
> +		return -EINVAL;
> +	}
> +	pgd_val = virt_to_phys(pgd);
> +	pasid_set_slptr(pte, pgd_val);
> +	pasid_set_fault_enable(pte);
> +
> +	did = domain->iommu_did[iommu->seq_id];
> +	pasid_set_domain_id(pte, did);
> +
> +	pasid_set_address_width(pte, agaw);
> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> +
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> +	pasid_set_present(pte);
> +	pasid_flush_caches(iommu, pte, pasid, did);
> +
> +	return ret;
> +}
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 92de6df24ccb..698015ee3f04 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -36,6 +36,7 @@
>   * to vmalloc or even module mappings.
>   */
>  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> +#define PASID_FLAG_NESTED		BIT(1)
> 
>  /*
>   * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
> @@ -51,6 +52,11 @@ struct pasid_entry {
>  	u64 val[8];
>  };
> 
> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> +#define PASID_ENTRY_PGTT_NESTED		(3)
> +#define PASID_ENTRY_PGTT_PT		(4)
> +
>  /* The representative of a PASID table */
>  struct pasid_table {
>  	void			*table;		/* pasid table pointer */
> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, int pasid);
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *pgd,
> +			int pasid,
> +			struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
> 
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ed7171d2ae1f..eda1d6687144 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -42,6 +42,9 @@
>  #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
>  #define DMA_FL_PTE_XD		BIT_ULL(63)
> 
> +#define ADDR_WIDTH_5LEVEL	(57)
> +#define ADDR_WIDTH_4LEVEL	(48)
> +
>  #define CONTEXT_TT_MULTI_LEVEL	0
>  #define CONTEXT_TT_DEV_IOTLB	1
>  #define CONTEXT_TT_PASS_THROUGH 2
> --
> 2.7.4
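
For readers skimming the thread, the quoted intel_pasid_setup_nested()
programs the PASID entry roughly in this order (a distilled sketch of
the code above, not a drop-in implementation; five_level stands for the
addr_width == ADDR_WIDTH_5LEVEL case):

	pasid_clear_entry(pte);
	pasid_set_flpm(pte, five_level ? 1 : 0);	/* guest paging mode */
	pasid_set_flptr(pte, (u64)gpgd);		/* guest PGD, a GPA */
	intel_pasid_setup_bind_data(iommu, pte, pasid_data);
	pasid_set_slptr(pte, virt_to_phys(pgd));	/* host second level */
	pasid_set_fault_enable(pte);
	pasid_set_domain_id(pte, did);
	pasid_set_address_width(pte, agaw);
	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
	pasid_set_present(pte);
	pasid_flush_caches(iommu, pte, pasid, did);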


* Re: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data
  2020-03-20 23:27 ` [PATCH V10 02/11] iommu/uapi: Define a mask for bind data Jacob Pan
  2020-03-22  1:29   ` Lu Baolu
  2020-03-27 11:50   ` Tian, Kevin
@ 2020-03-27 14:13   ` Auger Eric
  2 siblings, 0 replies; 67+ messages in thread
From: Auger Eric @ 2020-03-27 14:13 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Hi Jacob,

On 3/21/20 12:27 AM, Jacob Pan wrote:
> Memory type related flags can be grouped together for one simple check.
> 
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> ---
>  include/uapi/linux/iommu.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>  	__u32 pat;
>  	__u32 emt;
>  };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> +					 IOMMU_SVA_VTD_GPASID_PWT)
>  /**
>   * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
>   * @version:	Version of this data structure
> 


* Re: [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types
  2020-03-20 23:27 ` [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types Jacob Pan
@ 2020-03-27 14:46   ` Auger Eric
  2020-03-30 23:28     ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-27 14:46 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Hi Jacob,

On 3/21/20 12:27 AM, Jacob Pan wrote:
> When Shared Virtual Memory is exposed to a guest via vIOMMU, scalable
> IOTLB invalidation may be passed down from outside IOMMU subsystems.
> This patch adds invalidation functions that can be used for additional
> translation cache types.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> ---
> v9 -> v10:
> Fix off by 1 in pasid device iotlb flush
> 
> Address v7 missed review from Eric
> 
> ---
>  drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel-pasid.c |  3 ++-
>  include/linux/intel-iommu.h | 20 ++++++++++++++++----
>  3 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index f77dae7ba7d4..4d6b7b5b37ee 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
> +/* PASID-based device IOTLB Invalidate */
> +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +		u32 pasid,  u16 qdep, u64 addr, unsigned size_order, u64 granu)
> +{
> +	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
> +	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
> +
> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> +		QI_DEV_IOTLB_PFSID(pfsid);
> +	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
> +
> +	/*
> +	 * If S bit is 0, we only flush a single page. If S bit is set,
> +	 * The least significant zero bit indicates the invalidation address
> +	 * range. VT-d spec 6.5.2.6.
> +	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
> +	 * size order = 0 is PAGE_SIZE 4KB
> +	 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
> +	 * ECAP.
> +	 */
> +	desc.qw1 |= addr & ~mask;
> +	if (size_order)
> +		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> +
> +	qi_submit_sync(&desc, iommu);
> +}
> +
> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> +{
> +	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
> +
> +	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
> +	qi_submit_sync(&desc, iommu);
> +}
> +
>  /*
>   * Disable Queued Invalidation interface.
>   */
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 10c7856afc6b..9f6d07410722 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>  {
>  	struct qi_desc desc;
>  
> -	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
> +	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
> +		QI_PC_PASID(pasid) | QI_PC_TYPE;
Just a nit, this fix is not documented in the commit message.

Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

>  	desc.qw1 = 0;
>  	desc.qw2 = 0;
>  	desc.qw3 = 0;
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 85b05120940e..43539713b3b3 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -334,7 +334,7 @@ enum {
>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
>  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
>  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
> -#define QI_IOTLB_AM(am)		(((u8)am))
> +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
>  
>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> @@ -353,16 +353,21 @@ enum {
>  #define QI_PC_DID(did)		(((u64)did) << 16)
>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>  
> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> +/* PASID cache invalidation granu */
> +#define QI_PC_ALL_PASIDS	0
> +#define QI_PC_PASID_SEL		1
>  
>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> -#define QI_EIOTLB_AM(am)	(((u64)am))
> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>  
> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		1
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> +
>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> @@ -662,8 +667,15 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  			  unsigned int size_order, u64 type);
>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask);
> +
>  void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>  		     unsigned long npages, bool ih);
> +
> +extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
> +
> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
> +
>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>  
>  extern int dmar_ir_support(void);
> 
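
As a worked example of the size encoding described in the quoted
comment (assuming size_order = 1, i.e. two 4KB pages, with
VTD_PAGE_SHIFT = 12):

	/* mask = 1UL << (12 + 1 - 1), i.e. bit 12                   */
	/* desc.qw1 |= addr & ~mask;     address bit 12 forced to 0  */
	/* desc.qw1 |= QI_DEV_EIOTLB_SIZE;  the S bit is set         */
	/* Per VT-d spec 6.5.2.6, S=1 with address bit 12 clear      */
	/* encodes an 8KB invalidation range at the masked address.  */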


* Re: [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL
  2020-03-27 11:55   ` Tian, Kevin
@ 2020-03-27 16:05     ` Auger Eric
  2020-03-29  7:35       ` Lu Baolu
  0 siblings, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-27 16:05 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

Hi Jacob,

On 3/27/20 12:55 PM, Tian, Kevin wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Sent: Saturday, March 21, 2020 7:28 AM
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> ---
>>  drivers/iommu/intel-pasid.c | 14 ++++----------
>>  1 file changed, 4 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
>> index 191508c7c03e..9bdb7ee228b6 100644
>> --- a/drivers/iommu/intel-pasid.c
>> +++ b/drivers/iommu/intel-pasid.c
>> @@ -544,17 +544,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>>  		return -EINVAL;
>>  	}
>>
>> -	/*
>> -	 * Skip top levels of page tables for iommu which has less agaw
>> -	 * than default. Unnecessary for PT mode.
>> -	 */
>>  	pgd = domain->pgd;
>> -	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
>> -		pgd = phys_to_virt(dma_pte_addr(pgd));
>> -		if (!dma_pte_present(pgd)) {
>> -			dev_err(dev, "Invalid domain page table\n");
>> -			return -EINVAL;
>> -		}
>> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
>> +	if (agaw < 0) {
>> +		dev_err(dev, "Invalid domain page table\n");
is the dev_err() really needed? I see in domain_setup_first_level()
there is none.
>> +		return -EINVAL;
>>  	}
> 
> ok, I see how it is used. possibly combine last and this one together since
> it's mostly moving code...

I tend to agree with Kevin. Maybe better to squash the two patches. Also not
sure the inline of iommu_skip_agaw() is meaningful then. Also add a commit
message to the resulting patch.

Note domain_setup_first_level() could also use the helper while we are at
it (if the declaration is moved somewhere common). Only the error code differs
in the !dma_pte_present(pgd) case, i.e. -ENOMEM. May be good to align.

Otherwise those stuff may be done in a fixup patch.

Thanks

Eric
> 
>>
>>  	pgd_val = virt_to_phys(pgd);
>> --
>> 2.7.4
> 


* RE: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-20 23:27 ` [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support Jacob Pan
@ 2020-03-28  8:02   ` Tian, Kevin
  2020-03-30 20:51     ` Jacob Pan
  2020-03-29 13:40   ` Auger Eric
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-28  8:02 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.
> 
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 224 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |   8 +-
>  include/linux/intel-svm.h   |  17 ++++
>  4 files changed, 252 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index e599b2537b1c..b1477cd423dd 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> +#endif
>  };
> 
>  static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index d7f2a5358900..47c0deb5ae56 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
>  	list_for_each_entry((sdev), &(svm)->devs, list)	\
>  		if ((d) != (sdev)->dev) {} else
> 
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +			struct device *dev,
> +			struct iommu_gpasid_bind_data *data)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct dmar_domain *ddomain;

What about the full name, e.g. dmar_domain? Though a bit longer, it is
clearer than ddomain.

> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = 0;
> +
> +	if (WARN_ON(!iommu) || !data)
> +		return -EINVAL;
> +
> +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> +		return -EINVAL;
> +
> +	if (dev_is_pci(dev)) {
> +		/* VT-d supports devices with full 20 bit PASIDs only */
> +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> +			return -EINVAL;
> +	} else {
> +		return -ENOTSUPP;
> +	}
> +
> +	/*
> +	 * We only check host PASID range, we have no knowledge to check
> +	 * guest PASID range nor do we use the guest PASID.
> +	 */
> +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> +		return -EINVAL;
> +
> +	ddomain = to_dmar_domain(domain);
> +
> +	/* Sanity check paging mode support match between host and guest */
> +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> +	    !cap_5lp_support(iommu->cap)) {
> +		pr_err("Cannot support 5 level paging requested by guest!\n");
> +		return -EINVAL;
> +	}

-ENOTSUPP?

> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, data->hpasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	if (svm) {
> +		/*
> +		 * If we found svm for the PASID, there must be at
> +		 * least one device bond, otherwise svm should be freed.
> +		 */
> +		if (WARN_ON(list_empty(&svm->devs))) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (svm->mm == get_task_mm(current) &&
> +		    data->hpasid == svm->pasid &&
> +		    data->gpasid == svm->gpasid) {
> +			pr_warn("Cannot bind the same guest-host PASID for the same process\n");

Sorry, I didn't get the rationale here. Isn't this branch for binding the same
PASID to multiple devices? In that case it is definitely binding the same
guest-host PASID for the same process. Otherwise, if hpasid is different you'll
hit a different intel_svm, while if gpasid is different how can you use
one intel_svm to hold multiple gpasids?

I feel the error condition should be the opposite, and I suppose
SVM_FLAG_GUEST_PASID should be verified before checking gpasid.

> +			mmput(svm->mm);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		mmput(current->mm);
> +
> +		for_each_svm_dev(sdev, svm, dev) {
> +			/* In case of multiple sub-devices of the same pdev
> +			 * assigned, we should allow multiple bind calls with
> +			 * the same PASID and pdev.

Does sub-device mean mdev? I didn't find such notation in current iommu
directory.

and to make it clearer, "In case of multiple mdevs of the same pdev assigned
to the same guest process".

> +			 */
> +			sdev->users++;
> +			goto out;
> +		}
> +	} else {
> +		/* We come here when PASID has never been bound to a device. */
> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> +		if (!svm) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		/* REVISIT: upper layer/VFIO can track the host process that binds
> +		 * the PASID. ioasid_set = mm might be sufficient for vfio to
> +		 * check pasid VMM ownership.
> +		 */

Above message is unclear about what should be revisited. Does it describe
the current implementation or the expected revision in the future? 

> +		svm->mm = get_task_mm(current);
> +		svm->pasid = data->hpasid;
> +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> +			svm->gpasid = data->gpasid;
> +			svm->flags |= SVM_FLAG_GUEST_PASID;
> +		}
> +		ioasid_set_data(data->hpasid, svm);
> +		INIT_LIST_HEAD_RCU(&svm->devs);
> +		mmput(svm->mm);
> +	}
> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> +	if (!sdev) {
> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sdev->dev = dev;
> +	sdev->users = 1;
> +
> +	/* Set up device context entry for PASID if not enabled already */
> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> +	if (ret) {
> +		dev_err(dev, "Failed to enable PASID capability\n");
> +		kfree(sdev);
> +		/*
> +		 * If this is a new PASID that was never bound to a device, then
> +		 * the device list must be empty, which indicates struct svm
> +		 * was allocated in this function.
> +		 */

the comment better move to the 1st occurrence when sdev allocation
fails. or even better put it in out label...

> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}
> +		goto out;
> +	}
> +
> +	/*
> +	 * For guest bind, we need to set up PASID table entry as follows:
> +	 * - FLPM matches guest paging mode
> +	 * - turn on nested mode
> +	 * - SL guest address width matching
> +	 */

looks above just explains the internal detail of intel_pasid_setup_nested,
which is not necessary to be here.

> +	ret = intel_pasid_setup_nested(iommu,
> +				       dev,
> +				       (pgd_t *)data->gpgd,
> +				       data->hpasid,
> +				       &data->vtd,
> +				       ddomain,
> +				       data->addr_width);

It's worth an explanation here that setup_nested is required for every
device (even when they share the same intel_svm) because we allocate the
pasid table per device. Otherwise one may mistakenly think (as I did) that
only the 1st device bound to a new hpasid requires this step. 😊
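
e.g. a comment along the lines of:

	/*
	 * PASID table is per device, so even devices sharing the same
	 * intel_svm need their own PASID entry set up for nested mode.
	 */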

> +	if (ret) {
> +		dev_err(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
> +			data->hpasid, ret);
> +		/*
> +		 * PASID entry should be in cleared state if nested mode
> +		 * set up failed. So we only need to clear IOASID tracking
> +		 * data such that free call will succeed.
> +		 */
> +		kfree(sdev);
> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}
> +		goto out;
> +	}
> +	svm->flags |= SVM_FLAG_GUEST_MODE;
> +
> +	init_rcu_head(&sdev->rcu);
> +	list_add_rcu(&sdev->list, &svm->devs);
> + out:
> +	mutex_unlock(&pasid_mutex);
> +	return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = -EINVAL;
> +
> +	if (WARN_ON(!iommu))
> +		return -EINVAL;
> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (!svm) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	for_each_svm_dev(sdev, svm, dev) {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {
> +			list_del_rcu(&sdev->list);
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> +			/* TODO: Drain in flight PRQ for the PASID since it
> +			 * may get reused soon, we don't want to
> +			 * confuse with its previous life.
> +			 * intel_svm_drain_prq(dev, pasid);
> +			 */
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				/*
> +				 * We do not free PASID here until explicit call
> +				 * from VFIO to free. The PASID life cycle
> +				 * management is largely tied to VFIO management
> +				 * of assigned device life cycles. In case of
> +				 * guest exit without an explicit free PASID call,
> +				 * the responsibility lies in VFIO layer to free
> +				 * the PASIDs allocated for the guest.
> +				 * For security reasons, VFIO has to track the
> +				 * PASID ownership per guest anyway to ensure
> +				 * that PASID allocated by one guest cannot be
> +				 * used by another.

As commented in other patches, VFIO is only one example user of this API... 

> +				 */
> +				ioasid_set_data(pasid, NULL);
> +				kfree(svm);
> +			}
> +		}
> +		break;
> +	}

What about the case where no device matches? An -EINVAL should also be
returned then.

> +out:
> +	mutex_unlock(&pasid_mutex);
> +
> +	return ret;
> +}
> +
>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
>  {
>  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index eda1d6687144..85b05120940e 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device *dev);
>  extern void intel_svm_check(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_gpasid_bind_data *data);
> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>  struct svm_dev_ops;
> 
>  struct intel_svm_dev {
> @@ -698,9 +700,13 @@ struct intel_svm_dev {
>  struct intel_svm {
>  	struct mmu_notifier notifier;
>  	struct mm_struct *mm;
> +
>  	struct intel_iommu *iommu;
>  	int flags;
>  	int pasid;
> +	int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
> +		     * to guest PASID mapping.
> +		     */

We don't need to highlight the identity vs. non-identity thing, since either
way shares the same infrastructure here, and it is not knowledge that the
kernel driver should assume.

>  	struct list_head devs;
>  	struct list_head list;
>  };
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index d7c403d0dd27..c19690937540 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>   * do such IOTLB flushes automatically.
>   */
>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds to a device.
> + * In this case the mm_struct is in the guest kernel or userspace, its life
> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
> + * means to bind/unbind guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE	(1<<2)
> +/*
> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
> + * which requires guest and host PASID translation in both directions. We keep
> + * track of guest PASID in order to provide lookup service to device drivers.
> + * One such example is a physical function (PF) driver that supports mediated
> + * device (mdev) assignment. Guest programming of mdev configuration space can
> + * only be done with guest PASID, therefore PF driver needs to find the matching
> + * host PASID to program the real hardware.
> + */
> +#define SVM_FLAG_GUEST_PASID	(1<<3)
> 
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> 
> --
> 2.7.4


* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-20 23:27 ` [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
@ 2020-03-28 10:01   ` Tian, Kevin
  2020-03-29 15:34     ` Auger Eric
                       ` (2 more replies)
  2020-03-29 16:05   ` Auger Eric
  1 sibling, 3 replies; 67+ messages in thread
From: Tian, Kevin @ 2020-03-28 10:01 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
> 
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue

emulated IOMMU -> vIOMMU, since virtio-iommu could use the
interface as well.

> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submitting to the
> invalidation queue.
> 
> ---
> v7 review fixed in v10
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 182 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index b1477cd423dd..a76afb0fd51a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>  }
> 
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> + * owns the first level page tables. Invalidations of translation caches in the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not include IOTLB granularity for request without PASID (second level).

I would revise above as "We do not support IOTLB granularity for request 
without PASID (second level), therefore any vIOMMU implementation that
exposes the SVA capability to the guest should only expose the first level
page tables, implying all invalidation requests from the guest will include
a valid PASID"

> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> +	/*
> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> +	 * page selective (address granularity)
> +	 */
> +	{0, 1, 1},
> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> +	{1, 1, 0},

Is this combination correct? When a single PASID is being specified, it is
essentially a page-selective invalidation since you need to provide Address
and Size.

> +	/* PASID cache */

PASID cache is fully managed by the host. Guest PASID cache invalidation
is interpreted by vIOMMU for bind and unbind operations. I don't think
we should accept any PASID cache invalidation from userspace or guest.

> +	{1, 1, 0}
> +};
> +
> +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> +	/* PASID based IOTLB */
> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> +	/* PASID based dev TLBs */
> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> +	/* PASID cache */
> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> +{
> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> +		!inv_type_granu_map[type][granu])
> +		return -EINVAL;
> +
> +	*vtd_granu = inv_type_granu_table[type][granu];
> +

btw do we really need both map and table here? Can't we just
use one table with unsupported granularity marked as a special
value?
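
e.g. a sketch, assuming none of the real granularity encodings is
negative (VTD_GRANU_INVALID is a made-up sentinel):

#define VTD_GRANU_INVALID	(-1)

const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
	{VTD_GRANU_INVALID, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, VTD_GRANU_INVALID},
	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, VTD_GRANU_INVALID},
};

static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
{
	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
	    inv_type_granu_table[type][granu] == VTD_GRANU_INVALID)
		return -EINVAL;

	*vtd_granu = inv_type_granu_table[type][granu];
	return 0;
}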

> +	return 0;
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> +	 * granu size in contiguous memory.
> +	 */
> +	return order_base_2(nr_pages);
> +}
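
For example: granu_size = 4KB and nr_granules = 512 gives nr_pages = 512,
and order_base_2(512) = 9, i.e. the 2MB encoding.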
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct intel_iommu *iommu;
> +	unsigned long flags;
> +	int cache_type;
> +	u8 bus, devfn;
> +	u16 did, sid;
> +	int ret = 0;
> +	u64 size = 0;
> +
> +	if (!inv_info || !dmar_domain ||
> +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> +		return -EINVAL;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	spin_lock(&iommu->lock);
> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +	if (!info) {
> +		ret = -EINVAL;
> +		goto out_unlock;

-ENOTSUPP?

> +	}
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +
> +	/* Size is only valid in non-PASID selective invalidation */
> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> +				   inv_info->addr_info.nb_granules);
> +
> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> +		int granu = 0;
> +		u64 pasid = 0;
> +
> +		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
> +		if (ret) {
> +			pr_err("Invalid cache type and granu combination %d/%d\n",
> +				cache_type, inv_info->granularity);
> +			break;
> +		}
> +
> +		/* PASID is stored in different locations based on granularity */
> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> +			pasid = inv_info->pasid_info.pasid;
> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> +			pasid = inv_info->addr_info.pasid;
> +		else {
> +			pr_err("Cannot find PASID for given cache type and granularity\n");
> +			break;
> +		}
> +
> +		switch (BIT(cache_type)) {
> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> +			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&

Shouldn't this be granularity == IOMMU_INV_GRANU_ADDR? Otherwise it's unclear
why IOMMU_INV_GRANU_DOMAIN also needs the size check.
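
i.e. (sketch):

			if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {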

> +				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> +					inv_info->addr_info.addr, size);
> +				ret = -ERANGE;
> +				goto out_unlock;
> +			}
> +
> +			qi_flush_piotlb(iommu, did, pasid,
> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> +
> +			/*
> +			 * Always flush device IOTLB if ATS is enabled since guest
> +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be
> +			 * passed down.
> +			 */

Does VT-d spec mention that no device IOTLB flush is required when CM=1?

> +			if (info->ats_enabled) {
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			}
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> +			if (info->ats_enabled) {
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						inv_info->addr_info.pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);

I'm confused here. There are two granularities allowed for devtlb, but here
you only handle one of them?
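
Presumably it should just reuse the pasid picked above based on the
granularity, something like (sketch):

				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
						pasid, info->ats_qdep,
						inv_info->addr_info.addr, size,
						granu);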

> +			} else
> +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> +
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_PASID:
> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> +

As commented earlier, we shouldn't allow userspace or the guest to invalidate
the PASID cache.
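
i.e. just reject it here (sketch):

		case IOMMU_CACHE_INV_TYPE_PASID:
			ret = -EINVAL;
			break;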

> +			break;
> +		default:
> +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> +				cache_type);
> +			ret = -EINVAL;
> +		}
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}
> +#endif
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot, gfp_t gfp)
> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>  #endif
> --
> 2.7.4


* RE: [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register
  2020-03-20 23:27 ` [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register Jacob Pan
@ 2020-03-28 10:04   ` Tian, Kevin
  2020-03-31 22:33     ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-28 10:04 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Virtual command registers are used in the guest only. To avoid the
> vmexit cost, we cache the capability and store it during initialization.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> 
> ---
> v7 Reviewed by Eric & Baolu
> ---
> ---
>  drivers/iommu/dmar.c        | 1 +
>  include/linux/intel-iommu.h | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 4d6b7b5b37ee..3b36491c8bbb 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
>  		warn_invalid_dmar(phys_addr, " returns all ones");
>  		goto unmap;
>  	}
> +	iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> 
>  	/* the registers might be more than one page */
>  	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 43539713b3b3..ccbf164fb711 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -194,6 +194,9 @@
>  #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
>  #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
> 
> +/* Virtual command interface capabilities */

capabilities -> capability

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> +#define vccap_pasid(v)		((v & DMA_VCS_PAS)) /* PASID
> allocation */
> +
>  /* IOTLB_REG */
>  #define DMA_TLB_FLUSH_GRANU_OFFSET  60
>  #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
> @@ -287,6 +290,7 @@
> 
>  /* PRS_REG */
>  #define DMA_PRS_PPR	((u32)1)
> +#define DMA_VCS_PAS	((u64)1)
> 
>  #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)
> 	\
>  do {									\
> @@ -537,6 +541,7 @@ struct intel_iommu {
>  	u64		reg_size; /* size of hw register set */
>  	u64		cap;
>  	u64		ecap;
> +	u64		vccap;
>  	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF
> */
>  	raw_spinlock_t	register_lock; /* protect register handling */
>  	int		seq_id;	/* sequence id of the iommu */
> --
> 2.7.4


* RE: [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation
  2020-03-20 23:27 ` [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation Jacob Pan
@ 2020-03-28 10:08   ` Tian, Kevin
  2020-03-31 22:37     ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-28 10:08 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> Enabling IOMMU in a guest requires communication with the host
> driver for certain aspects. Use of PASID ID to enable Shared Virtual
> Addressing (SVA) requires managing PASID's in the host. VT-d 3.0 spec
> provides a Virtual Command Register (VCMD) to facilitate this.
> Writes to this register in the guest are trapped by QEMU which
> proxies the call to the host driver.

Qemu -> vIOMMU

> 
> This virtual command interface consists of a capability register,
> a virtual command register, and a virtual response register. Refer
> to section 10.4.42, 10.4.43, 10.4.44 for more information.
> 
> This patch adds the enlightened PASID allocation/free interfaces
> via the virtual command interface.
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>  drivers/iommu/intel-pasid.c | 57
> +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel-pasid.h | 13 ++++++++++-
>  include/linux/intel-iommu.h |  1 +
>  3 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 9f6d07410722..e87ad67aad36 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -27,6 +27,63 @@
>  static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
> 
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
> +{
> +	unsigned long flags;
> +	u8 status_code;
> +	int ret = 0;
> +	u64 res;
> +
> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +		      !(res & VCMD_VRSP_IP), res);
> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> +	status_code = VCMD_VRSP_SC(res);
> +	switch (status_code) {
> +	case VCMD_VRSP_SC_SUCCESS:
> +		*pasid = VCMD_VRSP_RESULT_PASID(res);
> +		break;
> +	case VCMD_VRSP_SC_NO_PASID_AVAIL:
> +		pr_info("IOMMU: %s: No PASID available\n", iommu->name);
> +		ret = -ENOSPC;
> +		break;
> +	default:
> +		ret = -ENODEV;
> +		pr_warn("IOMMU: %s: Unexpected error code %d\n",
> +			iommu->name, status_code);
> +	}
> +
> +	return ret;
> +}
> +
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> +{
> +	unsigned long flags;
> +	u8 status_code;
> +	u64 res;
> +
> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG,
> +		    VCMD_CMD_OPERAND(pasid) | VCMD_CMD_FREE);
> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +		      !(res & VCMD_VRSP_IP), res);
> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> +	status_code = VCMD_VRSP_SC(res);
> +	switch (status_code) {
> +	case VCMD_VRSP_SC_SUCCESS:
> +		break;
> +	case VCMD_VRSP_SC_INVALID_PASID:
> +		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> +		break;
> +	default:
> +		pr_warn("IOMMU: %s: Unexpected error code %d\n",
> +			iommu->name, status_code);
> +	}
> +}
> +
>  /*
>   * Per device pasid table management:
>   */
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 698015ee3f04..cd3d63f3e936 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -23,6 +23,16 @@
>  #define is_pasid_enabled(entry)		(((entry)->lo >> 3) & 0x1)
>  #define get_pasid_dir_size(entry)	(1 << ((((entry)->lo >> 9) & 0x7) + 7))
> 
> +/* Virtual command interface for enlightened pasid management. */
> +#define VCMD_CMD_ALLOC			0x1
> +#define VCMD_CMD_FREE			0x2
> +#define VCMD_VRSP_IP			0x1
> +#define VCMD_VRSP_SC(e)			(((e) >> 1) & 0x3)
> +#define VCMD_VRSP_SC_SUCCESS		0
> +#define VCMD_VRSP_SC_NO_PASID_AVAIL	1
> +#define VCMD_VRSP_SC_INVALID_PASID	1
> +#define VCMD_VRSP_RESULT_PASID(e)	(((e) >> 8) & 0xfffff)
> +#define VCMD_CMD_OPERAND(e)		((e) << 8)
>  /*
>   * Domain ID reserved for pasid entries programmed for first-level
>   * only and pass-through transfer modes.
> @@ -113,5 +123,6 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu,
>  			int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
> -
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
>  #endif /* __INTEL_PASID_H */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ccbf164fb711..9cbf5357138b 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -169,6 +169,7 @@
>  #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
>  #define ecap_flts(e)		(((e) >> 47) & 0x1)
>  #define ecap_slts(e)		(((e) >> 46) & 0x1)
> +#define ecap_vcs(e)		(((e) >> 44) & 0x1)
>  #define ecap_smts(e)		(((e) >> 43) & 0x1)
>  #define ecap_dit(e)		((e >> 41) & 0x1)
>  #define ecap_pasid(e)		((e >> 40) & 0x1)
> --
> 2.7.4

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID
  2020-03-20 23:27 ` [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
@ 2020-03-28 10:22   ` Tian, Kevin
  2020-04-01 15:47     ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-28 10:22 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When VT-d driver runs in the guest, PASID allocation must be
> performed via virtual command interface. This patch registers a
> custom IOASID allocator which takes precedence over the default
> XArray based allocator. The resulting IOASID allocation will always
> come from the host. This ensures that PASID namespace is system-
> wide.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 84 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |  2 ++
>  2 files changed, 86 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a76afb0fd51a..c1c0b0fb93c3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>  		if (ecap_prs(iommu->ecap))
>  			intel_svm_finish_prq(iommu);
>  	}
> +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> +		ioasid_unregister_allocator(&iommu->pasid_allocator);
> +
>  #endif
>  }
> 
> @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct intel_iommu *iommu)
>  	return ret;
>  }
> 
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)

the name is too generic... can we add vcmd in the name to clarify
its purpose, e.g. intel_vcmd_ioasid_alloc?

> +{
> +	struct intel_iommu *iommu = data;
> +	ioasid_t ioasid;
> +
> +	if (!iommu)
> +		return INVALID_IOASID;
> +	/*
> +	 * VT-d virtual command interface always uses the full 20 bit
> +	 * PASID range. Host can partition guest PASID range based on
> +	 * policies but it is out of guest's control.
> +	 */
> +	if (min < PASID_MIN || max > intel_pasid_max_id)
> +		return INVALID_IOASID;
> +
> +	if (vcmd_alloc_pasid(iommu, &ioasid))
> +		return INVALID_IOASID;
> +
> +	return ioasid;
> +}
> +
> +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> +{
> +	struct intel_iommu *iommu = data;
> +
> +	if (!iommu)
> +		return;
> +	/*
> +	 * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> +	 * We can only free the PASID when all the devices are unbound.
> +	 */
> +	if (ioasid_find(NULL, ioasid, NULL)) {
> +		pr_alert("Cannot free active IOASID %d\n", ioasid);
> +		return;
> +	}

However the sanity check is not done in default_free. Is there a reason
why using vcmd adds such a new requirement?

> +	vcmd_free_pasid(iommu, ioasid);
> +}
> +
> +static void register_pasid_allocator(struct intel_iommu *iommu)
> +{
> +	/*
> +	 * If we are running in the host, no need for custom allocator
> +	 * in that PASIDs are allocated from the host system-wide.
> +	 */
> +	if (!cap_caching_mode(iommu->cap))
> +		return;

is it more accurate to check against vcmd capability?
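
i.e. something like (sketch, keying off the same condition used for
registration below):

	if (!ecap_vcs(iommu->ecap) || !vccap_pasid(iommu->vccap))
		return;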

> +
> +	if (!sm_supported(iommu)) {
> +		pr_warn("VT-d Scalable Mode not enabled, no PASID allocation\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Register a custom PASID allocator if we are running in a guest,
> +	 * guest PASID must be obtained via virtual command interface.
> +	 * There can be multiple vIOMMUs in each guest but only one allocator
> +	 * is active. All vIOMMU allocators will eventually be calling the same

which one? the first or last?

> +	 * host allocator.
> +	 */
> +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> +		pr_info("Register custom PASID allocator\n");
> +		iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> +		iommu->pasid_allocator.free = intel_ioasid_free;
> +		iommu->pasid_allocator.pdata = (void *)iommu;
> +		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
> +			pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
> +			/*
> +			 * Disable scalable mode on this IOMMU if there
> +			 * is no custom allocator. Mixing SM capable vIOMMU
> +			 * and non-SM vIOMMU are not supported.
> +			 */
> +			intel_iommu_sm = 0;

Since you register an allocator for every vIOMMU, does that mean previously
registered allocators should also be unregistered here?
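
e.g. on this failure path, something like (rough sketch; would need the
drhd list in scope):

	for_each_active_iommu(iommu, drhd)
		if (iommu->pasid_allocator.alloc)
			ioasid_unregister_allocator(&iommu->pasid_allocator);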

> +		}
> +	}
> +}
> +#endif
> +
>  static int __init init_dmars(void)
>  {
>  	struct dmar_drhd_unit *drhd;
> @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
>  	 */
>  	for_each_active_iommu(iommu, drhd) {
>  		iommu_flush_write_buffer(iommu);
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +		register_pasid_allocator(iommu);
> +#endif
>  		iommu_set_root_entry(iommu);
>  		iommu->flush.flush_context(iommu, 0, 0, 0,
> DMA_CCMD_GLOBAL_INVL);
>  		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> DMA_TLB_GLOBAL_FLUSH);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 9cbf5357138b..9c357a325c72 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -19,6 +19,7 @@
>  #include <linux/iommu.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/dmar.h>
> +#include <linux/ioasid.h>
> 
>  #include <asm/cacheflush.h>
>  #include <asm/iommu.h>
> @@ -563,6 +564,7 @@ struct intel_iommu {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	struct page_req_dsc *prq;
>  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> +	struct ioasid_allocator_ops pasid_allocator; /* Custom allocator for
> PASIDs */
>  #endif
>  	struct q_inval  *qi;            /* Queued invalidation info */
>  	u32 *iommu_state; /* Store iommu states between suspend and
> resume.*/
> --
> 2.7.4


* Re: [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw
  2020-03-27 11:53   ` Tian, Kevin
@ 2020-03-29  7:20     ` Lu Baolu
  2020-03-30 17:50       ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Lu Baolu @ 2020-03-29  7:20 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

On 2020/3/27 19:53, Tian, Kevin wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Sent: Saturday, March 21, 2020 7:28 AM
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> could you elaborate in which scenario this helper function is required?

I added below commit message:

     An Intel iommu domain uses 5-level page table by default. If the
     iommu that the domain tries to attach supports less page levels,
     the top level page tables should be skipped. Add a helper to do
     this so that it could be used in other places.

Best regards,
baolu

>   
>> ---
>>   drivers/iommu/intel-pasid.c | 22 ++++++++++++++++++++++
>>   1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
>> index 22b30f10b396..191508c7c03e 100644
>> --- a/drivers/iommu/intel-pasid.c
>> +++ b/drivers/iommu/intel-pasid.c
>> @@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct intel_iommu
>> *iommu,
>>   }
>>
>>   /*
>> + * Skip top levels of page tables for iommu which has less agaw
>> + * than default. Unnecessary for PT mode.
>> + */
>> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
>> +				  struct intel_iommu *iommu,
>> +				  struct dma_pte **pgd)
>> +{
>> +	int agaw;
>> +
>> +	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
>> +		*pgd = phys_to_virt(dma_pte_addr(*pgd));
>> +		if (!dma_pte_present(*pgd)) {
>> +			return -EINVAL;
>> +		}
>> +	}
>> +	pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n", __func__,
>> +		(u64)*pgd, iommu->agaw, domain->agaw);
>> +
>> +	return agaw;
>> +}
>> +
>> +/*
>>    * Set up the scalable mode pasid entry for second only translation type.
>>    */
>>   int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>> --
>> 2.7.4
> 

* Re: [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL
  2020-03-27 16:05     ` Auger Eric
@ 2020-03-29  7:35       ` Lu Baolu
  0 siblings, 0 replies; 67+ messages in thread
From: Lu Baolu @ 2020-03-29  7:35 UTC (permalink / raw)
  To: Auger Eric, Tian, Kevin, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

On 2020/3/28 0:05, Auger Eric wrote:
> Hi Jacob,
> 
> On 3/27/20 12:55 PM, Tian, Kevin wrote:
>>> From: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>> Sent: Saturday, March 21, 2020 7:28 AM
>>>
>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>> ---
>>>   drivers/iommu/intel-pasid.c | 14 ++++----------
>>>   1 file changed, 4 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
>>> index 191508c7c03e..9bdb7ee228b6 100644
>>> --- a/drivers/iommu/intel-pasid.c
>>> +++ b/drivers/iommu/intel-pasid.c
>>> @@ -544,17 +544,11 @@ int intel_pasid_setup_second_level(struct
>>> intel_iommu *iommu,
>>>   		return -EINVAL;
>>>   	}
>>>
>>> -	/*
>>> -	 * Skip top levels of page tables for iommu which has less agaw
>>> -	 * than default. Unnecessary for PT mode.
>>> -	 */
>>>   	pgd = domain->pgd;
>>> -	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
>>> -		pgd = phys_to_virt(dma_pte_addr(pgd));
>>> -		if (!dma_pte_present(pgd)) {
>>> -			dev_err(dev, "Invalid domain page table\n");
>>> -			return -EINVAL;
>>> -		}
>>> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
>>> +	if (agaw < 0) {
>>> +		dev_err(dev, "Invalid domain page table\n");
> Is the dev_err() really required? I see in domain_setup_first_level()
> there is none.
>>> +		return -EINVAL;
>>>   	}
>> OK, I see how it is used. Possibly combine the last patch and this one
>> together since it's mostly moving code...
> I tend to agree with Kevin. Maybe better to squash the 2 patches. Also not
> sure the inline of iommu_skip_agaw() is meaningful then. Also add a commit
> message on the resulting patch.
> 
> Note domain_setup_first_level() could also use the helper while we are at
> it (if the declaration is moved somewhere common). Only the error code
> differs in case of !dma_pte_present(pgd), i.e. -ENOMEM. May be good to align.
> 
> Otherwise those stuff may be done in a fixup patch.

Agreed. Will squash these 2 patches with a meaningful commit message. As
for using this helper in other files, like domain_setup_first_level(),
we need more review and test efforts, hence it's better to put it in a
followup patch.

Best regards,
baolu

* Re: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-27 12:21   ` Tian, Kevin
@ 2020-03-29  8:03     ` Lu Baolu
  2020-03-30 18:21       ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Lu Baolu @ 2020-03-29  8:03 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj, Ashok, Jonathan Cameron

On 2020/3/27 20:21, Tian, Kevin wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Sent: Saturday, March 21, 2020 7:28 AM
>>
>> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> 
> now the spec is already at rev3.1 😊

Updated.

> 
>> With PASID granular translation type set to 0x11b, translation
>> result from the first level(FL) also subject to a second level(SL)
>> page table translation. This mode is used for SVA virtualization,
>> where FL performs guest virtual to guest physical translation and
>> SL performs guest physical to host physical translation.
>>
>> This patch adds a helper function for setting up nested translation
>> where second level comes from a domain and first level comes from
>> a guest PGD.
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> ---
>>   drivers/iommu/intel-pasid.c | 240
>> +++++++++++++++++++++++++++++++++++++++++++-
>>   drivers/iommu/intel-pasid.h |  12 +++
>>   include/linux/intel-iommu.h |   3 +
>>   3 files changed, 252 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
>> index 9bdb7ee228b6..10c7856afc6b 100644
>> --- a/drivers/iommu/intel-pasid.c
>> +++ b/drivers/iommu/intel-pasid.c
>> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
>>   	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
>>   }
>>
>> +/*
>> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_emt(struct pasid_entry *pe, u64 value)
>> +{
>> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
>> +}
>> +
>> +/*
>> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_pat(struct pasid_entry *pe, u64 value)
>> +{
>> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
>> +}
>> +
>> +/*
>> + * Setup the Cache Disable (CD) field (Bit 89)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_cd(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
>> +}
>> +
>> +/*
>> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_emte(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
>> +}
>> +
>> +/*
>> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_eafe(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
>> +}
>> +
>> +/*
>> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_pcd(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
>> +}
>> +
>> +/*
>> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
>> + * of a scalable mode PASID entry.
>> + */
>> +static inline void
>> +pasid_set_pwt(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
>> +}
>> +
>>   static void
>>   pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>>   				    u16 did, int pasid)
>> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu
>> *iommu,
>>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>>
>>   	/* Setup Present and PASID Granular Transfer Type: */
>> -	pasid_set_translation_type(pte, 1);
>> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
>>   	pasid_set_present(pte);
>>   	pasid_flush_caches(iommu, pte, pasid, did);
>>
>> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
>> intel_iommu *iommu,
>>   	pasid_set_domain_id(pte, did);
>>   	pasid_set_slptr(pte, pgd_val);
>>   	pasid_set_address_width(pte, agaw);
>> -	pasid_set_translation_type(pte, 2);
>> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>>   	pasid_set_fault_enable(pte);
>>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>>
>> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
>> intel_iommu *iommu,
>>   	pasid_clear_entry(pte);
>>   	pasid_set_domain_id(pte, did);
>>   	pasid_set_address_width(pte, iommu->agaw);
>> -	pasid_set_translation_type(pte, 4);
>> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
>>   	pasid_set_fault_enable(pte);
>>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>>
>> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct
>> intel_iommu *iommu,
>>
>>   	return 0;
>>   }
>> +
>> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
>> +				struct pasid_entry *pte,
>> +				struct iommu_gpasid_bind_data_vtd
>> *pasid_data)
>> +{
>> +	/*
>> +	 * Not all guest PASID table entry fields are passed down during bind,
>> +	 * here we only set up the ones that are dependent on guest settings.
>> +	 * Execution related bits such as NXE, SMEP are not meaningful to
>> IOMMU,
>> +	 * therefore not set. Other fields, such as snoop related, are set
>> based
>> +	 * on host needs regardless of guest settings.
>> +	 */
>> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
>> +		if (!ecap_srs(iommu->ecap)) {
>> +			pr_err("No supervisor request support on %s\n",
>> +			       iommu->name);
>> +			return -EINVAL;
>> +		}
>> +		pasid_set_sre(pte);
>> +	}
>> +
>> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
>> +		if (!ecap_eafs(iommu->ecap)) {
>> +			pr_err("No extended access flag support on %s\n",
>> +				iommu->name);
>> +			return -EINVAL;
>> +		}
>> +		pasid_set_eafe(pte);
>> +	}
>> +
>> +	/*
>> +	 * Memory type is only applicable to devices inside processor
>> coherent
>> +	 * domain. PCIe devices are not included. We can skip the rest of the
>> +	 * flags if IOMMU does not support MTS.
> 
> When you say that PCI devices are not included, is it simply for information
> or should we impose some check to make sure the below path is not applied
> to them?

Jacob, does it work for you if I add below check?

	if (ecap_mts(iommu->ecap) && !dev_is_pci(dev))

Or, we need to remove this comment line?

> 
>> +	 */
>> +	if (ecap_mts(iommu->ecap)) {
>> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
>> +			pasid_set_emte(pte);
>> +			pasid_set_emt(pte, pasid_data->emt);
>> +		}
>> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
>> +			pasid_set_pcd(pte);
>> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
>> +			pasid_set_pwt(pte);
>> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
>> +			pasid_set_cd(pte);
>> +		pasid_set_pat(pte, pasid_data->pat);
>> +	} else if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK)
>> {
>> +		pr_err("No memory type support for bind guest PASID on %s\n",
>> +			iommu->name);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +
>> +}
>> +
>> +/**
>> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
>> + * This could be used for guest shared virtual address. In this case, the
>> + * first level page tables are used for GVA-GPA translation in the guest,
>> + * second level page tables are used for GPA-HPA translation.
> 
> GVA->GPA is just one example. It could be gIOVA->GPA too. Here the
> point is that the first level is the translation table managed by the guest.

Agreed.

> 
>> + *
>> + * @iommu:      IOMMU which the device belong to
>> + * @dev:        Device to be set up for translation
>> + * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
>> + * @pasid:      PASID to be programmed in the device PASID table
>> + * @pasid_data: Additional PASID info from the guest bind request
>> + * @domain:     Domain info for setting up second level page tables
>> + * @addr_width: Address width of the first level (guest)
>> + */
>> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
>> +			struct device *dev, pgd_t *gpgd,
>> +			int pasid, struct iommu_gpasid_bind_data_vtd
>> *pasid_data,
>> +			struct dmar_domain *domain,
>> +			int addr_width)
>> +{
>> +	struct pasid_entry *pte;
>> +	struct dma_pte *pgd;
>> +	int ret = 0;
>> +	u64 pgd_val;
>> +	int agaw;
>> +	u16 did;
>> +
>> +	if (!ecap_nest(iommu->ecap)) {
>> +		pr_err("IOMMU: %s: No nested translation support\n",
>> +		       iommu->name);
>> +		return -EINVAL;
>> +	}
>> +
>> +	pte = intel_pasid_get_entry(dev, pasid);
>> +	if (WARN_ON(!pte))
>> +		return -EINVAL;
> 
> Should we have intel_pasid_get_entry() return an error which is then carried
> here? Looking at that function, there could be error conditions for both an
> invalid parameter and no memory...

Agreed. Will do this in a followup patch.

> 
>> +
>> +	/*
>> +	 * Caller must ensure PASID entry is not in use, i.e. not bind the
>> +	 * same PASID to the same device twice.
>> +	 */
>> +	if (pasid_pte_is_present(pte))
>> +		return -EBUSY;
> 
> is any lock held outside of this function? curious whether any race
> condition may happen in between.

The pasid entry change should always be protected by iommu->lock.

> 
>> +
>> +	pasid_clear_entry(pte);
>> +
>> +	/* Sanity checking performed by caller to make sure address
>> +	 * width matching in two dimensions:
>> +	 * 1. CPU vs. IOMMU
>> +	 * 2. Guest vs. Host.
>> +	 */
>> +	switch (addr_width) {
>> +	case ADDR_WIDTH_5LEVEL:
>> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
>> +			cap_5lp_support(iommu->cap)) {
>> +			pasid_set_flpm(pte, 1);
> 
> define a macro for 4lvl and 5lvl
> 
>> +		} else {
>> +			dev_err(dev, "5-level paging not supported\n");
>> +			return -EINVAL;
>> +		}
>> +		break;
>> +	case ADDR_WIDTH_4LEVEL:
>> +		pasid_set_flpm(pte, 0);
>> +		break;
>> +	default:
>> +		dev_err(dev, "Invalid guest address width %d\n",
>> addr_width);
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* First level PGD is in GPA, must be supported by the second level */
>> +	if ((u64)gpgd > domain->max_addr) {
>> +		dev_err(dev, "Guest PGD %llx not supported, max %llx\n",
>> +			(u64)gpgd, domain->max_addr);
>> +		return -EINVAL;
>> +	}
>> +	pasid_set_flptr(pte, (u64)gpgd);
>> +
>> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
>> +	if (ret) {
>> +		dev_err(dev, "Guest PASID bind data not supported\n");
>> +		return ret;
>> +	}
>> +
>> +	/* Setup the second level based on the given domain */
>> +	pgd = domain->pgd;
>> +
>> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
>> +	if (agaw < 0) {
>> +		dev_err(dev, "Invalid domain page table\n");
>> +		return -EINVAL;
>> +	}
>> +	pgd_val = virt_to_phys(pgd);
>> +	pasid_set_slptr(pte, pgd_val);
>> +	pasid_set_fault_enable(pte);
>> +
>> +	did = domain->iommu_did[iommu->seq_id];
>> +	pasid_set_domain_id(pte, did);
>> +
>> +	pasid_set_address_width(pte, agaw);
>> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>> +
>> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
>> +	pasid_set_present(pte);
>> +	pasid_flush_caches(iommu, pte, pasid, did);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
>> index 92de6df24ccb..698015ee3f04 100644
>> --- a/drivers/iommu/intel-pasid.h
>> +++ b/drivers/iommu/intel-pasid.h
>> @@ -36,6 +36,7 @@
>>    * to vmalloc or even module mappings.
>>    */
>>   #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
>> +#define PASID_FLAG_NESTED		BIT(1)
>>
>>   /*
>>    * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
>> @@ -51,6 +52,11 @@ struct pasid_entry {
>>   	u64 val[8];
>>   };
>>
>> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
>> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
>> +#define PASID_ENTRY_PGTT_NESTED		(3)
>> +#define PASID_ENTRY_PGTT_PT		(4)
>> +
>>   /* The representative of a PASID table */
>>   struct pasid_table {
>>   	void			*table;		/* pasid table pointer */
>> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct
>> intel_iommu *iommu,
>>   int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>>   				   struct dmar_domain *domain,
>>   				   struct device *dev, int pasid);
>> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
>> +			struct device *dev, pgd_t *pgd,
>> +			int pasid,
>> +			struct iommu_gpasid_bind_data_vtd *pasid_data,
>> +			struct dmar_domain *domain,
>> +			int addr_width);
>>   void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>>   				 struct device *dev, int pasid);
>>
>> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
>> index ed7171d2ae1f..eda1d6687144 100644
>> --- a/include/linux/intel-iommu.h
>> +++ b/include/linux/intel-iommu.h
>> @@ -42,6 +42,9 @@
>>   #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
>>   #define DMA_FL_PTE_XD		BIT_ULL(63)
>>
>> +#define ADDR_WIDTH_5LEVEL	(57)
>> +#define ADDR_WIDTH_4LEVEL	(48)
>> +
>>   #define CONTEXT_TT_MULTI_LEVEL	0
>>   #define CONTEXT_TT_DEV_IOTLB	1
>>   #define CONTEXT_TT_PASS_THROUGH 2
>> --
>> 2.7.4
> 

Best regards,
baolu

* Re: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-20 23:27 ` [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function Jacob Pan
  2020-03-27 12:21   ` Tian, Kevin
@ 2020-03-29 11:35   ` Auger Eric
  2020-04-01 20:06     ` Jacob Pan
  1 sibling, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-29 11:35 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Tian, Kevin, Raj Ashok, Jonathan Cameron

Hi Jacob,

On 3/21/20 12:27 AM, Jacob Pan wrote:
> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> With PASID granular translation type set to 0x11b, translation
> result from the first level(FL) also subject to a second level(SL)
> page table translation. This mode is used for SVA virtualization,
> where FL performs guest virtual to guest physical translation and
> SL performs guest physical to host physical translation.
> 
> This patch adds a helper function for setting up nested translation
> where second level comes from a domain and first level comes from
> a guest PGD.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-pasid.c | 240 +++++++++++++++++++++++++++++++++++++++++++-
>  drivers/iommu/intel-pasid.h |  12 +++
>  include/linux/intel-iommu.h |   3 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 9bdb7ee228b6..10c7856afc6b 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
>  	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
>  }
>  
> +/*
> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> +}
> +
> +/*
> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> +}
> +
> +/*
> + * Setup the Cache Disable (CD) field (Bit 89)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_cd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> +}
> +
> +/*
> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emte(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> +}
> +
> +/*
> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_eafe(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> +}
> +
> +/*
> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pcd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> +}
> +
> +/*
> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pwt(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> +}
> +
>  static void
>  pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>  				    u16 did, int pasid)
> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>  
>  	/* Setup Present and PASID Granular Transfer Type: */
> -	pasid_set_translation_type(pte, 1);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
>  	pasid_set_present(pte);
>  	pasid_flush_caches(iommu, pte, pasid, did);
>  
> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_slptr(pte, pgd_val);
>  	pasid_set_address_width(pte, agaw);
> -	pasid_set_translation_type(pte, 2);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>  
> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  	pasid_clear_entry(pte);
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_address_width(pte, iommu->agaw);
> -	pasid_set_translation_type(pte, 4);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));

All above looks good to me
>  
> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  
>  	return 0;
>  }
> +
> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> +				struct pasid_entry *pte,
> +				struct iommu_gpasid_bind_data_vtd *pasid_data)
> +{
> +	/*
> +	 * Not all guest PASID table entry fields are passed down during bind,
> +	 * here we only set up the ones that are dependent on guest settings.
> +	 * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
> +	 * therefore not set. Other fields, such as snoop related, are set based
> +	 * on host needs regardless of guest settings.
> +	 */
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> +		if (!ecap_srs(iommu->ecap)) {
> +			pr_err("No supervisor request support on %s\n",
> +			       iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_sre(pte);
> +	}
> +
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> +		if (!ecap_eafs(iommu->ecap)) {
> +			pr_err("No extended access flag support on %s\n",
> +				iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_eafe(pte);
> +	}
> +
> +	/*
> +	 * Memory type is only applicable to devices inside processor coherent
> +	 * domain. PCIe devices are not included. We can skip the rest of the
> +	 * flags if IOMMU does not support MTS.
> +	 */
nit:
	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
		return 0;

	if (!ecap_mts(iommu->ecap)) {
		pr_err("No memory type support for bind guest PASID on %s\n",
		       iommu->name);
		return -EINVAL;
	}

	settings ../..
> +	if (ecap_mts(iommu->ecap)) {
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
> +			pasid_set_emte(pte);
> +			pasid_set_emt(pte, pasid_data->emt);
> +		}
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> +			pasid_set_pcd(pte);
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> +			pasid_set_pwt(pte);
> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> +			pasid_set_cd(pte);
> +		pasid_set_pat(pte, pasid_data->pat);
> +	} else if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
> +		pr_err("No memory type support for bind guest PASID on %s\n",
> +			iommu->name);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +
> +}
> +
> +/**
> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
> + * This could be used for guest shared virtual address. In this case, the
> + * first level page tables are used for GVA-GPA translation in the guest,
> + * second level page tables are used for GPA-HPA translation.
> + *
> + * @iommu:      IOMMU which the device belong to
> + * @dev:        Device to be set up for translation
> + * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
> + * @pasid:      PASID to be programmed in the device PASID table
> + * @pasid_data: Additional PASID info from the guest bind request
> + * @domain:     Domain info for setting up second level page tables
> + * @addr_width: Address width of the first level (guest)
> + */
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *gpgd,
> +			int pasid, struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width)
> +{
> +	struct pasid_entry *pte;
> +	struct dma_pte *pgd;
> +	int ret = 0;
> +	u64 pgd_val;
> +	int agaw;
> +	u16 did;
> +
> +	if (!ecap_nest(iommu->ecap)) {
> +		pr_err("IOMMU: %s: No nested translation support\n",
> +		       iommu->name);
> +		return -EINVAL;
> +	}
I am surprised you don't check that the dmar_domain has the
DOMAIN_FLAG_NESTED_MODE flag set (or I missed it). Don't you run the risk
that userspace overwrites the PTE of a device attached to a regular
domain, i.e. one fully handled by the host?
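Something along those lines maybe (untested sketch, assuming the flag is
set on the dmar_domain when it is configured for nesting):

	if (!(domain->flags & DOMAIN_FLAG_NESTED_MODE)) {
		dev_err(dev, "Domain is not in nesting mode\n");
		return -EINVAL;
	}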
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	/*
> +	 * Caller must ensure PASID entry is not in use, i.e. not bind the
> +	 * same PASID to the same device twice.
> +	 */
> +	if (pasid_pte_is_present(pte))
> +		return -EBUSY;
Here you check that the PTE is not present; is that sufficient to
guarantee the above? Also referring to the potential race issue pointed
out by Kevin.
> +
> +	pasid_clear_entry(pte);
> +
> +	/* Sanity checking performed by caller to make sure address
> +	 * width matching in two dimensions:
> +	 * 1. CPU vs. IOMMU
> +	 * 2. Guest vs. Host.
> +	 */
> +	switch (addr_width) {
> +	case ADDR_WIDTH_5LEVEL:
> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> +			cap_5lp_support(iommu->cap)) {
> +			pasid_set_flpm(pte, 1);
> +		} else {
> +			dev_err(dev, "5-level paging not supported\n");
> +			return -EINVAL;
> +		}
> +		break;
> +	case ADDR_WIDTH_4LEVEL:
> +		pasid_set_flpm(pte, 0);
> +		break;
> +	default:
> +		dev_err(dev, "Invalid guest address width %d\n", addr_width);
> +		return -EINVAL;
> +	}
> +
> +	/* First level PGD is in GPA, must be supported by the second level */
> +	if ((u64)gpgd > domain->max_addr) {
> +		dev_err(dev, "Guest PGD %llx not supported, max %llx\n",
> +			(u64)gpgd, domain->max_addr);
> +		return -EINVAL;
> +	}
> +	pasid_set_flptr(pte, (u64)gpgd);
> +
> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> +	if (ret) {
> +		dev_err(dev, "Guest PASID bind data not supported\n");
Shall we output all those traces without limit? They are triggered by
userspace, which could therefore trigger a storm of them.
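One option (just a sketch) would be the ratelimited variant:

	dev_err_ratelimited(dev, "Guest PASID bind data not supported\n");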
> +		return ret;
> +	}
> +
> +	/* Setup the second level based on the given domain */
> +	pgd = domain->pgd;
> +
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err(dev, "Invalid domain page table\n");
> +		return -EINVAL;
> +	}
> +	pgd_val = virt_to_phys(pgd);
> +	pasid_set_slptr(pte, pgd_val);
> +	pasid_set_fault_enable(pte);
> +
> +	did = domain->iommu_did[iommu->seq_id];
> +	pasid_set_domain_id(pte, did);
> +
> +	pasid_set_address_width(pte, agaw);
> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> +
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> +	pasid_set_present(pte);
> +	pasid_flush_caches(iommu, pte, pasid, did);
> +
> +	return ret;
> +}
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 92de6df24ccb..698015ee3f04 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -36,6 +36,7 @@
>   * to vmalloc or even module mappings.
>   */
>  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> +#define PASID_FLAG_NESTED		BIT(1)
>  
>  /*
>   * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
> @@ -51,6 +52,11 @@ struct pasid_entry {
>  	u64 val[8];
>  };
>  
> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> +#define PASID_ENTRY_PGTT_NESTED		(3)
> +#define PASID_ENTRY_PGTT_PT		(4)
> +
>  /* The representative of a PASID table */
>  struct pasid_table {
>  	void			*table;		/* pasid table pointer */
> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, int pasid);
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *pgd,
> +			int pasid,
> +			struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ed7171d2ae1f..eda1d6687144 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -42,6 +42,9 @@
>  #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
>  #define DMA_FL_PTE_XD		BIT_ULL(63)
>  
> +#define ADDR_WIDTH_5LEVEL	(57)
> +#define ADDR_WIDTH_4LEVEL	(48)
> +
>  #define CONTEXT_TT_MULTI_LEVEL	0
>  #define CONTEXT_TT_DEV_IOTLB	1
>  #define CONTEXT_TT_PASS_THROUGH 2
> 
Thanks

Eric


* Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-20 23:27 ` [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support Jacob Pan
  2020-03-28  8:02   ` Tian, Kevin
@ 2020-03-29 13:40   ` Auger Eric
  2020-03-30 22:53     ` Jacob Pan
  1 sibling, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-29 13:40 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Tian, Kevin, Raj Ashok, Jonathan Cameron

Hi,

On 3/21/20 12:27 AM, Jacob Pan wrote:
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.
> 
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c |   4 +
>  drivers/iommu/intel-svm.c   | 224 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |   8 +-
>  include/linux/intel-svm.h   |  17 ++++
>  4 files changed, 252 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index e599b2537b1c..b1477cd423dd 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> +#endif
>  };
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index d7f2a5358900..47c0deb5ae56 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
>  	list_for_each_entry((sdev), &(svm)->devs, list)	\
>  		if ((d) != (sdev)->dev) {} else
>  
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +			struct device *dev,
> +			struct iommu_gpasid_bind_data *data)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct dmar_domain *ddomain;
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = 0;
> +
> +	if (WARN_ON(!iommu) || !data)
> +		return -EINVAL;
> +
> +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> +		return -EINVAL;
> +
> +	if (dev_is_pci(dev)) {
> +		/* VT-d supports devices with full 20 bit PASIDs only */
> +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> +			return -EINVAL;
> +	} else {
> +		return -ENOTSUPP;
> +	}
> +
> +	/*
> +	 * We only check host PASID range, we have no knowledge to check
> +	 * guest PASID range nor do we use the guest PASID.
nit: "nor do we use the guest PASID". Well, the guest PASID flag is
checked below and, if set, svm->gpasid is recorded ;-)
> +	 */
> +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> +		return -EINVAL;
> +
> +	ddomain = to_dmar_domain(domain);
> +
> +	/* Sanity check paging mode support match between host and guest */
> +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> +	    !cap_5lp_support(iommu->cap)) {
> +		pr_err("Cannot support 5 level paging requested by guest!\n");
> +		return -EINVAL;
nit: this check is also done in intel_pasid_setup_nested, with an extra
condition:
+	switch (addr_width) {
+	case ADDR_WIDTH_5LEVEL:
+		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+			cap_5lp_support(iommu->cap)) {

> +	}
> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, data->hpasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	if (svm) {
> +		/*
> +		 * If we found svm for the PASID, there must be at
> +		 * least one device bound, otherwise svm should be freed.
> +		 */
> +		if (WARN_ON(list_empty(&svm->devs))) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (svm->mm == get_task_mm(current) &&
> +		    data->hpasid == svm->pasid &&
> +		    data->gpasid == svm->gpasid) {
> +			pr_warn("Cannot bind the same guest-host PASID for the same process\n");
> +			mmput(svm->mm);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		mmput(current->mm);
> +
> +		for_each_svm_dev(sdev, svm, dev) {
> +			/* In case of multiple sub-devices of the same pdev
> +			 * assigned, we should allow multiple bind calls with
> +			 * the same PASID and pdev.
> +			 */
> +			sdev->users++;
> +			goto out;
> +		}
> +	} else {
> +		/* We come here when the PASID has never been bound to a device. */
> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> +		if (!svm) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		/* REVISIT: upper layer/VFIO can track host process that bind the PASID.
> +		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
> +		 * ownership.
> +		 */
> +		svm->mm = get_task_mm(current);
> +		svm->pasid = data->hpasid;
> +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> +			svm->gpasid = data->gpasid;
> +			svm->flags |= SVM_FLAG_GUEST_PASID;
> +		}
> +		ioasid_set_data(data->hpasid, svm);
> +		INIT_LIST_HEAD_RCU(&svm->devs);
> +		mmput(svm->mm);
> +	}
> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> +	if (!sdev) {
> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}
nit: the above 4 lines are duplicated 3 times. Might be worth a helper.
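For instance (hypothetical name, untested):

	static void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
	{
		/* Free the svm only once no device is bound to the PASID */
		if (list_empty(&svm->devs)) {
			ioasid_set_data(pasid, NULL);
			kfree(svm);
		}
	}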
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sdev->dev = dev;
> +	sdev->users = 1;
> +
> +	/* Set up device context entry for PASID if not enabled already */
> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> +	if (ret) {
> +		dev_err(dev, "Failed to enable PASID capability\n");
Unlimited tracing upon a userspace call? I don't know what the best policy is.
> +		kfree(sdev);
> +		/*
> +		 * If this is a new PASID that was never bound to a device, then
> +		 * the device list must be empty which indicates struct svm
> +		 * was allocated in this function.
> +		 */
> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}
> +		goto out;
> +	}
> +
> +	/*
> +	 * For guest bind, we need to set up PASID table entry as follows:
> +	 * - FLPM matches guest paging mode
> +	 * - turn on nested mode
> +	 * - SL guest address width matching
> +	 */
> +	ret = intel_pasid_setup_nested(iommu,
> +				       dev,
> +				       (pgd_t *)data->gpgd,
> +				       data->hpasid,
> +				       &data->vtd,
> +				       ddomain,
> +				       data->addr_width);
> +	if (ret) {
> +		dev_err(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
> +			data->hpasid, ret);
> +		/*
> +		 * PASID entry should be in cleared state if nested mode
> +		 * set up failed. So we only need to clear IOASID tracking
> +		 * data such that free call will succeed.
> +		 */
> +		kfree(sdev);
> +		if (list_empty(&svm->devs)) {
> +			ioasid_set_data(data->hpasid, NULL);
> +			kfree(svm);
> +		}

> +		goto out;
> +	}
> +	svm->flags |= SVM_FLAG_GUEST_MODE;
> +
> +	init_rcu_head(&sdev->rcu);
> +	list_add_rcu(&sdev->list, &svm->devs);
> + out:
> +	mutex_unlock(&pasid_mutex);
> +	return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = -EINVAL;
> +
> +	if (WARN_ON(!iommu))
> +		return -EINVAL;
> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (!svm) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	for_each_svm_dev(sdev, svm, dev) {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {
> +			list_del_rcu(&sdev->list);
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> +			/* TODO: Drain in flight PRQ for the PASID since it
> +			 * may get reused soon, we don't want to
> +			 * confuse with its previous life.
> +			 * intel_svm_drain_prq(dev, pasid);
> +			 */
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				/*
> +				 * We do not free PASID here until explicit call
> +				 * from VFIO to free. The PASID life cycle
> +				 * management is largely tied to VFIO management
> +				 * of assigned device life cycles. In case of
> +				 * guest exit without an explicit free PASID call,
> +				 * the responsibility lies in VFIO layer to free
> +				 * the PASIDs allocated for the guest.
> +				 * For security reasons, VFIO has to track the
> +				 * PASID ownership per guest anyway to ensure
> +				 * that PASID allocated by one guest cannot be
> +				 * used by another.
> +				 */
> +				ioasid_set_data(pasid, NULL);
> +				kfree(svm);
> +			}
> +		}
> +		break;
> +	}
> +out:
> +	mutex_unlock(&pasid_mutex);
> +
> +	return ret;
> +}
> +
>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
>  {
>  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index eda1d6687144..85b05120940e 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device *dev);
>  extern void intel_svm_check(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_gpasid_bind_data *data);
> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>  struct svm_dev_ops;
>  
>  struct intel_svm_dev {
> @@ -698,9 +700,13 @@ struct intel_svm_dev {
>  struct intel_svm {
>  	struct mmu_notifier notifier;
>  	struct mm_struct *mm;
> +
>  	struct intel_iommu *iommu;
>  	int flags;
>  	int pasid;
> +	int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
> +		     * to guest PASID mapping.
> +		     */
>  	struct list_head devs;
>  	struct list_head list;
>  };
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index d7c403d0dd27..c19690937540 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>   * do such IOTLB flushes automatically.
>   */
>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
> + * In this case the mm_struct is in the guest kernel or userspace, its life
> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
> + * means to bind/unbind guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE	(1<<2)
> +/*
> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
> + * which requires guest and host PASID translation at both directions. We keep
> + * track of guest PASID in order to provide lookup service to device drivers.
> + * One such example is a physical function (PF) driver that supports mediated
> + * device (mdev) assignment. Guest programming of mdev configuration space can
> + * only be done with guest PASID, therefore PF driver needs to find the matching
> + * host PASID to program the real hardware.
> + */
> +#define SVM_FLAG_GUEST_PASID	(1<<3)
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  
> 
Thanks

Eric


* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-28 10:01   ` Tian, Kevin
@ 2020-03-29 15:34     ` Auger Eric
  2020-03-31  2:49       ` Tian, Kevin
  2020-03-29 16:05     ` Auger Eric
  2020-03-31 18:13     ` Jacob Pan
  2 siblings, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-29 15:34 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

Hi,

On 3/28/20 11:01 AM, Tian, Kevin wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Sent: Saturday, March 21, 2020 7:28 AM
>>
>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>> vIOMMU, we need to provide invalidation support at IOMMU API and driver
>> level. This patch adds Intel VT-d specific function to implement
>> iommu passdown invalidate API for shared virtual address.
>>
>> The use case is for supporting caching structure invalidation
>> of assigned SVM capable devices. Emulated IOMMU exposes queue
> 
> emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> interface as well.
> 
>> invalidation capability and passes down all descriptors from the guest
>> to the physical IOMMU.
>>
>> The assumption is that guest to host device ID mapping should be
>> resolved prior to calling IOMMU driver. Based on the device handle,
>> host IOMMU driver can replace certain fields before submit to the
>> invalidation queue.
>>
>> ---
>> v7 review fixed in v10
>> ---
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
>> ---
>>  drivers/iommu/intel-iommu.c | 182
>> ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 182 insertions(+)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index b1477cd423dd..a76afb0fd51a 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -5619,6 +5619,187 @@ static void
>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>>  }
>>
>> +/*
>> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
>> + * VT-d granularity. Invalidation is typically included in the unmap operation
>> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
>> + * owns the first level page tables. Invalidations of translation caches in the
>> + * guest are trapped and passed down to the host.
>> + *
>> + * vIOMMU in the guest will only expose first level page tables, therefore
>> + * we do not include IOTLB granularity for request without PASID (second
>> level).
> 
> I would revise above as "We do not support IOTLB granularity for request 
> without PASID (second level), therefore any vIOMMU implementation that
> exposes the SVA capability to the guest should only expose the first level
> page tables, implying all invalidation requests from the guest will include
> a valid PASID"
> 
>> + *
>> + * For example, to find the VT-d granularity encoding for IOTLB
>> + * type and page selective granularity within PASID:
>> + * X: indexed by iommu cache type
>> + * Y: indexed by enum iommu_inv_granularity
>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
>> + *
>> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
>> + *
>> + */
>> +const static int
>> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_
>> NR] = {
>> +	/*
>> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
>> +	 * page selective (address granularity)
>> +	 */
>> +	{0, 1, 1},
>> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
>> +	{1, 1, 0},
> 
> Is this combination correct? When a single PASID is specified, it is
> essentially a page-selective invalidation, since you need to provide
> Address and Size.
Isn't it the same when G=1? The addr/size is still used. Doesn't it
correspond to IOMMU_INV_GRANU_ADDR with the IOMMU_INV_ADDR_FLAGS_PASID
flag unset?

so {0, 0, 1}?

Thanks

Eric

> 
>> +	/* PASID cache */
> 
> PASID cache is fully managed by the host. Guest PASID cache invalidation
> is interpreted by vIOMMU for bind and unbind operations. I don't think
> we should accept any PASID cache invalidation from userspace or guest.
> 
>> +	{1, 1, 0}
>> +};
>> +
>> +const static int
>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU
>> _NR] = {
>> +	/* PASID based IOTLB */
>> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>> +	/* PASID based dev TLBs */
>> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
>> +	/* PASID cache */
>> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
>> +};
>> +
>> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
>> +{
>> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
>> IOMMU_INV_GRANU_NR ||
>> +		!inv_type_granu_map[type][granu])
>> +		return -EINVAL;
>> +
>> +	*vtd_granu = inv_type_granu_table[type][granu];
>> +
> 
> btw do we really need both map and table here? Can't we just
> use one table with unsupported granularity marked as a special
> value?
> 
>> +	return 0;
>> +}
>> +
>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>> +{
>> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
>> +
>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>> +	 * IOMMU cache invalidate API passes granu_size in bytes, and
>> number of
>> +	 * granu size in contiguous memory.
>> +	 */
>> +	return order_base_2(nr_pages);
>> +}
>> +
>> +#ifdef CONFIG_INTEL_IOMMU_SVM
>> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
>> +		struct device *dev, struct iommu_cache_invalidate_info
>> *inv_info)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	struct intel_iommu *iommu;
>> +	unsigned long flags;
>> +	int cache_type;
>> +	u8 bus, devfn;
>> +	u16 did, sid;
>> +	int ret = 0;
>> +	u64 size = 0;
>> +
>> +	if (!inv_info || !dmar_domain ||
>> +		inv_info->version !=
>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>> +		return -EINVAL;
>> +
>> +	if (!dev || !dev_is_pci(dev))
>> +		return -ENODEV;
>> +
>> +	iommu = device_to_iommu(dev, &bus, &devfn);
>> +	if (!iommu)
>> +		return -ENODEV;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	spin_lock(&iommu->lock);
>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
>> +	if (!info) {
>> +		ret = -EINVAL;
>> +		goto out_unlock;
> 
> -ENOTSUPP?
> 
>> +	}
>> +	did = dmar_domain->iommu_did[iommu->seq_id];
>> +	sid = PCI_DEVID(bus, devfn);
>> +
>> +	/* Size is only valid in non-PASID selective invalidation */
>> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
>> +		size = to_vtd_size(inv_info->addr_info.granule_size,
>> +				   inv_info->addr_info.nb_granules);
>> +
>> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
>> IOMMU_CACHE_INV_TYPE_NR) {
>> +		int granu = 0;
>> +		u64 pasid = 0;
>> +
>> +		ret = to_vtd_granularity(cache_type, inv_info->granularity,
>> &granu);
>> +		if (ret) {
>> +			pr_err("Invalid cache type and granu
>> combination %d/%d\n", cache_type,
>> +				inv_info->granularity);
>> +			break;
>> +		}
>> +
>> +		/* PASID is stored in different locations based on granularity
>> */
>> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
>> +			inv_info->pasid_info.flags &
>> IOMMU_INV_PASID_FLAGS_PASID)
>> +			pasid = inv_info->pasid_info.pasid;
>> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR
>> &&
>> +			inv_info->addr_info.flags &
>> IOMMU_INV_ADDR_FLAGS_PASID)
>> +			pasid = inv_info->addr_info.pasid;
>> +		else {
>> +			pr_err("Cannot find PASID for given cache type and
>> granularity\n");
>> +			break;
>> +		}
>> +
>> +		switch (BIT(cache_type)) {
>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
>> +			if ((inv_info->granularity !=
>> IOMMU_INV_GRANU_PASID) &&
> 
> granularity == IOMMU_INV_GRANU_ADDR? Otherwise it's unclear
> why IOMMU_INV_GRANU_DOMAIN also needs a size check.
> 
>> +				size && (inv_info->addr_info.addr &
>> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>> +				pr_err("Address out of range, 0x%llx, size
>> order %llu\n",
>> +					inv_info->addr_info.addr, size);
>> +				ret = -ERANGE;
>> +				goto out_unlock;
>> +			}
>> +
>> +			qi_flush_piotlb(iommu, did,
>> +					pasid,
>> +					mm_to_dma_pfn(inv_info-
>>> addr_info.addr),
>> +					(granu == QI_GRAN_NONG_PASID) ? -
>> 1 : 1 << size,
>> +					inv_info->addr_info.flags &
>> IOMMU_INV_ADDR_FLAGS_LEAF);
>> +
>> +			/*
>> +			 * Always flush device IOTLB if ATS is enabled since
>> guest
>> +			 * vIOMMU exposes CM = 1, no device IOTLB flush
>> will be passed
>> +			 * down.
>> +			 */
> 
> Does the VT-d spec mention that no device IOTLB flush is required when CM=1?
> 
>> +			if (info->ats_enabled) {
>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info-
>>> pfsid,
>> +						pasid, info->ats_qdep,
>> +						inv_info->addr_info.addr,
>> size,
>> +						granu);
>> +			}
>> +			break;
>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>> +			if (info->ats_enabled) {
>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info-
>>> pfsid,
>> +						inv_info->addr_info.pasid,
>> info->ats_qdep,
>> +						inv_info->addr_info.addr,
>> size,
>> +						granu);
> 
> I'm confused here. There are two granularities allowed for devtlb, but here
> you only handle one of them?
> 
>> +			} else
>> +				pr_warn("Passdown device IOTLB flush w/o
>> ATS!\n");
>> +
>> +			break;
>> +		case IOMMU_CACHE_INV_TYPE_PASID:
>> +			qi_flush_pasid_cache(iommu, did, granu, inv_info-
>>> pasid_info.pasid);
>> +
> 
> As per the earlier comment, we shouldn't allow userspace or the guest to
> invalidate the PASID cache.
> 
>> +			break;
>> +		default:
>> +			dev_err(dev, "Unsupported IOMMU invalidation
>> type %d\n",
>> +				cache_type);
>> +			ret = -EINVAL;
>> +		}
>> +	}
>> +out_unlock:
>> +	spin_unlock(&iommu->lock);
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	return ret;
>> +}
>> +#endif
>> +
>>  static int intel_iommu_map(struct iommu_domain *domain,
>>  			   unsigned long iova, phys_addr_t hpa,
>>  			   size_t size, int iommu_prot, gfp_t gfp)
>> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
>>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>  #endif
>> --
>> 2.7.4
> 


* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-28 10:01   ` Tian, Kevin
  2020-03-29 15:34     ` Auger Eric
@ 2020-03-29 16:05     ` Auger Eric
  2020-03-31  3:34       ` Tian, Kevin
  2020-03-31 18:13     ` Jacob Pan
  2 siblings, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-29 16:05 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron



On 3/28/20 11:01 AM, Tian, Kevin wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Sent: Saturday, March 21, 2020 7:28 AM
>>
>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>> vIOMMU, we need to provide invalidation support at IOMMU API and driver
>> level. This patch adds Intel VT-d specific function to implement
>> iommu passdown invalidate API for shared virtual address.
>>
>> The use case is for supporting caching structure invalidation
>> of assigned SVM capable devices. Emulated IOMMU exposes queue
> 
> emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> interface as well.
> 
>> invalidation capability and passes down all descriptors from the guest
>> to the physical IOMMU.
>>
>> The assumption is that guest to host device ID mapping should be
>> resolved prior to calling IOMMU driver. Based on the device handle,
>> host IOMMU driver can replace certain fields before submit to the
>> invalidation queue.
>>
>> ---
>> v7 review fixed in v10
>> ---
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
>> ---
>>  drivers/iommu/intel-iommu.c | 182
>> ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 182 insertions(+)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index b1477cd423dd..a76afb0fd51a 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -5619,6 +5619,187 @@ static void
>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>>  }
>>
>> +/*
>> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
>> + * VT-d granularity. Invalidation is typically included in the unmap operation
>> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
>> + * owns the first level page tables. Invalidations of translation caches in the
>> + * guest are trapped and passed down to the host.
>> + *
>> + * vIOMMU in the guest will only expose first level page tables, therefore
>> + * we do not include IOTLB granularity for request without PASID (second
>> level).
> 
> I would revise above as "We do not support IOTLB granularity for request 
> without PASID (second level), therefore any vIOMMU implementation that
> exposes the SVA capability to the guest should only expose the first level
> page tables, implying all invalidation requests from the guest will include
> a valid PASID"
> 
>> + *
>> + * For example, to find the VT-d granularity encoding for IOTLB
>> + * type and page selective granularity within PASID:
>> + * X: indexed by iommu cache type
>> + * Y: indexed by enum iommu_inv_granularity
>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
>> + *
>> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
>> + *
>> + */
>> +const static int
>> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_
>> NR] = {
>> +	/*
>> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
>> +	 * page selective (address granularity)
>> +	 */
>> +	{0, 1, 1},
>> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
>> +	{1, 1, 0},
> 
> Is this combination correct? When a single PASID is specified, it is
> essentially a page-selective invalidation, since you need to provide
> Address and Size.
> 
>> +	/* PASID cache */
> 
> PASID cache is fully managed by the host. Guest PASID cache invalidation
> is interpreted by vIOMMU for bind and unbind operations. I don't think
> we should accept any PASID cache invalidation from userspace or guest.
I tend to agree here.
> 
>> +	{1, 1, 0}
>> +};
>> +
>> +const static int
>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU
>> _NR] = {
>> +	/* PASID based IOTLB */
>> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>> +	/* PASID based dev TLBs */
>> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
>> +	/* PASID cache */
>> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
>> +};
>> +
>> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
>> +{
>> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
>> IOMMU_INV_GRANU_NR ||
>> +		!inv_type_granu_map[type][granu])
>> +		return -EINVAL;
>> +
>> +	*vtd_granu = inv_type_granu_table[type][granu];
>> +
> 
> btw do we really need both map and table here? Can't we just
> use one table with unsupported granularity marked as a special
> value?
I asked the same question some time ago. If I remember correctly, the
issue is that while a granularity can be valid in inv_type_granu_map, the
associated value in inv_type_granu_table can be 0, which matches both
values of the G field (0 or 1) in the invalidation cmd. See the other
comment below.
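Purely illustrative, but a single table might work if an explicit
sentinel is reserved for invalid combinations, so that 0 stays a legal
encoding (the sentinel value below is an assumption; any value outside
the valid descriptor encodings would do):

	#define QI_GRAN_INVALID		(-1)

	const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
		/* PASID based IOTLB */
		{QI_GRAN_INVALID, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
		/* PASID based dev TLBs */
		{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, QI_GRAN_INVALID},
		/* PASID cache */
		{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, QI_GRAN_INVALID},
	};

	static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
	{
		if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
		    inv_type_granu_table[type][granu] == QI_GRAN_INVALID)
			return -EINVAL;

		*vtd_granu = inv_type_granu_table[type][granu];
		return 0;
	}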
> 
>> +	return 0;
>> +}
>> +
>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>> +{
>> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
>> +
>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>> +	 * IOMMU cache invalidate API passes granu_size in bytes, and
>> number of
>> +	 * granu size in contiguous memory.
>> +	 */
>> +	return order_base_2(nr_pages);
>> +}
>> +
>> +#ifdef CONFIG_INTEL_IOMMU_SVM
>> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
>> +		struct device *dev, struct iommu_cache_invalidate_info
>> *inv_info)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	struct intel_iommu *iommu;
>> +	unsigned long flags;
>> +	int cache_type;
>> +	u8 bus, devfn;
>> +	u16 did, sid;
>> +	int ret = 0;
>> +	u64 size = 0;
>> +
>> +	if (!inv_info || !dmar_domain ||
>> +		inv_info->version !=
>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>> +		return -EINVAL;
>> +
>> +	if (!dev || !dev_is_pci(dev))
>> +		return -ENODEV;
>> +
>> +	iommu = device_to_iommu(dev, &bus, &devfn);
>> +	if (!iommu)
>> +		return -ENODEV;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	spin_lock(&iommu->lock);
>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
>> +	if (!info) {
>> +		ret = -EINVAL;
>> +		goto out_unlock;
> 
> -ENOTSUPP?
> 
>> +	}
>> +	did = dmar_domain->iommu_did[iommu->seq_id];
>> +	sid = PCI_DEVID(bus, devfn);
>> +
>> +	/* Size is only valid in non-PASID selective invalidation */
>> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
>> +		size = to_vtd_size(inv_info->addr_info.granule_size,
>> +				   inv_info->addr_info.nb_granules);
>> +
>> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
>> IOMMU_CACHE_INV_TYPE_NR) {
>> +		int granu = 0;
>> +		u64 pasid = 0;
>> +
>> +		ret = to_vtd_granularity(cache_type, inv_info->granularity,
>> &granu);
>> +		if (ret) {
>> +			pr_err("Invalid cache type and granu
>> combination %d/%d\n", cache_type,
>> +				inv_info->granularity);
>> +			break;
>> +		}
>> +
>> +		/* PASID is stored in different locations based on granularity
>> */
>> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
>> +			inv_info->pasid_info.flags &
>> IOMMU_INV_PASID_FLAGS_PASID)
>> +			pasid = inv_info->pasid_info.pasid;
>> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR
>> &&
>> +			inv_info->addr_info.flags &
>> IOMMU_INV_ADDR_FLAGS_PASID)
>> +			pasid = inv_info->addr_info.pasid;
>> +		else {
>> +			pr_err("Cannot find PASID for given cache type and
>> granularity\n");
>> +			break;
>> +		}
>> +
>> +		switch (BIT(cache_type)) {
>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
>> +			if ((inv_info->granularity !=
>> IOMMU_INV_GRANU_PASID) &&
> 
> granularity == IOMMU_INV_GRANU_ADDR? Otherwise it's unclear
> why IOMMU_INV_GRANU_DOMAIN also needs a size check.
> 
>> +				size && (inv_info->addr_info.addr &
>> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>> +				pr_err("Address out of range, 0x%llx, size
>> order %llu\n",
>> +					inv_info->addr_info.addr, size);
>> +				ret = -ERANGE;
>> +				goto out_unlock;
>> +			}
>> +
>> +			qi_flush_piotlb(iommu, did,
>> +					pasid,
>> +					mm_to_dma_pfn(inv_info-
>>> addr_info.addr),
>> +					(granu == QI_GRAN_NONG_PASID) ? -
>> 1 : 1 << size,
>> +					inv_info->addr_info.flags &
>> IOMMU_INV_ADDR_FLAGS_LEAF);
>> +
>> +			/*
>> +			 * Always flush device IOTLB if ATS is enabled since
>> guest
>> +			 * vIOMMU exposes CM = 1, no device IOTLB flush
>> will be passed
>> +			 * down.
>> +			 */
> 
> Does the VT-d spec mention that no device IOTLB flush is required when CM=1?
> 
>> +			if (info->ats_enabled) {
>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info-
>>> pfsid,
>> +						pasid, info->ats_qdep,
>> +						inv_info->addr_info.addr,
>> size,
>> +						granu);
>> +			}
>> +			break;
>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>> +			if (info->ats_enabled) {
>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info-
>>> pfsid,
>> +						inv_info->addr_info.pasid,
>> info->ats_qdep,
>> +						inv_info->addr_info.addr,
>> size,
>> +						granu);
> 
> I'm confused here. There are two granularities allowed for devtlb, but here
> you only handle one of them?
granu is the result of to_vtd_granularity() so it can take either of the
2 values.

Thanks

Eric
> 
>> +			} else
>> +				pr_warn("Passdown device IOTLB flush w/o
>> ATS!\n");
>> +
>> +			break;
>> +		case IOMMU_CACHE_INV_TYPE_PASID:
>> +			qi_flush_pasid_cache(iommu, did, granu, inv_info-
>>> pasid_info.pasid);
>> +
> 
> As per the earlier comment, we shouldn't allow userspace or the guest to
> invalidate the PASID cache.
> 
>> +			break;
>> +		default:
>> +			dev_err(dev, "Unsupported IOMMU invalidation
>> type %d\n",
>> +				cache_type);
>> +			ret = -EINVAL;
>> +		}
>> +	}
>> +out_unlock:
>> +	spin_unlock(&iommu->lock);
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	return ret;
>> +}
>> +#endif
>> +
>>  static int intel_iommu_map(struct iommu_domain *domain,
>>  			   unsigned long iova, phys_addr_t hpa,
>>  			   size_t size, int iommu_prot, gfp_t gfp)
>> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
>>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>  #endif
>> --
>> 2.7.4
> 


* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-20 23:27 ` [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
  2020-03-28 10:01   ` Tian, Kevin
@ 2020-03-29 16:05   ` Auger Eric
  2020-03-31 22:28     ` Jacob Pan
  1 sibling, 1 reply; 67+ messages in thread
From: Auger Eric @ 2020-03-29 16:05 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj Ashok, Jonathan Cameron

Hi Jacob,

On 3/21/20 12:27 AM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
> 
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
> 
> ---
> v7 review fixed in v10
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 182 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index b1477cd423dd..a76afb0fd51a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>  }
>  
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> + * owns the first level page tables. Invalidations of translation caches in the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not include IOTLB granularity for request without PASID (second level).
> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> +	/*
> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> +	 * page selective (address granularity)
> +	 */
> +	{0, 1, 1},
> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> +	{1, 1, 0},
> +	/* PASID cache */
> +	{1, 1, 0}
> +};
> +
> +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> +	/* PASID based IOTLB */
> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> +	/* PASID based dev TLBs */
> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> +	/* PASID cache */
> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> +{
> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> +		!inv_type_granu_map[type][granu])
> +		return -EINVAL;
> +
> +	*vtd_granu = inv_type_granu_table[type][granu];
> +
> +	return 0;
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> +	 * granu size in contiguous memory.
> +	 */
> +	return order_base_2(nr_pages);
> +}
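Just to double-check the encoding with a worked example: granule_size =
4096 and nb_granules = 512 gives nr_pages = 512, and order_base_2(512) =
9, i.e. the 2MB encoding, matching the comment above.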
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct intel_iommu *iommu;
> +	unsigned long flags;
> +	int cache_type;
> +	u8 bus, devfn;
> +	u16 did, sid;
> +	int ret = 0;
> +	u64 size = 0;
> +
> +	if (!inv_info || !dmar_domain ||
> +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> +		return -EINVAL;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	spin_lock(&iommu->lock);
> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +	if (!info) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +
> +	/* Size is only valid in non-PASID selective invalidation */
> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> +				   inv_info->addr_info.nb_granules);
> +
> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> +		int granu = 0;
> +		u64 pasid = 0;
> +
> +		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
> +		if (ret) {
> +			pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
> +				inv_info->granularity);
> +			break;
> +		}
> +
> +		/* PASID is stored in different locations based on granularity */
> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> +			pasid = inv_info->pasid_info.pasid;
> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> +			pasid = inv_info->addr_info.pasid;
> +		else {
> +			pr_err("Cannot find PASID for given cache type and granularity\n");
I don't get this error msg. In case of a domain-selective invalidation,
the PASID is not used, so unless I am mistaken you will end up here even
though there is no issue.
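Maybe something like this (sketch), leaving pasid at 0 for the
domain-selective case:

	if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
	    inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
		pasid = inv_info->pasid_info.pasid;
	else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
		 inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
		pasid = inv_info->addr_info.pasid;
	else if (inv_info->granularity != IOMMU_INV_GRANU_DOMAIN) {
		pr_err("Cannot find PASID for given cache type and granularity\n");
		break;
	}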
> +			break;
> +		}
> +
> +		switch (BIT(cache_type)) {
> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> +			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&
> +				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> +					inv_info->addr_info.addr, size);
> +				ret = -ERANGE;
> +				goto out_unlock;
> +			}
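nit: as Kevin pointed out, the first condition may read better as an
explicit address-granularity test, e.g. (sketch):

	if (inv_info->granularity == IOMMU_INV_GRANU_ADDR && size &&
	    (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
		pr_err("Address out of range, 0x%llx, size order %llu\n",
		       inv_info->addr_info.addr, size);
		ret = -ERANGE;
		goto out_unlock;
	}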
> +
> +			qi_flush_piotlb(iommu, did,
> +					pasid,
> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> +
> +			/*
> +			 * Always flush device IOTLB if ATS is enabled since guest
> +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> +			 * down.
> +			 */
> +			if (info->ats_enabled) {
nit: {} not needed for a single statement
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			}
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> +			if (info->ats_enabled) {
nit: {} not needed for a single statement
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						inv_info->addr_info.pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			} else
> +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> +
nit: extra line
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_PASID:
> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> +
nit: extra line
> +			break;
> +		default:
> +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> +				cache_type);
> +			ret = -EINVAL;
> +		}
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}
> +#endif
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot, gfp_t gfp)
> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>  #endif
> 
Thanks

Eric


* Re: [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw
  2020-03-29  7:20     ` Lu Baolu
@ 2020-03-30 17:50       ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-03-30 17:50 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Raj, Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Sun, 29 Mar 2020 15:20:55 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> On 2020/3/27 19:53, Tian, Kevin wrote:
> >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>  
> > 
> > could you elaborate in which scenario this helper function is
> > required?  
> 
> I added below commit message:
> 
>      An Intel iommu domain uses 5-level page table by default. If the
>      iommu that the domain tries to attach supports less page levels,
>      the top level page tables should be skipped. Add a helper to do
>      this so that it could be used in other places.
> 
Thanks Baolu,
I will also add this to my v11, it might save you some time :)


> Best regards,
> baolu
> 
> >     
> >> ---
> >>   drivers/iommu/intel-pasid.c | 22 ++++++++++++++++++++++
> >>   1 file changed, 22 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-pasid.c
> >> b/drivers/iommu/intel-pasid.c index 22b30f10b396..191508c7c03e
> >> 100644 --- a/drivers/iommu/intel-pasid.c
> >> +++ b/drivers/iommu/intel-pasid.c
> >> @@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct
> >> intel_iommu *iommu,
> >>   }
> >>
> >>   /*
> >> + * Skip top levels of page tables for iommu which has less agaw
> >> + * than default. Unnecessary for PT mode.
> >> + */
> >> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> >> +				  struct intel_iommu *iommu,
> >> +				  struct dma_pte **pgd)
> >> +{
> >> +	int agaw;
> >> +
> >> +	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> >> +		*pgd = phys_to_virt(dma_pte_addr(*pgd));
> >> +		if (!dma_pte_present(*pgd)) {
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> >> +	pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n",
> >> __func__, (u64)*pgd,
> >> +		iommu->agaw, domain->agaw);
> >> +
> >> +	return agaw;
> >> +}
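For the record, a minimal usage sketch (mirroring the call site in
intel_pasid_setup_nested; the agaw values are illustrative):

	struct dma_pte *pgd = domain->pgd;
	int agaw;

	/*
	 * e.g. domain->agaw == 3 (5-level) and iommu->agaw == 2 (4-level):
	 * the loop dereferences the top PGD entry once and returns 2.
	 */
	agaw = iommu_skip_agaw(domain, iommu, &pgd);
	if (agaw < 0)
		return -EINVAL;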
> >> +
> >> +/*
> >>    * Set up the scalable mode pasid entry for second only
> >> translation type. */
> >>   int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> >> --
> >> 2.7.4  
> >   

[Jacob Pan]

* Re: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-29  8:03     ` Lu Baolu
@ 2020-03-30 18:21       ` Jacob Pan
  2020-03-31  3:36         ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-30 18:21 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Yi L, Tian, Kevin, Raj, Ashok, Jean-Philippe Brucker, iommu,
	LKML, Alex Williamson, David Woodhouse, Jonathan Cameron

On Sun, 29 Mar 2020 16:03:36 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> On 2020/3/27 20:21, Tian, Kevin wrote:
> >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.  
> > 
> > now the spec is already at rev3.1 😊  
> 
> Updated.
> 
> >   
> >> With PASID granular translation type set to 0x11b, translation
> >> result from the first level(FL) also subject to a second level(SL)
> >> page table translation. This mode is used for SVA virtualization,
> >> where FL performs guest virtual to guest physical translation and
> >> SL performs guest physical to host physical translation.
> >>
> >> This patch adds a helper function for setting up nested translation
> >> where second level comes from a domain and first level comes from
> >> a guest PGD.
> >>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >> ---
> >>   drivers/iommu/intel-pasid.c | 240
> >> +++++++++++++++++++++++++++++++++++++++++++-
> >>   drivers/iommu/intel-pasid.h |  12 +++
> >>   include/linux/intel-iommu.h |   3 +
> >>   3 files changed, 252 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/iommu/intel-pasid.c
> >> b/drivers/iommu/intel-pasid.c index 9bdb7ee228b6..10c7856afc6b
> >> 100644 --- a/drivers/iommu/intel-pasid.c
> >> +++ b/drivers/iommu/intel-pasid.c
> >> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64
> >> value) pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> >>   }
> >>
> >> +/*
> >> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value <<
> >> 27); +}
> >> +
> >> +/*
> >> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value <<
> >> 32); +}
> >> +
> >> +/*
> >> + * Setup the Cache Disable (CD) field (Bit 89)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_cd(struct pasid_entry *pe)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_emte(struct pasid_entry *pe)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_eafe(struct pasid_entry *pe)
> >> +{
> >> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pcd(struct pasid_entry *pe)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Page-level Write-Through (PWT) field (Bit 94)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pwt(struct pasid_entry *pe)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> >> +}
> >> +
> >>   static void
> >>   pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> >>   				    u16 did, int pasid)
> >> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct
> >> intel_iommu *iommu,
> >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >>   	/* Setup Present and PASID Granular Transfer Type: */
> >> -	pasid_set_translation_type(pte, 1);
> >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
> >>   	pasid_set_present(pte);
> >>   	pasid_flush_caches(iommu, pte, pasid, did);
> >>
> >> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
> >> intel_iommu *iommu,
> >>   	pasid_set_domain_id(pte, did);
> >>   	pasid_set_slptr(pte, pgd_val);
> >>   	pasid_set_address_width(pte, agaw);
> >> -	pasid_set_translation_type(pte, 2);
> >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
> >>   	pasid_set_fault_enable(pte);
> >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
> >> intel_iommu *iommu,
> >>   	pasid_clear_entry(pte);
> >>   	pasid_set_domain_id(pte, did);
> >>   	pasid_set_address_width(pte, iommu->agaw);
> >> -	pasid_set_translation_type(pte, 4);
> >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
> >>   	pasid_set_fault_enable(pte);
> >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct
> >> intel_iommu *iommu,
> >>
> >>   	return 0;
> >>   }
> >> +
> >> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> >> +				struct pasid_entry *pte,
> >> +				struct iommu_gpasid_bind_data_vtd
> >> *pasid_data)
> >> +{
> >> +	/*
> >> +	 * Not all guest PASID table entry fields are passed down
> >> during bind,
> >> +	 * here we only set up the ones that are dependent on
> >> guest settings.
> >> +	 * Execution related bits such as NXE, SMEP are not
> >> meaningful to IOMMU,
> >> +	 * therefore not set. Other fields, such as snoop
> >> related, are set based
> >> +	 * on host needs regardless of guest settings.
> >> +	 */
> >> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> >> +		if (!ecap_srs(iommu->ecap)) {
> >> +			pr_err("No supervisor request support on
> >> %s\n",
> >> +			       iommu->name);
> >> +			return -EINVAL;
> >> +		}
> >> +		pasid_set_sre(pte);
> >> +	}
> >> +
> >> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> >> +		if (!ecap_eafs(iommu->ecap)) {
> >> +			pr_err("No extended access flag support
> >> on %s\n",
> >> +				iommu->name);
> >> +			return -EINVAL;
> >> +		}
> >> +		pasid_set_eafe(pte);
> >> +	}
> >> +
> >> +	/*
> >> +	 * Memory type is only applicable to devices inside
> >> processor coherent
> >> +	 * domain. PCIe devices are not included. We can skip the
> >> rest of the
> >> +	 * flags if IOMMU does not support MTS.  
> > 
> > when you say that PCI devices are not included, is it simply for
> > information, or should we impose some check to make sure the below
> > path is not applied to them?  
> 
> Jacob, does it work for you if I add below check?
> 
> 	if (ecap_mts(iommu->ecap) && !dev_is_pci(dev))
> 
> Or, we need to remove this comment line?
> 
> >   
> >> +	 */
> >> +	if (ecap_mts(iommu->ecap)) {
> >> +		if (pasid_data->flags &
> >> IOMMU_SVA_VTD_GPASID_EMTE) {
> >> +			pasid_set_emte(pte);
> >> +			pasid_set_emt(pte, pasid_data->emt);
> >> +		}
> >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> >> +			pasid_set_pcd(pte);
> >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> >> +			pasid_set_pwt(pte);
> >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> >> +			pasid_set_cd(pte);
> >> +		pasid_set_pat(pte, pasid_data->pat);
> >> +	} else if (pasid_data->flags &
> >> IOMMU_SVA_VTD_GPASID_MTS_MASK) {
> >> +		pr_err("No memory type support for bind guest
> >> PASID on %s\n",
> >> +			iommu->name);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	return 0;
> >> +
> >> +}
> >> +
> >> +/**
> >> + * intel_pasid_setup_nested() - Set up PASID entry for nested
> >> translation.
> >> + * This could be used for guest shared virtual address. In this
> >> case, the
> >> + * first level page tables are used for GVA-GPA translation in
> >> the guest,
> >> + * second level page tables are used for GPA-HPA translation.  
> > 
> > GVA->GPA is just one example. It could be gIOVA->GPA too. Here the
> > point is that the first level is the translation table managed by
> > the guest.  
> 
> Agreed.
> 
Yes, that is why I chose the wording "could be" :), but mentioning both
cases is good.
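
Something like the below for the reworded kerneldoc (sketch only,
exact wording to be finalized in the next version):

/**
 * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
 * The first level page tables are managed by the guest, e.g. for
 * GVA-GPA or gIOVA-GPA translation; the second level page tables map
 * GPA to HPA.
 */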

> >   
> >> + *
> >> + * @iommu:      IOMMU which the device belong to
> >> + * @dev:        Device to be set up for translation
> >> + * @gpgd:       FLPTPTR: First Level Page translation pointer in
> >> GPA
> >> + * @pasid:      PASID to be programmed in the device PASID table
> >> + * @pasid_data: Additional PASID info from the guest bind request
> >> + * @domain:     Domain info for setting up second level page
> >> tables
> >> + * @addr_width: Address width of the first level (guest)
> >> + */
> >> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> >> +			struct device *dev, pgd_t *gpgd,
> >> +			int pasid, struct
> >> iommu_gpasid_bind_data_vtd *pasid_data,
> >> +			struct dmar_domain *domain,
> >> +			int addr_width)
> >> +{
> >> +	struct pasid_entry *pte;
> >> +	struct dma_pte *pgd;
> >> +	int ret = 0;
> >> +	u64 pgd_val;
> >> +	int agaw;
> >> +	u16 did;
> >> +
> >> +	if (!ecap_nest(iommu->ecap)) {
> >> +		pr_err("IOMMU: %s: No nested translation
> >> support\n",
> >> +		       iommu->name);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	pte = intel_pasid_get_entry(dev, pasid);
> >> +	if (WARN_ON(!pte))
> >> +		return -EINVAL;  
> > 
> > should we have intel_pasid_get_entry to return error which is then
> > carried here? Looking at that function there could be error
> > conditions both being invalid parameter and no memory...  
> 
> Agreed. Will do this in a followup patch.
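
(A minimal sketch of the followup idea: let intel_pasid_get_entry()
return ERR_PTR(-EINVAL) or ERR_PTR(-ENOMEM) and propagate that here,
e.g.

	pte = intel_pasid_get_entry(dev, pasid);
	if (IS_ERR(pte))
		return PTR_ERR(pte);
)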
> 
> >   
> >> +
> >> +	/*
> >> +	 * Caller must ensure PASID entry is not in use, i.e. not
> >> bind the
> >> +	 * same PASID to the same device twice.
> >> +	 */
> >> +	if (pasid_pte_is_present(pte))
> >> +		return -EBUSY;  
> > 
> > is any lock held outside of this function? curious whether any race
> > condition may happen in between.  
> 
> The pasid entry change should always be protected by iommu->lock.
> 
Agreed.
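
(Sketch only, assuming the existing locking scheme: callers would wrap
the entry update roughly as

	spin_lock(&iommu->lock);
	ret = intel_pasid_setup_nested(iommu, dev, gpgd, pasid,
				       pasid_data, domain, addr_width);
	spin_unlock(&iommu->lock);

so the pasid_pte_is_present() check cannot race with a concurrent bind
of the same PASID.)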

> >   
> >> +
> >> +	pasid_clear_entry(pte);
> >> +
> >> +	/* Sanity checking performed by caller to make sure the address
> >> +	 * width matches in two dimensions:
> >> +	 * 1. CPU vs. IOMMU
> >> +	 * 2. Guest vs. Host.
> >> +	 */
> >> +	switch (addr_width) {
> >> +	case ADDR_WIDTH_5LEVEL:
> >> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> >> +			cap_5lp_support(iommu->cap)) {
> >> +			pasid_set_flpm(pte, 1);  
> > 
> > define a macro for 4lvl and 5lvl
> >   
> >> +		} else {
> >> +			dev_err(dev, "5-level paging not
> >> supported\n");
> >> +			return -EINVAL;
> >> +		}
> >> +		break;
> >> +	case ADDR_WIDTH_4LEVEL:
> >> +		pasid_set_flpm(pte, 0);
> >> +		break;
> >> +	default:
> >> +		dev_err(dev, "Invalid guest address width %d\n",
> >> addr_width);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	/* First level PGD is in GPA, must be supported by the
> >> second level */
> >> +	if ((u64)gpgd > domain->max_addr) {
> >> +		dev_err(dev, "Guest PGD %llx not supported, max
> >> %llx\n",
> >> +			(u64)gpgd, domain->max_addr);
> >> +		return -EINVAL;
> >> +	}
> >> +	pasid_set_flptr(pte, (u64)gpgd);
> >> +
> >> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> >> +	if (ret) {
> >> +		dev_err(dev, "Guest PASID bind data not
> >> supported\n");
> >> +		return ret;
> >> +	}
> >> +
> >> +	/* Setup the second level based on the given domain */
> >> +	pgd = domain->pgd;
> >> +
> >> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> >> +	if (agaw < 0) {
> >> +		dev_err(dev, "Invalid domain page table\n");
> >> +		return -EINVAL;
> >> +	}
> >> +	pgd_val = virt_to_phys(pgd);
> >> +	pasid_set_slptr(pte, pgd_val);
> >> +	pasid_set_fault_enable(pte);
> >> +
> >> +	did = domain->iommu_did[iommu->seq_id];
> >> +	pasid_set_domain_id(pte, did);
> >> +
> >> +	pasid_set_address_width(pte, agaw);
> >> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >> +
> >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> >> +	pasid_set_present(pte);
> >> +	pasid_flush_caches(iommu, pte, pasid, did);
> >> +
> >> +	return ret;
> >> +}
> >> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> >> index 92de6df24ccb..698015ee3f04 100644
> >> --- a/drivers/iommu/intel-pasid.h
> >> +++ b/drivers/iommu/intel-pasid.h
> >> @@ -36,6 +36,7 @@
> >>    * to vmalloc or even module mappings.
> >>    */
> >>   #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> >> +#define PASID_FLAG_NESTED		BIT(1)
> >>
> >>   /*
> >>    * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for
> >> first- @@ -51,6 +52,11 @@ struct pasid_entry {
> >>   	u64 val[8];
> >>   };
> >>
> >> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> >> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> >> +#define PASID_ENTRY_PGTT_NESTED		(3)
> >> +#define PASID_ENTRY_PGTT_PT		(4)
> >> +
> >>   /* The representative of a PASID table */
> >>   struct pasid_table {
> >>   	void			*table;		/* pasid table pointer */
> >> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> >>   int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> >>   				   struct dmar_domain *domain,
> >>   				   struct device *dev, int pasid);
> >> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> >> +			struct device *dev, pgd_t *pgd,
> >> +			int pasid,
> >> +			struct iommu_gpasid_bind_data_vtd
> >> *pasid_data,
> >> +			struct dmar_domain *domain,
> >> +			int addr_width);
> >>   void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> >>   				 struct device *dev, int pasid);
> >>
> >> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> >> index ed7171d2ae1f..eda1d6687144 100644
> >> --- a/include/linux/intel-iommu.h
> >> +++ b/include/linux/intel-iommu.h
> >> @@ -42,6 +42,9 @@
> >>   #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
> >>   #define DMA_FL_PTE_XD		BIT_ULL(63)
> >>
> >> +#define ADDR_WIDTH_5LEVEL	(57)
> >> +#define ADDR_WIDTH_4LEVEL	(48)
> >> +
> >>   #define CONTEXT_TT_MULTI_LEVEL	0
> >>   #define CONTEXT_TT_DEV_IOTLB	1
> >>   #define CONTEXT_TT_PASS_THROUGH 2
> >> --
> >> 2.7.4  
> >   
> 
> Best regards,
> baolu

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-28  8:02   ` Tian, Kevin
@ 2020-03-30 20:51     ` Jacob Pan
  2020-03-31  3:43       ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-30 20:51 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yi L, Raj, Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Sat, 28 Mar 2020 08:02:01 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > When supporting guest SVA with an emulated IOMMU, the guest PASID
> > table is shadowed in the VMM. Updates to the guest vIOMMU PASID
> > table result in PASID cache flushes, which are passed down to
> > the host as bind guest PASID calls.
> > 
> > The SL page tables are harvested from the device's default domain
> > (requests w/o PASID), or from the aux domain in the case of a
> > mediated device.
> > 
> >     .-------------.  .---------------------------.
> >     |   vIOMMU    |  | Guest process CR3, FL only|
> >     |             |  '---------------------------'
> >     .----------------/
> >     | PASID Entry |--- PASID cache flush -
> >     '-------------'                       |
> >     |             |                       V
> >     |             |                CR3 in GPA
> >     '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> >     .-------------.  .----------------------.
> >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >     |             |  '----------------------'
> >     .----------------/  |
> >     | PASID Entry |     V (Nested xlate)
> >     '----------------\.------------------------------.
> >     |             |   |SL for GPA-HPA, default domain|
> >     |             |   '------------------------------'
> >     '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 224
> > ++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/intel-iommu.h |   8 +-
> >  include/linux/intel-svm.h   |  17 ++++
> >  4 files changed, 252 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index e599b2537b1c..b1477cd423dd 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> > 
> >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..47c0deb5ae56 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> >  	list_for_each_entry((sdev), &(svm)->devs, list)	\
> >  		if ((d) != (sdev)->dev) {} else
> > 
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +			struct device *dev,
> > +			struct iommu_gpasid_bind_data *data)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct dmar_domain *ddomain;  
> 
> what about the full name e.g. dmar_domain? though a bit longer
> but clearer than ddomain.
> 
Sure, I don't have a preference.

> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!iommu) || !data)
> > +		return -EINVAL;
> > +
> > +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +		return -EINVAL;
> > +
> > +	if (dev_is_pci(dev)) {
> > +		/* VT-d supports devices with full 20 bit PASIDs
> > only */
> > +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +			return -EINVAL;
> > +	} else {
> > +		return -ENOTSUPP;
> > +	}
> > +
> > +	/*
> > +	 * We only check host PASID range, we have no knowledge to
> > check
> > +	 * guest PASID range nor do we use the guest PASID.
> > +	 */
> > +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > +		return -EINVAL;
> > +
> > +	ddomain = to_dmar_domain(domain);
> > +
> > +	/* Sanity check paging mode support match between host and
> > guest */
> > +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > +	    !cap_5lp_support(iommu->cap)) {
> > +		pr_err("Cannot support 5 level paging requested by
> > guest!\n");
> > +		return -EINVAL;
> > +	}  
> 
> -ENOTSUPP?
I was thinking from this API's p.o.v. that the input is invalid, since
both cap and addr_width are derived from input arguments.

> 
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, data->hpasid, NULL);
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	if (svm) {
> > +		/*
> > +		 * If we found svm for the PASID, there must be at
> > +		 * least one device bound, otherwise svm should be
> > freed.
> > +		 */
> > +		if (WARN_ON(list_empty(&svm->devs))) {
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		if (svm->mm == get_task_mm(current) &&
> > +		    data->hpasid == svm->pasid &&
> > +		    data->gpasid == svm->gpasid) {
> > +			pr_warn("Cannot bind the same guest-host
> > PASID for the same process\n");  
> 
> Sorry I didn’t get the rationale here. Isn't this branch is for
> binding the same PASID to multiple devices? In that case definitely
> it is binding the same guest-host PASID for the same process.
> otherwise if hpasid is different then you'll hit a different
> intel_svm, while if gpasid is different how you can use one intel_svm
> to hold multiple gpasids?
> 
> I feel the error condition should be the opposite. and suppose
> SVM_FLAG_ GUEST_PASID should be verified before checking gpasid.
> 
You are right; actually we don't need the check here. The scenario of
multiple devices bound to the same PASID is handled in
for_each_svm_dev(). I will remove this code.
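
i.e. the svm-found branch would reduce to something like (sketch):

	if (svm) {
		if (WARN_ON(list_empty(&svm->devs))) {
			ret = -EINVAL;
			goto out;
		}

		for_each_svm_dev(sdev, svm, dev) {
			sdev->users++;
			goto out;
		}
	}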

> > +			mmput(svm->mm);
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		mmput(current->mm);
> > +
> > +		for_each_svm_dev(sdev, svm, dev) {
> > +			/* In case of multiple sub-devices of the
> > same pdev
> > +			 * assigned, we should allow multiple bind
> > calls with
> > +			 * the same PASID and pdev.  
> 
> Does sub-device mean mdev? I didn't find such notation in current
> iommu directory.
> 
Yes, it is intended for mdev.
> and to make it clearer, "In case of multiple mdevs of the same pdev
> assigned to the same guest process".
> 
I am avoiding mdev on purpose since it is not a concept in the IOMMU
driver. Sub-device is more generic.

> > +			 */
> > +			sdev->users++;
> > +			goto out;
> > +		}
> > +	} else {
> > +		/* We come here when PASID has never been bound to a
> > device. */
> > +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > +		if (!svm) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		/* REVISIT: upper layer/VFIO can track host
> > process that bind the PASID.
> > +		 * ioasid_set = mm might be sufficient for vfio to
> > check pasid VMM
> > +		 * ownership.
> > +		 */  
> 
> Above message is unclear about what should be revisited. Does it
> describe the current implementation or the expected revision in the
> future? 
> 
What I meant was that if VFIO can check PASID-mm ownership by itself,
then we don't have to store svm->mm here. Will drop the line below and
add a comment to clarify.

> > +		svm->mm = get_task_mm(current);
> > +		svm->pasid = data->hpasid;
> > +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > +			svm->gpasid = data->gpasid;
> > +			svm->flags |= SVM_FLAG_GUEST_PASID;
> > +		}
> > +		ioasid_set_data(data->hpasid, svm);
> > +		INIT_LIST_HEAD_RCU(&svm->devs);
> > +		mmput(svm->mm);
> > +	}
> > +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > +	if (!sdev) {
> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	sdev->dev = dev;
> > +	sdev->users = 1;
> > +
> > +	/* Set up device context entry for PASID if not enabled
> > already */
> > +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > +	if (ret) {
> > +		dev_err(dev, "Failed to enable PASID
> > capability\n");
> > +		kfree(sdev);
> > +		/*
> > +		 * If this is a new PASID that was never bound to a
> > device, then
> > +		 * the device list must be empty which indicates
> > struct svm
> > +		 * was allocated in this function.
> > +		 */  
> 
> > the comment had better move to the 1st occurrence, where sdev
> > allocation fails. Or even better, put it in the out label...
> 
Sounds good.

> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * For guest bind, we need to set up PASID table entry as
> > follows:
> > +	 * - FLPM matches guest paging mode
> > +	 * - turn on nested mode
> > +	 * - SL guest address width matching
> > +	 */  
> 
> looks above just explains the internal detail of
> intel_pasid_setup_nested, which is not necessary to be here.
> 
Right, will remove the comments.

> > +	ret = intel_pasid_setup_nested(iommu,
> > +				       dev,
> > +				       (pgd_t *)data->gpgd,
> > +				       data->hpasid,
> > +				       &data->vtd,
> > +				       ddomain,
> > +				       data->addr_width);  
> 
> It's worthy of an explanation here that setup_nested is required for
> every device (even when they are sharing same intel_svm) because
> we allocate pasid table per device. Otherwise I made a mistake to
> think that only the 1st device bound to a new hpasid requires this
> step. 😊
> 
Good suggestion, I will add the comments as:
/*
 * PASID table is per device for better security. Therefore, for
 * each bind of a new device even with an existing PASID, we need to
 * call the nested mode setup function here.
 */

> > +	if (ret) {
> > +		dev_err(dev, "Failed to set up PASID %llu in
> > nested mode, Err %d\n",
> > +			data->hpasid, ret);
> > +		/*
> > +		 * PASID entry should be in cleared state if
> > nested mode
> > +		 * set up failed. So we only need to clear IOASID
> > tracking
> > +		 * data such that free call will succeed.
> > +		 */
> > +		kfree(sdev);
> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}
> > +		goto out;
> > +	}
> > +	svm->flags |= SVM_FLAG_GUEST_MODE;
> > +
> > +	init_rcu_head(&sdev->rcu);
> > +	list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > +	mutex_unlock(&pasid_mutex);
> > +	return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = -EINVAL;
> > +
> > +	if (WARN_ON(!iommu))
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, pasid, NULL);
> > +	if (!svm) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	for_each_svm_dev(sdev, svm, dev) {
> > +		ret = 0;
> > +		sdev->users--;
> > +		if (!sdev->users) {
> > +			list_del_rcu(&sdev->list);
> > +			intel_pasid_tear_down_entry(iommu, dev,
> > svm-  
> > >pasid);  
> > +			/* TODO: Drain in flight PRQ for the PASID
> > since it
> > +			 * may get reused soon, we don't want to
> > +			 * confuse with its previous life.
> > +			 * intel_svm_drain_prq(dev, pasid);
> > +			 */
> > +			kfree_rcu(sdev, rcu);
> > +
> > +			if (list_empty(&svm->devs)) {
> > +				/*
> > +				 * We do not free PASID here until
> > explicit call
> > +				 * from VFIO to free. The PASID
> > life cycle
> > +				 * management is largely tied to
> > VFIO management
> > +				 * of assigned device life cycles.
> > In case of
> > +				 * guest exit without an explicit
> > free PASID call,
> > +				 * the responsibility lies in VFIO
> > layer to free
> > +				 * the PASIDs allocated for the
> > guest.
> > +				 * For security reasons, VFIO has
> > to track the
> > +				 * PASID ownership per guest
> > anyway to ensure
> > +				 * that PASID allocated by one
> > guest cannot be
> > +				 * used by another.  
> 
> As commented in other patches, VFIO is only one example user of this
> API... 
> 
Right, how about this:
	/*
	 * We do not free the IOASID here because the
	 * IOMMU driver did not allocate it.
	 * Unlike native SVM, the IOASID for guest use was
	 * allocated prior to the bind call.
	 * In any case, if the free call comes before
	 * the unbind, the IOMMU driver will get notified
	 * and perform cleanup.
	 */

> > +				 */
> > +				ioasid_set_data(pasid, NULL);
> > +				kfree(svm);
> > +			}
> > +		}
> > +		break;
> > +	}  
> 
> what about no dev match? an -EINVAL is also required then.
> 
Yes, ret is initialized to -EINVAL.
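
i.e. the existing flow already covers the no-match case (sketch of the
relevant lines):

	int ret = -EINVAL;
	...
	for_each_svm_dev(sdev, svm, dev) {
		ret = 0;
		...
		break;
	}
	/* no matching device found: fall through to out with -EINVAL */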

> > +out:
> > +	mutex_unlock(&pasid_mutex);
> > +
> > +	return ret;
> > +}
> > +
> >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops)
> >  {
> >  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index eda1d6687144..85b05120940e 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device *dev);
> >  extern void intel_svm_check(struct intel_iommu *iommu);
> >  extern void intel_svm_check(struct intel_iommu *iommu);
> >  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> >  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +		struct device *dev, struct iommu_gpasid_bind_data *data);
> > +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> >  struct svm_dev_ops;
> > 
> >  struct intel_svm_dev {
> > @@ -698,9 +700,13 @@ struct intel_svm_dev {
> >  struct intel_svm {
> >  	struct mmu_notifier notifier;
> >  	struct mm_struct *mm;
> > +
> >  	struct intel_iommu *iommu;
> >  	int flags;
> >  	int pasid;
> > +	int gpasid; /* Guest PASID in case of vSVA bind with
> > non-identity host
> > +		     * to guest PASID mapping.
> > +		     */  
> 
> we don't need to highlight identity or non-identity thing, since
> either way shares the same infrastructure here and it is not the
> knowledge that the kernel driver should assume
> 
Sorry, I don't get your point.

What I meant was that this field "gpasid" is only used in the
non-identity case. In the identity case, we don't have
SVM_FLAG_GUEST_PASID.

> >  	struct list_head devs;
> >  	struct list_head list;
> >  };
> > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > index d7c403d0dd27..c19690937540 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> >   * do such IOTLB flushes automatically.
> >   */
> >  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds
> > to a device.
> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE	(1<<2)
> > +/*
> > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > PASID space,
> > + * which requires guest and host PASID translation in both
> > directions. We keep
> > + * track of guest PASID in order to provide lookup service to
> > device drivers.
> > + * One such example is a physical function (PF) driver that
> > supports mediated
> > + * device (mdev) assignment. Guest programming of mdev
> > configuration space can
> > + * only be done with guest PASID, therefore PF driver needs to
> > find the matching
> > + * host PASID to program the real hardware.
> > + */
> > +#define SVM_FLAG_GUEST_PASID	(1<<3)
> > 
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > 
> > --
> > 2.7.4  
> 

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-29 13:40   ` Auger Eric
@ 2020-03-30 22:53     ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-03-30 22:53 UTC (permalink / raw)
  To: Auger Eric
  Cc: Yi L, Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Sun, 29 Mar 2020 15:40:22 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > When supporting guest SVA with an emulated IOMMU, the guest PASID
> > table is shadowed in the VMM. Updates to the guest vIOMMU PASID
> > table result in PASID cache flushes, which are passed down to
> > the host as bind guest PASID calls.
> > 
> > The SL page tables are harvested from the device's default domain
> > (requests w/o PASID), or from the aux domain in the case of a
> > mediated device.
> > 
> >     .-------------.  .---------------------------.
> >     |   vIOMMU    |  | Guest process CR3, FL only|
> >     |             |  '---------------------------'
> >     .----------------/
> >     | PASID Entry |--- PASID cache flush -
> >     '-------------'                       |
> >     |             |                       V
> >     |             |                CR3 in GPA
> >     '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> >     .-------------.  .----------------------.
> >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >     |             |  '----------------------'
> >     .----------------/  |
> >     | PASID Entry |     V (Nested xlate)
> >     '----------------\.------------------------------.
> >     |             |   |SL for GPA-HPA, default domain|
> >     |             |   '------------------------------'
> >     '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 224 ++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/intel-iommu.h |   8 +-
> >  include/linux/intel-svm.h   |  17 ++++
> >  4 files changed, 252 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index e599b2537b1c..b1477cd423dd 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> >  
> >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..47c0deb5ae56 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> >  	list_for_each_entry((sdev), &(svm)->devs, list)	\
> >  		if ((d) != (sdev)->dev) {} else
> >  
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +			struct device *dev,
> > +			struct iommu_gpasid_bind_data *data)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct dmar_domain *ddomain;
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!iommu) || !data)
> > +		return -EINVAL;
> > +
> > +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +		return -EINVAL;
> > +
> > +	if (dev_is_pci(dev)) {
> > +		/* VT-d supports devices with full 20 bit PASIDs
> > only */
> > +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +			return -EINVAL;
> > +	} else {
> > +		return -ENOTSUPP;
> > +	}
> > +
> > +	/*
> > +	 * We only check host PASID range, we have no knowledge to
> > check
> > +	 * guest PASID range nor do we use the guest PASID.  
> nit : "nor do we use the guest PASID". Well the guest PASID FLAG is
> checked below and if set, svm->gpasid is set ;-)
Yes, it is a little contradictory; I will remove that wording.

I meant we don't really use the gpasid for real work in the host driver :)

> > +	 */
> > +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > +		return -EINVAL;
> > +
> > +	ddomain = to_dmar_domain(domain);
> > +
> > +	/* Sanity check paging mode support match between host and
> > guest */
> > +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > +	    !cap_5lp_support(iommu->cap)) {
> > +		pr_err("Cannot support 5 level paging requested by
> > guest!\n");
> > +		return -EINVAL;  
> nit: This check also is done in intel_pasid_setup_nested with an extra
> check:
Good catch, I will remove this.

> +	switch (addr_width) {
> +	case ADDR_WIDTH_5LEVEL:
> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> +			cap_5lp_support(iommu->cap)) {
> 
> > +	}
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, data->hpasid, NULL);
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	if (svm) {
> > +		/*
> > +		 * If we found svm for the PASID, there must be at
> > +		 * least one device bound, otherwise svm should be
> > freed.
> > +		 */
> > +		if (WARN_ON(list_empty(&svm->devs))) {
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		if (svm->mm == get_task_mm(current) &&
> > +		    data->hpasid == svm->pasid &&
> > +		    data->gpasid == svm->gpasid) {
> > +			pr_warn("Cannot bind the same guest-host
> > PASID for the same process\n");
> > +			mmput(svm->mm);
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		mmput(current->mm);
> > +
> > +		for_each_svm_dev(sdev, svm, dev) {
> > +			/* In case of multiple sub-devices of the
> > same pdev
> > +			 * assigned, we should allow multiple bind
> > calls with
> > +			 * the same PASID and pdev.
> > +			 */
> > +			sdev->users++;
> > +			goto out;
> > +		}
> > +	} else {
> > +		/* We come here when PASID has never been bound to
> > a device. */
> > +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > +		if (!svm) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		/* REVISIT: upper layer/VFIO can track host
> > process that bind the PASID.
> > +		 * ioasid_set = mm might be sufficient for vfio to
> > check pasid VMM
> > +		 * ownership.
> > +		 */
> > +		svm->mm = get_task_mm(current);
> > +		svm->pasid = data->hpasid;
> > +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > +			svm->gpasid = data->gpasid;
> > +			svm->flags |= SVM_FLAG_GUEST_PASID;
> > +		}
> > +		ioasid_set_data(data->hpasid, svm);
> > +		INIT_LIST_HEAD_RCU(&svm->devs);
> > +		mmput(svm->mm);
> > +	}
> > +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > +	if (!sdev) {
> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}  
> nit: the above 4 lines are duplicated 3 times. Might be worth a
> helper.
Good point, I will add a helper like this:

static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
{
	if (list_empty(&svm->devs)) {
		ioasid_set_data(pasid, NULL);
		kfree(svm);
	}
}
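
so that each of the duplicated error paths collapses to, e.g. (sketch):

	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
	if (!sdev) {
		intel_svm_free_if_empty(svm, data->hpasid);
		ret = -ENOMEM;
		goto out;
	}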

> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	sdev->dev = dev;
> > +	sdev->users = 1;
> > +
> > +	/* Set up device context entry for PASID if not enabled
> > already */
> > +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > +	if (ret) {
> > +		dev_err(dev, "Failed to enable PASID
> > capability\n");  
> unlimited tracing upon userspace call? Don't know what is the best
> policy.
Good point. Perhaps just use dev_err_ratelimited for all user calls?
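
e.g. (sketch):

	dev_err_ratelimited(dev, "Failed to enable PASID capability\n");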

> > +		kfree(sdev);
> > +		/*
> > +		 * If this is a new PASID that was never bound to a
> > device, then
> > +		 * the device list must be empty which indicates
> > struct svm
> > +		 * was allocated in this function.
> > +		 */
> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * For guest bind, we need to set up PASID table entry as
> > follows:
> > +	 * - FLPM matches guest paging mode
> > +	 * - turn on nested mode
> > +	 * - SL guest address width matching
> > +	 */
> > +	ret = intel_pasid_setup_nested(iommu,
> > +				       dev,
> > +				       (pgd_t *)data->gpgd,
> > +				       data->hpasid,
> > +				       &data->vtd,
> > +				       ddomain,
> > +				       data->addr_width);
> > +	if (ret) {
> > +		dev_err(dev, "Failed to set up PASID %llu in
> > nested mode, Err %d\n",
> > +			data->hpasid, ret);
> > +		/*
> > +		 * PASID entry should be in cleared state if
> > nested mode
> > +		 * set up failed. So we only need to clear IOASID
> > tracking
> > +		 * data such that free call will succeed.
> > +		 */
> > +		kfree(sdev);
> > +		if (list_empty(&svm->devs)) {
> > +			ioasid_set_data(data->hpasid, NULL);
> > +			kfree(svm);
> > +		}  
> 
> > +		goto out;
> > +	}
> > +	svm->flags |= SVM_FLAG_GUEST_MODE;
> > +
> > +	init_rcu_head(&sdev->rcu);
> > +	list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > +	mutex_unlock(&pasid_mutex);
> > +	return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = -EINVAL;
> > +
> > +	if (WARN_ON(!iommu))
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, pasid, NULL);
> > +	if (!svm) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	for_each_svm_dev(sdev, svm, dev) {
> > +		ret = 0;
> > +		sdev->users--;
> > +		if (!sdev->users) {
> > +			list_del_rcu(&sdev->list);
> > +			intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);
> > +			/* TODO: Drain in flight PRQ for the PASID
> > since it
> > +			 * may get reused soon, we don't want to
> > +			 * confuse with its previous life.
> > +			 * intel_svm_drain_prq(dev, pasid);
> > +			 */
> > +			kfree_rcu(sdev, rcu);
> > +
> > +			if (list_empty(&svm->devs)) {
> > +				/*
> > +				 * We do not free PASID here until
> > explicit call
> > +				 * from VFIO to free. The PASID
> > life cycle
> > +				 * management is largely tied to
> > VFIO management
> > +				 * of assigned device life cycles.
> > In case of
> > +				 * guest exit without an explicit
> > free PASID call,
> > +				 * the responsibility lies in VFIO
> > layer to free
> > +				 * the PASIDs allocated for the
> > guest.
> > +				 * For security reasons, VFIO has
> > to track the
> > +				 * PASID ownership per guest
> > anyway to ensure
> > +				 * that PASID allocated by one
> > guest cannot be
> > +				 * used by another.
> > +				 */
> > +				ioasid_set_data(pasid, NULL);
> > +				kfree(svm);
> > +			}
> > +		}
> > +		break;
> > +	}
> > +out:
> > +	mutex_unlock(&pasid_mutex);
> > +
> > +	return ret;
> > +}
> > +
> >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops) {
> >  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index eda1d6687144..85b05120940e 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device *dev);
> >  extern void intel_svm_check(struct intel_iommu *iommu);
> >  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> >  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +		struct device *dev, struct iommu_gpasid_bind_data *data);
> > +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> >  struct svm_dev_ops;
> >  
> >  struct intel_svm_dev {
> > @@ -698,9 +700,13 @@ struct intel_svm_dev {
> >  struct intel_svm {
> >  	struct mmu_notifier notifier;
> >  	struct mm_struct *mm;
> > +
> >  	struct intel_iommu *iommu;
> >  	int flags;
> >  	int pasid;
> > +	int gpasid; /* Guest PASID in case of vSVA bind with
> > non-identity host
> > +		     * to guest PASID mapping.
> > +		     */
> >  	struct list_head devs;
> >  	struct list_head list;
> >  };
> > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > index d7c403d0dd27..c19690937540 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> >   * do such IOTLB flushes automatically.
> >   */
> >  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds
> > to a device.
> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE	(1<<2)
> > +/*
> > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > PASID space,
> > + * which requires guest and host PASID translation in both
> > directions. We keep
> > + * track of guest PASID in order to provide lookup service to
> > device drivers.
> > + * One such example is a physical function (PF) driver that
> > supports mediated
> > + * device (mdev) assignment. Guest programming of mdev
> > configuration space can
> > + * only be done with guest PASID, therefore PF driver needs to
> > find the matching
> > + * host PASID to program the real hardware.
> > + */
> > +#define SVM_FLAG_GUEST_PASID	(1<<3)
> >  
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types
  2020-03-27 14:46   ` Auger Eric
@ 2020-03-30 23:28     ` Jacob Pan
  2020-03-31 16:13       ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-30 23:28 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Fri, 27 Mar 2020 15:46:23 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > When Shared Virtual Memory is exposed to a guest via vIOMMU,
> > scalable IOTLB invalidation may be passed down from outside IOMMU
> > subsystems. This patch adds invalidation functions that can be used
> > for additional translation cache types.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > 
> > ---
> > v9 -> v10:
> > Fix off by 1 in pasid device iotlb flush
> > 
> > Address v7 missed review from Eric
> > 
> > ---
> > ---
> >  drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
> >  drivers/iommu/intel-pasid.c |  3 ++-
> >  include/linux/intel-iommu.h | 20 ++++++++++++++++----
> >  3 files changed, 54 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index f77dae7ba7d4..4d6b7b5b37ee 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
> >  	qi_submit_sync(&desc, iommu);
> >  }
> >  
> > +/* PASID-based device IOTLB Invalidate */
> > +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> > +		u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu)
> > +{
> > +	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
> > +	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
> > +
> > +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
> > QI_DEV_EIOTLB_SID(sid) |
> > +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> > +		QI_DEV_IOTLB_PFSID(pfsid);
> > +	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
> > +
> > +	/*
> > +	 * If S bit is 0, we only flush a single page. If S bit is
> > set,
> > +	 * The least significant zero bit indicates the
> > invalidation address
> > +	 * range. VT-d spec 6.5.2.6.
> > +	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates
> > 16KB.
> > +	 * size order = 0 is PAGE_SIZE 4KB
> > +	 * Max Invs Pending (MIP) is set to 0 for now until we
> > have DIT in
> > +	 * ECAP.
> > +	 */
> > +	desc.qw1 |= addr & ~mask;
> > +	if (size_order)
> > +		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> > +
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +
> > +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> > +{
> > +	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
> > +
> > +	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
> > QI_PC_GRAN(granu) | QI_PC_TYPE;
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +
> >  /*
> >   * Disable Queued Invalidation interface.
> >   */
> > diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > index 10c7856afc6b..9f6d07410722 100644
> > --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> >  {
> >  	struct qi_desc desc;
> >  
> > -	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL |
> > QI_PC_PASID(pasid);
> > +	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
> > +		QI_PC_PASID(pasid) | QI_PC_TYPE;  
> Just a nit, this fix is not documented in the commit message.
> 
Thanks, I just sent out this fix separately. Will remove this from the
set.
https://lkml.org/lkml/2020/3/30/1065

> Besides
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Thanks
> 
> Eric
> 
> >  	desc.qw1 = 0;
> >  	desc.qw2 = 0;
> >  	desc.qw3 = 0;
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 85b05120940e..43539713b3b3 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -334,7 +334,7 @@ enum {
> >  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
> >  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
> >  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
> > -#define QI_IOTLB_AM(am)		(((u8)am))
> > +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
> >  #define QI_CC_FM(fm)		(((u64)fm) << 48)
> >  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> > @@ -353,16 +353,21 @@ enum {
> >  #define QI_PC_DID(did)		(((u64)did) << 16)
> >  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
> >  
> > -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> > -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> > +/* PASID cache invalidation granu */
> > +#define QI_PC_ALL_PASIDS	0
> > +#define QI_PC_PASID_SEL		1
> >  
> >  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> >  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> > -#define QI_EIOTLB_AM(am)	(((u64)am))
> > +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
> >  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
> >  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
> >  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
> >  
> > +/* QI Dev-IOTLB inv granu */
> > +#define QI_DEV_IOTLB_GRAN_ALL		1
> > +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> > +
> >  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
> >  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
> >  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> > @@ -662,8 +667,15 @@ extern void qi_flush_iotlb(struct intel_iommu
> > *iommu, u16 did, u64 addr, unsigned int size_order, u64 type);
> >  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid,
> > u16 pfsid, u16 qdep, u64 addr, unsigned mask);
> > +
> >  void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32
> > pasid, u64 addr, unsigned long npages, bool ih);
> > +
> > +extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu,
> > u16 sid, u16 pfsid,
> > +			u32 pasid, u16 qdep, u64 addr, unsigned
> > size_order, u64 granu); +
> > +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16
> > did, u64 granu, int pasid); +
> >  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu
> > *iommu); 
> >  extern int dmar_ir_support(void);
> >   
> 

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-29 15:34     ` Auger Eric
@ 2020-03-31  2:49       ` Tian, Kevin
  2020-03-31 20:58         ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-31  2:49 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Sunday, March 29, 2020 11:34 PM
> 
> Hi,
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submit to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> >> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182
> >> ++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void
> >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> >> + * owns the first level page tables. Invalidations of translation caches in the
> >> + * guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int
> >> +inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +	/*
> >> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +	 * page selective (address granularity)
> >> +	 */
> >> +	{0, 1, 1},
> >> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +	{1, 1, 0},
> >
> > Is this combination correct? When a single PASID is being specified, it is
> > essentially a page-selective invalidation, since you need to provide Address
> > and Size.
> Isn't it the same when G=1? Still the addr/size is used. Doesn't it

I thought addr/size is not used when G=1, but it might be wrong. I'm
checking with our vt-d spec owner.

> correspond to IOMMU_INV_GRANU_ADDR with
> IOMMU_INV_ADDR_FLAGS_PASID flag
> unset?
> 
> so {0, 0, 1}?

I have one more open:

How does userspace know which invalidation type/gran is supported?
I didn't see such capability reporting in Yi's VFIO vSVA patch set. Do we
want the user/kernel to assume the same capability set if they are
architectural? However, the kernel could also do some optimization,
e.g. hide the devtlb invalidation capability given that the kernel already
invalidates the devtlb automatically when serving iotlb invalidations...

Thanks
Kevin

> 
> Thanks
> 
> Eric
> 
> >
> >> +	/* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache invalidation
> > is interpreted by vIOMMU for bind and unbind operations. I don't think
> > we should accept any PASID cache invalidation from userspace or guest.
> >
> >> +	{1, 1, 0}
> >> +};
> >> +
> >> +const static int
> >> +inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +	/* PASID based IOTLB */
> >> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >> +	/* PASID based dev TLBs */
> >> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> >> +	/* PASID cache */
> >> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> >> +};
> >> +
> >> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> >> +{
> >> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
> >> IOMMU_INV_GRANU_NR ||
> >> +		!inv_type_granu_map[type][granu])
> >> +		return -EINVAL;
> >> +
> >> +	*vtd_granu = inv_type_granu_table[type][granu];
> >> +
> >
> > btw do we really need both map and table here? Can't we just
> > use one table with unsupported granularity marked as a special
> > value?
> >
> >> +	return 0;
> >> +}
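
(To illustrate the "one table" idea above -- a sketch only, using a
negative value as the "unsupported" marker, since some valid VT-d granu
encodings are 0:

	const static int
	inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
		{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
		{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, -EINVAL},
		{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, -EINVAL},
	};

	static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
	{
		if (type >= IOMMU_CACHE_INV_TYPE_NR ||
		    granu >= IOMMU_INV_GRANU_NR ||
		    inv_type_granu_table[type][granu] < 0)
			return -EINVAL;

		*vtd_granu = inv_type_granu_table[type][granu];
		return 0;
	}
)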
> >> +
> >> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> >> +{
> >> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> >> +
> >> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> >> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> >> +	 * granu size in contiguous memory.
> >> +	 */
> >> +	return order_base_2(nr_pages);
> >> +}
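
(Worked example of the encoding above: granu_size = 4KB and
nr_granules = 512 gives nr_pages = 512, so order_base_2(512) = 9,
i.e. the 2MB encoding mentioned in the comment.)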
> >> +
> >> +#ifdef CONFIG_INTEL_IOMMU_SVM
> >> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> >> +		struct device *dev, struct iommu_cache_invalidate_info
> >> *inv_info)
> >> +{
> >> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >> +	struct device_domain_info *info;
> >> +	struct intel_iommu *iommu;
> >> +	unsigned long flags;
> >> +	int cache_type;
> >> +	u8 bus, devfn;
> >> +	u16 did, sid;
> >> +	int ret = 0;
> >> +	u64 size = 0;
> >> +
> >> +	if (!inv_info || !dmar_domain ||
> >> +		inv_info->version !=
> >> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >> +		return -EINVAL;
> >> +
> >> +	if (!dev || !dev_is_pci(dev))
> >> +		return -ENODEV;
> >> +
> >> +	iommu = device_to_iommu(dev, &bus, &devfn);
> >> +	if (!iommu)
> >> +		return -ENODEV;
> >> +
> >> +	spin_lock_irqsave(&device_domain_lock, flags);
> >> +	spin_lock(&iommu->lock);
> >> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> >> +	if (!info) {
> >> +		ret = -EINVAL;
> >> +		goto out_unlock;
> >
> > -ENOTSUPP?
> >
> >> +	}
> >> +	did = dmar_domain->iommu_did[iommu->seq_id];
> >> +	sid = PCI_DEVID(bus, devfn);
> >> +
> >> +	/* Size is only valid in non-PASID selective invalidation */
> >> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> >> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> >> +				   inv_info->addr_info.nb_granules);
> >> +
> >> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
> >> IOMMU_CACHE_INV_TYPE_NR) {
> >> +		int granu = 0;
> >> +		u64 pasid = 0;
> >> +
> >> +		ret = to_vtd_granularity(cache_type, inv_info->granularity,
> >> &granu);
> >> +		if (ret) {
> >> +			pr_err("Invalid cache type and granu
> >> combination %d/%d\n", cache_type,
> >> +				inv_info->granularity);
> >> +			break;
> >> +		}
> >> +
> >> +		/* PASID is stored in different locations based on granularity
> >> */
> >> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> >> +			inv_info->pasid_info.flags &
> >> IOMMU_INV_PASID_FLAGS_PASID)
> >> +			pasid = inv_info->pasid_info.pasid;
> >> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR
> >> &&
> >> +			inv_info->addr_info.flags &
> >> IOMMU_INV_ADDR_FLAGS_PASID)
> >> +			pasid = inv_info->addr_info.pasid;
> >> +		else {
> >> +			pr_err("Cannot find PASID for given cache type and
> >> granularity\n");
> >> +			break;
> >> +		}
> >> +
> >> +		switch (BIT(cache_type)) {
> >> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> >> +			if ((inv_info->granularity !=
> >> IOMMU_INV_GRANU_PASID) &&
> >
> > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> >
> >> +				size && (inv_info->addr_info.addr &
> >> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> >> +				pr_err("Address out of range, 0x%llx, size
> >> order %llu\n",
> >> +					inv_info->addr_info.addr, size);
> >> +				ret = -ERANGE;
> >> +				goto out_unlock;
> >> +			}
> >> +
> >> +			qi_flush_piotlb(iommu, did,
> >> +					pasid,
> >> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> >> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> >> +					inv_info->addr_info.flags &
> >> IOMMU_INV_ADDR_FLAGS_LEAF);
> >> +
> >> +			/*
> >> +			 * Always flush device IOTLB if ATS is enabled since
> >> guest
> >> +			 * vIOMMU exposes CM = 1, no device IOTLB flush
> >> will be passed
> >> +			 * down.
> >> +			 */
> >
> > Does the VT-d spec mention that no device IOTLB flush is required when CM=1?
> >
> >> +			if (info->ats_enabled) {
> >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >> +						pasid, info->ats_qdep,
> >> +						inv_info->addr_info.addr,
> >> size,
> >> +						granu);
> >> +			}
> >> +			break;
> >> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> >> +			if (info->ats_enabled) {
> >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >> +						inv_info->addr_info.pasid,
> >> info->ats_qdep,
> >> +						inv_info->addr_info.addr,
> >> size,
> >> +						granu);
> >
> > I'm confused here. There are two granularities allowed for devtlb, but here
> > you only handle one of them?
> >
> >> +			} else
> >> +				pr_warn("Passdown device IOTLB flush w/o
> >> ATS!\n");
> >> +
> >> +			break;
> >> +		case IOMMU_CACHE_INV_TYPE_PASID:
> >> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> >> +
> >
> > As commented earlier, we shouldn't allow userspace or the guest to
> > invalidate the PASID cache.
> >
> >> +			break;
> >> +		default:
> >> +			dev_err(dev, "Unsupported IOMMU invalidation
> >> type %d\n",
> >> +				cache_type);
> >> +			ret = -EINVAL;
> >> +		}
> >> +	}
> >> +out_unlock:
> >> +	spin_unlock(&iommu->lock);
> >> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> >> +
> >> +	return ret;
> >> +}
> >> +#endif
> >> +
> >>  static int intel_iommu_map(struct iommu_domain *domain,
> >>  			   unsigned long iova, phys_addr_t hpa,
> >>  			   size_t size, int iommu_prot, gfp_t gfp)
> >> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> >>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >>  #ifdef CONFIG_INTEL_IOMMU_SVM
> >> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>  #endif
> >> --
> >> 2.7.4
> >

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-29 16:05     ` Auger Eric
@ 2020-03-31  3:34       ` Tian, Kevin
  2020-03-31 21:07         ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-31  3:34 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj, Ashok, Jonathan Cameron

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Monday, March 30, 2020 12:05 AM
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submitting to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> >> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void
> >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> >> + * owns the first level page tables. Invalidations of translation caches in the
> >> + * guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second
> >> level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +	/*
> >> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +	 * page selective (address granularity)
> >> +	 */
> >> +	{0, 1, 1},
> >> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +	{1, 1, 0},
> >
> > Is this combination correct? When a single PASID is being specified, it is
> > essentially a page-selective invalidation, since you need to provide Address
> > and Size.
> >
> >> +	/* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache invalidation
> > is interpreted by vIOMMU for bind and unbind operations. I don't think
> > we should accept any PASID cache invalidation from userspace or guest.
> I tend to agree here.
> >
> >> +	{1, 1, 0}
> >> +};
> >> +
> >> +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +	/* PASID based IOTLB */
> >> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >> +	/* PASID based dev TLBs */
> >> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> >> +	/* PASID cache */
> >> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> >> +};
> >> +
> >> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> >> +{
> >> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
> >> IOMMU_INV_GRANU_NR ||
> >> +		!inv_type_granu_map[type][granu])
> >> +		return -EINVAL;
> >> +
> >> +	*vtd_granu = inv_type_granu_table[type][granu];
> >> +
> >
> > btw do we really need both map and table here? Can't we just
> > use one table with unsupported granularity marked as a special
> > value?
> I asked the same question some time ago. If I remember correctly, the
> issue is that while a granu can be marked supported in inv_type_granu_map,
> the associated value in inv_type_granu_table can legitimately be 0, so 0
> cannot double as an "unsupported" marker. The table values cover both
> values of the G field (0 or 1) in the invalidation cmd. See other
> comment below.

I didn't fully understand it. What does a value of '0' imply? It is also
interesting to see below in [PATCH 07/11]:

+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
+

> >
> >> +	return 0;
> >> +}
> >> +
> >> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> >> +{
> >> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> >> +
> >> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4K, 9 for 2MB, etc.
> >> +	 * The IOMMU cache invalidate API passes granu_size in bytes, and the
> >> +	 * number of granules in contiguous memory.
> >> +	 */
> >> +	return order_base_2(nr_pages);
> >> +}
> >> +
> >> +#ifdef CONFIG_INTEL_IOMMU_SVM
> >> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> >> +		struct device *dev, struct iommu_cache_invalidate_info
> >> *inv_info)
> >> +{
> >> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >> +	struct device_domain_info *info;
> >> +	struct intel_iommu *iommu;
> >> +	unsigned long flags;
> >> +	int cache_type;
> >> +	u8 bus, devfn;
> >> +	u16 did, sid;
> >> +	int ret = 0;
> >> +	u64 size = 0;
> >> +
> >> +	if (!inv_info || !dmar_domain ||
> >> +		inv_info->version !=
> >> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >> +		return -EINVAL;
> >> +
> >> +	if (!dev || !dev_is_pci(dev))
> >> +		return -ENODEV;
> >> +
> >> +	iommu = device_to_iommu(dev, &bus, &devfn);
> >> +	if (!iommu)
> >> +		return -ENODEV;
> >> +
> >> +	spin_lock_irqsave(&device_domain_lock, flags);
> >> +	spin_lock(&iommu->lock);
> >> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> >> +	if (!info) {
> >> +		ret = -EINVAL;
> >> +		goto out_unlock;
> >
> > -ENOTSUPP?
> >
> >> +	}
> >> +	did = dmar_domain->iommu_did[iommu->seq_id];
> >> +	sid = PCI_DEVID(bus, devfn);
> >> +
> >> +	/* Size is only valid in non-PASID selective invalidation */
> >> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> >> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> >> +				   inv_info->addr_info.nb_granules);
> >> +
> >> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
> >> IOMMU_CACHE_INV_TYPE_NR) {
> >> +		int granu = 0;
> >> +		u64 pasid = 0;
> >> +
> >> +		ret = to_vtd_granularity(cache_type, inv_info->granularity,
> >> &granu);
> >> +		if (ret) {
> >> +			pr_err("Invalid cache type and granu
> >> combination %d/%d\n", cache_type,
> >> +				inv_info->granularity);
> >> +			break;
> >> +		}
> >> +
> >> +		/* PASID is stored in different locations based on granularity
> >> */
> >> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> >> +			inv_info->pasid_info.flags &
> >> IOMMU_INV_PASID_FLAGS_PASID)
> >> +			pasid = inv_info->pasid_info.pasid;
> >> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR
> >> &&
> >> +			inv_info->addr_info.flags &
> >> IOMMU_INV_ADDR_FLAGS_PASID)
> >> +			pasid = inv_info->addr_info.pasid;
> >> +		else {
> >> +			pr_err("Cannot find PASID for given cache type and
> >> granularity\n");
> >> +			break;
> >> +		}
> >> +
> >> +		switch (BIT(cache_type)) {
> >> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> >> +			if ((inv_info->granularity !=
> >> IOMMU_INV_GRANU_PASID) &&
> >
> > granularity == IOMMU_INV_GRANU_ADDR? Otherwise it's unclear
> > why IOMMU_INV_GRANU_DOMAIN also needs the size check.
> >
> >> +				size && (inv_info->addr_info.addr &
> >> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> >> +				pr_err("Address out of range, 0x%llx, size
> >> order %llu\n",
> >> +					inv_info->addr_info.addr, size);
> >> +				ret = -ERANGE;
> >> +				goto out_unlock;
> >> +			}
> >> +
> >> +			qi_flush_piotlb(iommu, did,
> >> +					pasid,
> >> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> >> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> >> +					inv_info->addr_info.flags &
> >> IOMMU_INV_ADDR_FLAGS_LEAF);
> >> +
> >> +			/*
> >> +			 * Always flush device IOTLB if ATS is enabled since
> >> guest
> >> +			 * vIOMMU exposes CM = 1, no device IOTLB flush
> >> will be passed
> >> +			 * down.
> >> +			 */
> >
> > Does the VT-d spec mention that no device IOTLB flush is required when CM=1?
> >
> >> +			if (info->ats_enabled) {
> >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >> +						pasid, info->ats_qdep,
> >> +						inv_info->addr_info.addr,
> >> size,
> >> +						granu);
> >> +			}
> >> +			break;
> >> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> >> +			if (info->ats_enabled) {
> >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >> +						inv_info->addr_info.pasid,
> >> info->ats_qdep,
> >> +						inv_info->addr_info.addr,
> >> size,
> >> +						granu);
> >
> > I'm confused here. There are two granularities allowed for devtlb, but here
> > you only handle one of them?
> granu is the result of to_vtd_granularity() so it can take either of the
> 2 values.

Yes, you're right.

> 
> Thanks
> 
> Eric
> >
> >> +			} else
> >> +				pr_warn("Passdown device IOTLB flush w/o
> >> ATS!\n");
> >> +
> >> +			break;
> >> +		case IOMMU_CACHE_INV_TYPE_PASID:
> >> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> >> +
> >
> > As commented earlier, we shouldn't allow userspace or the guest to
> > invalidate the PASID cache.
> >
> >> +			break;
> >> +		default:
> >> +			dev_err(dev, "Unsupported IOMMU invalidation
> >> type %d\n",
> >> +				cache_type);
> >> +			ret = -EINVAL;
> >> +		}
> >> +	}
> >> +out_unlock:
> >> +	spin_unlock(&iommu->lock);
> >> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> >> +
> >> +	return ret;
> >> +}
> >> +#endif
> >> +
> >>  static int intel_iommu_map(struct iommu_domain *domain,
> >>  			   unsigned long iova, phys_addr_t hpa,
> >>  			   size_t size, int iommu_prot, gfp_t gfp)
> >> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> >>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >>  #ifdef CONFIG_INTEL_IOMMU_SVM
> >> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>  #endif
> >> --
> >> 2.7.4
> >

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-30 18:21       ` Jacob Pan
@ 2020-03-31  3:36         ` Tian, Kevin
  0 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2020-03-31  3:36 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu
  Cc: Yi L, Raj, Ashok, David Woodhouse, iommu, LKML, Alex Williamson,
	Jean-Philippe Brucker, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Tuesday, March 31, 2020 2:22 AM
> 
> On Sun, 29 Mar 2020 16:03:36 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> > On 2020/3/27 20:21, Tian, Kevin wrote:
> > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > >> Sent: Saturday, March 21, 2020 7:28 AM
> > >>
> > >> Nested translation mode is supported in VT-d 3.0 spec, ch. 3.8.
> > >
> > > now the spec is already at rev3.1 😊
> >
> > Updated.
> >
> > >
> > >> With PASID granular translation type set to 011b, translation
> > >> result from the first level(FL) also subject to a second level(SL)
> > >> page table translation. This mode is used for SVA virtualization,
> > >> where FL performs guest virtual to guest physical translation and
> > >> SL performs guest physical to host physical translation.
> > >>
> > >> This patch adds a helper function for setting up nested translation
> > >> where second level comes from a domain and first level comes from
> > >> a guest PGD.
> > >>
> > >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > >> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > >> ---
> > >>   drivers/iommu/intel-pasid.c | 240 +++++++++++++++++++++++++++++++++++++++++++-
> > >>   drivers/iommu/intel-pasid.h |  12 +++
> > >>   include/linux/intel-iommu.h |   3 +
> > >>   3 files changed, 252 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > >> index 9bdb7ee228b6..10c7856afc6b 100644
> > >> --- a/drivers/iommu/intel-pasid.c
> > >> +++ b/drivers/iommu/intel-pasid.c
> > >> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
> > >>  	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> > >>   }
> > >>
> > >> +/*
> > >> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Cache Disable (CD) field (Bit 89)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_cd(struct pasid_entry *pe)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emte(struct pasid_entry *pe)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_eafe(struct pasid_entry *pe)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pcd(struct pasid_entry *pe)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pwt(struct pasid_entry *pe)
> > >> +{
> > >> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> > >> +}
> > >> +
> > >>   static void
> > >>   pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> > >>   				    u16 did, int pasid)
> > >> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct
> > >> intel_iommu *iommu,
> > >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >>
> > >>   	/* Setup Present and PASID Granular Transfer Type: */
> > >> -	pasid_set_translation_type(pte, 1);
> > >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
> > >>   	pasid_set_present(pte);
> > >>   	pasid_flush_caches(iommu, pte, pasid, did);
> > >>
> > >> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
> > >> intel_iommu *iommu,
> > >>   	pasid_set_domain_id(pte, did);
> > >>   	pasid_set_slptr(pte, pgd_val);
> > >>   	pasid_set_address_width(pte, agaw);
> > >> -	pasid_set_translation_type(pte, 2);
> > >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
> > >>   	pasid_set_fault_enable(pte);
> > >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >>
> > >> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
> > >> intel_iommu *iommu,
> > >>   	pasid_clear_entry(pte);
> > >>   	pasid_set_domain_id(pte, did);
> > >>   	pasid_set_address_width(pte, iommu->agaw);
> > >> -	pasid_set_translation_type(pte, 4);
> > >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
> > >>   	pasid_set_fault_enable(pte);
> > >>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >>
> > >> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct
> > >> intel_iommu *iommu,
> > >>
> > >>   	return 0;
> > >>   }
> > >> +
> > >> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> > >> +				struct pasid_entry *pte,
> > >> +				struct iommu_gpasid_bind_data_vtd
> > >> *pasid_data)
> > >> +{
> > >> +	/*
> > >> +	 * Not all guest PASID table entry fields are passed down
> > >> during bind,
> > >> +	 * here we only set up the ones that are dependent on
> > >> guest settings.
> > >> +	 * Execution related bits such as NXE, SMEP are not
> > >> meaningful to IOMMU,
> > >> +	 * therefore not set. Other fields, such as snoop
> > >> related, are set based
> > >> +	 * on host needs regardless of guest settings.
> > >> +	 */
> > >> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> > >> +		if (!ecap_srs(iommu->ecap)) {
> > >> +			pr_err("No supervisor request support on
> > >> %s\n",
> > >> +			       iommu->name);
> > >> +			return -EINVAL;
> > >> +		}
> > >> +		pasid_set_sre(pte);
> > >> +	}
> > >> +
> > >> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> > >> +		if (!ecap_eafs(iommu->ecap)) {
> > >> +			pr_err("No extended access flag support
> > >> on %s\n",
> > >> +				iommu->name);
> > >> +			return -EINVAL;
> > >> +		}
> > >> +		pasid_set_eafe(pte);
> > >> +	}
> > >> +
> > >> +	/*
> > >> +	 * Memory type is only applicable to devices inside
> > >> processor coherent
> > >> +	 * domain. PCIe devices are not included. We can skip the
> > >> rest of the
> > >> +	 * flags if IOMMU does not support MTS.
> > >
> > > When you say that PCI devices are not included, is it simply for
> > > information, or should we impose some check to make sure the below
> > > path is not applied to them?
> >
> > Jacob, does it work for you if I add below check?
> >
> > 	if (ecap_mts(iommu->ecap) && !dev_is_pci(dev))
> >
> > Or, we need to remove this comment line?
> >
> > >
> > >> +	 */
> > >> +	if (ecap_mts(iommu->ecap)) {
> > >> +		if (pasid_data->flags &
> > >> IOMMU_SVA_VTD_GPASID_EMTE) {
> > >> +			pasid_set_emte(pte);
> > >> +			pasid_set_emt(pte, pasid_data->emt);
> > >> +		}
> > >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> > >> +			pasid_set_pcd(pte);
> > >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> > >> +			pasid_set_pwt(pte);
> > >> +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> > >> +			pasid_set_cd(pte);
> > >> +		pasid_set_pat(pte, pasid_data->pat);
> > >> +	} else if (pasid_data->flags &
> > >> IOMMU_SVA_VTD_GPASID_MTS_MASK) {
> > >> +		pr_err("No memory type support for bind guest
> > >> PASID on %s\n",
> > >> +			iommu->name);
> > >> +		return -EINVAL;
> > >> +	}
> > >> +
> > >> +	return 0;
> > >> +
> > >> +}
> > >> +
> > >> +/**
> > >> + * intel_pasid_setup_nested() - Set up PASID entry for nested
> > >> translation.
> > >> + * This could be used for guest shared virtual address. In this
> > >> case, the
> > >> + * first level page tables are used for GVA-GPA translation in
> > >> the guest,
> > >> + * second level page tables are used for GPA-HPA translation.
> > >
> > > GVA->GPA is just one example. It could be gIOVA->GPA too. Here the
> > > point is that the first level is the translation table managed by
> > > the guest.
> >
> > Agreed.
> >
> Yes, that is why I chose the word "could be" :), but mentioning both
> cases is good.

"could be" is fine. I read too fast.

> 
> > >
> > >> + *
> > >> + * @iommu:      IOMMU which the device belong to
> > >> + * @dev:        Device to be set up for translation
> > >> + * @gpgd:       FLPTPTR: First Level Page translation pointer in
> > >> GPA
> > >> + * @pasid:      PASID to be programmed in the device PASID table
> > >> + * @pasid_data: Additional PASID info from the guest bind request
> > >> + * @domain:     Domain info for setting up second level page
> > >> tables
> > >> + * @addr_width: Address width of the first level (guest)
> > >> + */
> > >> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > >> +			struct device *dev, pgd_t *gpgd,
> > >> +			int pasid, struct
> > >> iommu_gpasid_bind_data_vtd *pasid_data,
> > >> +			struct dmar_domain *domain,
> > >> +			int addr_width)
> > >> +{
> > >> +	struct pasid_entry *pte;
> > >> +	struct dma_pte *pgd;
> > >> +	int ret = 0;
> > >> +	u64 pgd_val;
> > >> +	int agaw;
> > >> +	u16 did;
> > >> +
> > >> +	if (!ecap_nest(iommu->ecap)) {
> > >> +		pr_err("IOMMU: %s: No nested translation
> > >> support\n",
> > >> +		       iommu->name);
> > >> +		return -EINVAL;
> > >> +	}
> > >> +
> > >> +	pte = intel_pasid_get_entry(dev, pasid);
> > >> +	if (WARN_ON(!pte))
> > >> +		return -EINVAL;
> > >
> > > should we have intel_pasid_get_entry() return an error which is then
> > > carried here? Looking at that function, there could be error
> > > conditions for both an invalid parameter and no memory...
> >
> > Agreed. Will do this in a followup patch.
> >
> > >
> > >> +
> > >> +	/*
> > >> +	 * Caller must ensure PASID entry is not in use, i.e. not
> > >> bind the
> > >> +	 * same PASID to the same device twice.
> > >> +	 */
> > >> +	if (pasid_pte_is_present(pte))
> > >> +		return -EBUSY;
> > >
> > > is any lock held outside of this function? I am curious whether any
> > > race condition may happen in between.
> >
> > The pasid entry change should always be protected by iommu->lock.
> >
> Agreed.

Yes, I realized it after completing the review of the whole series. 😊

> 
> > >
> > >> +
> > >> +	pasid_clear_entry(pte);
> > >> +
> > >> +	/* Sanity checking performed by caller to make sure
> > >> address
> > >> +	 * width matching in two dimensions:
> > >> +	 * 1. CPU vs. IOMMU
> > >> +	 * 2. Guest vs. Host.
> > >> +	 */
> > >> +	switch (addr_width) {
> > >> +	case ADDR_WIDTH_5LEVEL:
> > >> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> > >> +			cap_5lp_support(iommu->cap)) {
> > >> +			pasid_set_flpm(pte, 1);
> > >
> > > define a macro for 4lvl and 5lvl
> > >
> > >> +		} else {
> > >> +			dev_err(dev, "5-level paging not
> > >> supported\n");
> > >> +			return -EINVAL;
> > >> +		}
> > >> +		break;
> > >> +	case ADDR_WIDTH_4LEVEL:
> > >> +		pasid_set_flpm(pte, 0);
> > >> +		break;
> > >> +	default:
> > >> +		dev_err(dev, "Invalid guest address width %d\n",
> > >> addr_width);
> > >> +		return -EINVAL;
> > >> +	}
> > >> +
> > >> +	/* First level PGD is in GPA, must be supported by the
> > >> second level */
> > >> +	if ((u64)gpgd > domain->max_addr) {
> > >> +		dev_err(dev, "Guest PGD %llx not supported, max
> > >> %llx\n",
> > >> +			(u64)gpgd, domain->max_addr);
> > >> +		return -EINVAL;
> > >> +	}
> > >> +	pasid_set_flptr(pte, (u64)gpgd);
> > >> +
> > >> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> > >> +	if (ret) {
> > >> +		dev_err(dev, "Guest PASID bind data not
> > >> supported\n");
> > >> +		return ret;
> > >> +	}
> > >> +
> > >> +	/* Setup the second level based on the given domain */
> > >> +	pgd = domain->pgd;
> > >> +
> > >> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> > >> +	if (agaw < 0) {
> > >> +		dev_err(dev, "Invalid domain page table\n");
> > >> +		return -EINVAL;
> > >> +	}
> > >> +	pgd_val = virt_to_phys(pgd);
> > >> +	pasid_set_slptr(pte, pgd_val);
> > >> +	pasid_set_fault_enable(pte);
> > >> +
> > >> +	did = domain->iommu_did[iommu->seq_id];
> > >> +	pasid_set_domain_id(pte, did);
> > >> +
> > >> +	pasid_set_address_width(pte, agaw);
> > >> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >> +
> > >> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> > >> +	pasid_set_present(pte);
> > >> +	pasid_flush_caches(iommu, pte, pasid, did);
> > >> +
> > >> +	return ret;
> > >> +}
> > >> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> > >> index 92de6df24ccb..698015ee3f04 100644
> > >> --- a/drivers/iommu/intel-pasid.h
> > >> +++ b/drivers/iommu/intel-pasid.h
> > >> @@ -36,6 +36,7 @@
> > >>    * to vmalloc or even module mappings.
> > >>    */
> > >>   #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> > >> +#define PASID_FLAG_NESTED		BIT(1)
> > >>
> > >>   /*
> > >>    * The PASID_FLAG_FL5LP flag indicates using 5-level paging for first-
> > >> @@ -51,6 +52,11 @@ struct pasid_entry {
> > >>   	u64 val[8];
> > >>   };
> > >>
> > >> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> > >> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> > >> +#define PASID_ENTRY_PGTT_NESTED		(3)
> > >> +#define PASID_ENTRY_PGTT_PT		(4)
> > >> +
> > >>   /* The representative of a PASID table */
> > >>   struct pasid_table {
> > >>   	void			*table;		/* pasid table pointer */
> > >> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> > >>   int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> > >>   				   struct dmar_domain *domain,
> > >>   				   struct device *dev, int pasid);
> > >> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > >> +			struct device *dev, pgd_t *pgd,
> > >> +			int pasid,
> > >> +			struct iommu_gpasid_bind_data_vtd
> > >> *pasid_data,
> > >> +			struct dmar_domain *domain,
> > >> +			int addr_width);
> > >>   void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> > >>   				 struct device *dev, int pasid);
> > >>
> > >> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > >> index ed7171d2ae1f..eda1d6687144 100644
> > >> --- a/include/linux/intel-iommu.h
> > >> +++ b/include/linux/intel-iommu.h
> > >> @@ -42,6 +42,9 @@
> > >>   #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
> > >>   #define DMA_FL_PTE_XD		BIT_ULL(63)
> > >>
> > >> +#define ADDR_WIDTH_5LEVEL	(57)
> > >> +#define ADDR_WIDTH_4LEVEL	(48)
> > >> +
> > >>   #define CONTEXT_TT_MULTI_LEVEL	0
> > >>   #define CONTEXT_TT_DEV_IOTLB	1
> > >>   #define CONTEXT_TT_PASS_THROUGH 2
> > >> --
> > >> 2.7.4
> > >
> >
> > Best regards,
> > baolu
> 
> [Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-30 20:51     ` Jacob Pan
@ 2020-03-31  3:43       ` Tian, Kevin
  2020-04-01 17:13         ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-03-31  3:43 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Tuesday, March 31, 2020 4:52 AM
> 
> On Sat, 28 Mar 2020 08:02:01 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When supporting guest SVA with emulated IOMMU, the guest PASID
> > > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > > will result in PASID cache flush which will be passed down to
> > > the host as bind guest PASID calls.
> > >
> > > For the SL page tables, it will be harvested from device's
> > > default domain (request w/o PASID), or aux domain in case of
> > > mediated device.
> > >
> > >     .-------------.  .---------------------------.
> > >     |   vIOMMU    |  | Guest process CR3, FL only|
> > >     |             |  '---------------------------'
> > >     .----------------/
> > >     | PASID Entry |--- PASID cache flush -
> > >     '-------------'                       |
> > >     |             |                       V
> > >     |             |                CR3 in GPA
> > >     '-------------'
> > > Guest
> > > ------| Shadow |--------------------------|--------
> > >       v        v                          v
> > > Host
> > >     .-------------.  .----------------------.
> > >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> > >     |             |  '----------------------'
> > >     .----------------/  |
> > >     | PASID Entry |     V (Nested xlate)
> > >     '----------------\.------------------------------.
> > >     |             |   |SL for GPA-HPA, default domain|
> > >     |             |   '------------------------------'
> > >     '-------------'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > >
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > > ---
> > >  drivers/iommu/intel-iommu.c |   4 +
> > >  drivers/iommu/intel-svm.c   | 224 ++++++++++++++++++++++++++++++++++++
> > >  include/linux/intel-iommu.h |   8 +-
> > >  include/linux/intel-svm.h   |  17 ++++
> > >  4 files changed, 252 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > index e599b2537b1c..b1477cd423dd 100644
> > > --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> > >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> > >  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> > >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > > +#endif
> > >  };
> > >
> > >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index d7f2a5358900..47c0deb5ae56 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> > >  	list_for_each_entry((sdev), &(svm)->devs, list)	\
> > >  		if ((d) != (sdev)->dev) {} else
> > >
> > > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > > +			struct device *dev,
> > > +			struct iommu_gpasid_bind_data *data)
> > > +{
> > > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > > +	struct dmar_domain *ddomain;
> >
> > what about the full name, e.g. dmar_domain? Though a bit longer, it is
> > clearer than ddomain.
> >
> Sure, I don't have a preference.
> 
> > > +	struct intel_svm_dev *sdev;
> > > +	struct intel_svm *svm;
> > > +	int ret = 0;
> > > +
> > > +	if (WARN_ON(!iommu) || !data)
> > > +		return -EINVAL;
> > > +
> > > +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > > +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > > +		return -EINVAL;
> > > +
> > > +	if (dev_is_pci(dev)) {
> > > +		/* VT-d supports devices with full 20 bit PASIDs
> > > only */
> > > +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > > +			return -EINVAL;
> > > +	} else {
> > > +		return -ENOTSUPP;
> > > +	}
> > > +
> > > +	/*
> > > +	 * We only check host PASID range, we have no knowledge to
> > > check
> > > +	 * guest PASID range nor do we use the guest PASID.
> > > +	 */
> > > +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > > +		return -EINVAL;
> > > +
> > > +	ddomain = to_dmar_domain(domain);
> > > +
> > > +	/* Sanity check paging mode support match between host and
> > > guest */
> > > +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > > +	    !cap_5lp_support(iommu->cap)) {
> > > +		pr_err("Cannot support 5 level paging requested by
> > > guest!\n");
> > > +		return -EINVAL;
> > > +	}
> >
> > -ENOTSUPP?
> I was thinking that, from this API's p.o.v., the input is invalid, since
> both cap and addr_width are derived from input arguments.

ok, supposing userspace has already enumerated the capabilities before
making this call.

> 
> >
> > > +
> > > +	mutex_lock(&pasid_mutex);
> > > +	svm = ioasid_find(NULL, data->hpasid, NULL);
> > > +	if (IS_ERR(svm)) {
> > > +		ret = PTR_ERR(svm);
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (svm) {
> > > +		/*
> > > +		 * If we found svm for the PASID, there must be at
> > > +		 * least one device bound, otherwise svm should be
> > > freed.
> > > +		 */
> > > +		if (WARN_ON(list_empty(&svm->devs))) {
> > > +			ret = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		if (svm->mm == get_task_mm(current) &&
> > > +		    data->hpasid == svm->pasid &&
> > > +		    data->gpasid == svm->gpasid) {
> > > +			pr_warn("Cannot bind the same guest-host
> > > PASID for the same process\n");
> >
> > Sorry, I didn't get the rationale here. Isn't this branch for
> > binding the same PASID to multiple devices? In that case it is
> > definitely binding the same guest-host PASID for the same process.
> > Otherwise, if hpasid is different then you'll hit a different
> > intel_svm, while if gpasid is different, how can you use one intel_svm
> > to hold multiple gpasids?
> >
> > I feel the error condition should be the opposite, and I suppose
> > SVM_FLAG_GUEST_PASID should be verified before checking gpasid.
> >
> You are right, actually we don't need the check here. The
> scenario of multiple devices binding to the same PASID is checked in
> for_each_svm_dev().
> I will remove this code.
> 
> > > +			mmput(svm->mm);
> > > +			ret = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +		mmput(current->mm);
> > > +
> > > +		for_each_svm_dev(sdev, svm, dev) {
> > > +			/* In case of multiple sub-devices of the
> > > same pdev
> > > +			 * assigned, we should allow multiple bind
> > > calls with
> > > +			 * the same PASID and pdev.
> >
> > Does sub-device mean mdev? I didn't find such notation in current
> > iommu directory.
> >
> Yes, it is intended for mdev.
> > and to make it clearer, "In case of multiple mdevs of the same pdev
> > assigned to the same guest process".
> >
> I am avoiding "mdev" on purpose since it is not a concept in the iommu
> driver; "sub-device" is more generic.

ok, fine to me.

> 
> > > +			 */
> > > +			sdev->users++;
> > > +			goto out;
> > > +		}
> > > +	} else {
> > > +		/* We come here when the PASID has never been bound to a device. */
> > > +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > > +		if (!svm) {
> > > +			ret = -ENOMEM;
> > > +			goto out;
> > > +		}
> > > +		/* REVISIT: upper layer/VFIO can track host
> > > process that bind the PASID.
> > > +		 * ioasid_set = mm might be sufficient for vfio to
> > > check pasid VMM
> > > +		 * ownership.
> > > +		 */
> >
> > The above message is unclear about what should be revisited. Does it
> > describe the current implementation or an expected future revision?
> >
> What I meant was that if VFIO can check PASID-mm ownership by itself,
> then we don't have to store svm->mm here. Will drop the line below.
> I will add a comment to clarify.
> 
> > > +		svm->mm = get_task_mm(current);
> > > +		svm->pasid = data->hpasid;
> > > +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > > +			svm->gpasid = data->gpasid;
> > > +			svm->flags |= SVM_FLAG_GUEST_PASID;
> > > +		}
> > > +		ioasid_set_data(data->hpasid, svm);
> > > +		INIT_LIST_HEAD_RCU(&svm->devs);
> > > +		mmput(svm->mm);
> > > +	}
> > > +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > > +	if (!sdev) {
> > > +		if (list_empty(&svm->devs)) {
> > > +			ioasid_set_data(data->hpasid, NULL);
> > > +			kfree(svm);
> > > +		}
> > > +		ret = -ENOMEM;
> > > +		goto out;
> > > +	}
> > > +	sdev->dev = dev;
> > > +	sdev->users = 1;
> > > +
> > > +	/* Set up device context entry for PASID if not enabled
> > > already */
> > > +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > > +	if (ret) {
> > > +		dev_err(dev, "Failed to enable PASID
> > > capability\n");
> > > +		kfree(sdev);
> > > +		/*
> > > +		 * If this is a new PASID that was never bound to a device, then
> > > +		 * the device list must be empty which indicates
> > > struct svm
> > > +		 * was allocated in this function.
> > > +		 */
> >
> > The comment had better move to the 1st occurrence, where sdev allocation
> > fails; or even better, put it under the out label...
> >
> Sounds good.
> 
> > > +		if (list_empty(&svm->devs)) {
> > > +			ioasid_set_data(data->hpasid, NULL);
> > > +			kfree(svm);
> > > +		}
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * For guest bind, we need to set up PASID table entry as
> > > follows:
> > > +	 * - FLPM matches guest paging mode
> > > +	 * - turn on nested mode
> > > +	 * - SL guest address width matching
> > > +	 */
> >
> > It looks like the above just explains internal details of
> > intel_pasid_setup_nested, which don't need to be here.
> >
> Right, will remove the comments.
> 
> > > +	ret = intel_pasid_setup_nested(iommu,
> > > +				       dev,
> > > +				       (pgd_t *)data->gpgd,
> > > +				       data->hpasid,
> > > +				       &data->vtd,
> > > +				       ddomain,
> > > +				       data->addr_width);
> >
> > It's worth explaining here that setup_nested is required for
> > every device (even when they share the same intel_svm) because
> > we allocate a pasid table per device. Otherwise one might mistakenly
> > think that only the 1st device bound to a new hpasid requires this
> > step. 😊
> >
> Good suggestion, I will add the comments as:
> /*
>  * PASID table is per device for better security. Therefore, for
>  * each bind of a new device even with an existing PASID, we need to
>  * call the nested mode setup function here.
>  */
> 
> > > +	if (ret) {
> > > +		dev_err(dev, "Failed to set up PASID %llu in
> > > nested mode, Err %d\n",
> > > +			data->hpasid, ret);
> > > +		/*
> > > +		 * PASID entry should be in cleared state if
> > > nested mode
> > > +		 * set up failed. So we only need to clear IOASID
> > > tracking
> > > +		 * data such that free call will succeed.
> > > +		 */
> > > +		kfree(sdev);
> > > +		if (list_empty(&svm->devs)) {
> > > +			ioasid_set_data(data->hpasid, NULL);
> > > +			kfree(svm);
> > > +		}
> > > +		goto out;
> > > +	}
> > > +	svm->flags |= SVM_FLAG_GUEST_MODE;
> > > +
> > > +	init_rcu_head(&sdev->rcu);
> > > +	list_add_rcu(&sdev->list, &svm->devs);
> > > + out:
> > > +	mutex_unlock(&pasid_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > > +{
> > > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > > +	struct intel_svm_dev *sdev;
> > > +	struct intel_svm *svm;
> > > +	int ret = -EINVAL;
> > > +
> > > +	if (WARN_ON(!iommu))
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&pasid_mutex);
> > > +	svm = ioasid_find(NULL, pasid, NULL);
> > > +	if (!svm) {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (IS_ERR(svm)) {
> > > +		ret = PTR_ERR(svm);
> > > +		goto out;
> > > +	}
> > > +
> > > +	for_each_svm_dev(sdev, svm, dev) {
> > > +		ret = 0;
> > > +		sdev->users--;
> > > +		if (!sdev->users) {
> > > +			list_del_rcu(&sdev->list);
> > > +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> > > +			/* TODO: Drain in flight PRQ for the PASID
> > > since it
> > > +			 * may get reused soon, we don't want to
> > > +			 * confuse with its previous life.
> > > +			 * intel_svm_drain_prq(dev, pasid);
> > > +			 */
> > > +			kfree_rcu(sdev, rcu);
> > > +
> > > +			if (list_empty(&svm->devs)) {
> > > +				/*
> > > +				 * We do not free PASID here until
> > > explicit call
> > > +				 * from VFIO to free. The PASID
> > > life cycle
> > > +				 * management is largely tied to
> > > VFIO management
> > > +				 * of assigned device life cycles.
> > > In case of
> > > +				 * guest exit without a explicit
> > > free PASID call,
> > > +				 * the responsibility lies in VFIO
> > > layer to free
> > > +				 * the PASIDs allocated for the
> > > guest.
> > > +				 * For security reasons, VFIO has
> > > to track the
> > > +				 * PASID ownership per guest
> > > anyway to ensure
> > > +				 * that PASID allocated by one
> > > guest cannot be
> > > +				 * used by another.
> >
> > As commented in other patches, VFIO is only one example user of this
> > API...
> >
> Right, how about this:
> 	/*
> 	 * We do not free the IOASID here because the
> 	 * IOMMU driver did not allocate it.
> 	 * Unlike native SVM, the IOASID for guest use was
> 	 * allocated prior to the bind call.
> 	 * In any case, if the free call comes before
> 	 * the unbind, the IOMMU driver will get notified
> 	 * and perform cleanup.
> 	 */

looks good.

> 
> > > +				 */
> > > +				ioasid_set_data(pasid, NULL);
> > > +				kfree(svm);
> > > +			}
> > > +		}
> > > +		break;
> > > +	}
> >
> > What about no dev match? An -EINVAL is also required then.
> >
> Yes, ret is initialized as -EINVAL
> 
> > > +out:
> > > +	mutex_unlock(&pasid_mutex);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > > struct svm_dev_ops *ops)
> > >  {
> > >  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > > index eda1d6687144..85b05120940e 100644
> > > --- a/include/linux/intel-iommu.h
> > > +++ b/include/linux/intel-iommu.h
> > > @@ -681,7 +681,9 @@ struct dmar_domain *find_domain(struct device
> > > *dev);
> > >  extern void intel_svm_check(struct intel_iommu *iommu);
> > >  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> > >  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > > -
> > > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > > +		struct device *dev, struct iommu_gpasid_bind_data *data);
> > > +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> > >  struct svm_dev_ops;
> > >
> > >  struct intel_svm_dev {
> > > @@ -698,9 +700,13 @@ struct intel_svm_dev {
> > >  struct intel_svm {
> > >  	struct mmu_notifier notifier;
> > >  	struct mm_struct *mm;
> > > +
> > >  	struct intel_iommu *iommu;
> > >  	int flags;
> > >  	int pasid;
> > > +	int gpasid; /* Guest PASID in case of vSVA bind with
> > > non-identity host
> > > +		     * to guest PASID mapping.
> > > +		     */
> >
> > We don't need to highlight the identity vs. non-identity distinction,
> > since either way shares the same infrastructure here, and it is not
> > knowledge that the kernel driver should assume.
> >
> Sorry, I don't get your point.
> 
> What I meant was that this field "gpasid" is only used in the non-identity
> case. In the identity case, we don't have SVM_FLAG_GUEST_PASID.

What's the problem if a guest tries to set gpasid even in the identity
case? Do you want to add a check to reject it? Also, I remember we
discussed before that we want to provide a consistent interface
to other consumers, e.g. KVM, to set up the VMCS PASID translation table.
In that case, regardless of identity or non-identity, we need to provide
such mapping info.

> 
> > >  	struct list_head devs;
> > >  	struct list_head list;
> > >  };
> > > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > > index d7c403d0dd27..c19690937540 100644
> > > --- a/include/linux/intel-svm.h
> > > +++ b/include/linux/intel-svm.h
> > > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> > >   * do such IOTLB flushes automatically.
> > >   */
> > >  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> > > +/*
> > > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds to a device.
> > > + * In this case the mm_struct is in the guest kernel or userspace,
> > > its life
> > > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > > API provides
> > > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > > device.
> > > + */
> > > +#define SVM_FLAG_GUEST_MODE	(1<<2)
> > > +/*
> > > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > > PASID space,
> > > + * which requires guest and host PASID translation at both
> > > directions. We keep
> > > + * track of guest PASID in order to provide lookup service to
> > > device drivers.
> > > + * One such example is a physical function (PF) driver that
> > > supports mediated
> > > + * device (mdev) assignment. Guest programming of mdev
> > > configuration space can
> > > + * only be done with guest PASID, therefore PF driver needs to
> > > find the matching
> > > + * host PASID to program the real hardware.
> > > + */
> > > +#define SVM_FLAG_GUEST_PASID	(1<<3)
> > >
> > >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > >
> > > --
> > > 2.7.4
> >
> 
> [Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types
  2020-03-30 23:28     ` Jacob Pan
@ 2020-03-31 16:13       ` Jacob Pan
  2020-03-31 16:15         ` Auger Eric
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 16:13 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Mon, 30 Mar 2020 16:28:34 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Fri, 27 Mar 2020 15:46:23 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
> > Hi Jacob,
> > 
> > On 3/21/20 12:27 AM, Jacob Pan wrote:  
> > > When Shared Virtual Memory is exposed to a guest via vIOMMU,
> > > scalable IOTLB invalidation may be passed down from outside IOMMU
> > > subsystems. This patch adds invalidation functions that can be
> > > used for additional translation cache types.
> > > 
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > 
> > > ---
> > > v9 -> v10:
> > > Fix off by 1 in pasid device iotlb flush
> > > 
> > > Address v7 missed review from Eric
> > > 
> > > ---
> > > ---
> > >  drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
> > >  drivers/iommu/intel-pasid.c |  3 ++-
> > >  include/linux/intel-iommu.h | 20 ++++++++++++++++----
> > >  3 files changed, 54 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > > index f77dae7ba7d4..4d6b7b5b37ee 100644
> > > --- a/drivers/iommu/dmar.c
> > > +++ b/drivers/iommu/dmar.c
> > > @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
> > >  	qi_submit_sync(&desc, iommu);
> > >  }
> > >  
> > > +/* PASID-based device IOTLB Invalidate */
> > > +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> > > +		u32 pasid,  u16 qdep, u64 addr, unsigned size_order, u64 granu)
> > > +{
> > > +	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
> > > +	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
> > > +
> > > +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
> > > QI_DEV_EIOTLB_SID(sid) |
> > > +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> > > +		QI_DEV_IOTLB_PFSID(pfsid);
> > > +	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
> > > +
> > > +	/*
> > > +	 * If S bit is 0, we only flush a single page. If S bit
> > > is set,
> > > +	 * The least significant zero bit indicates the
> > > invalidation address
> > > +	 * range. VT-d spec 6.5.2.6.
> > > +	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates
> > > 16KB.
> > > +	 * size order = 0 is PAGE_SIZE 4KB
> > > +	 * Max Invs Pending (MIP) is set to 0 for now until we
> > > have DIT in
> > > +	 * ECAP.
> > > +	 */
> > > +	desc.qw1 |= addr & ~mask;
> > > +	if (size_order)
> > > +		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> > > +
> > > +	qi_submit_sync(&desc, iommu);
> > > +}
> > > +
> > > +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> > > +{
> > > +	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
> > > +
> > > +	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
> > > QI_PC_GRAN(granu) | QI_PC_TYPE;
> > > +	qi_submit_sync(&desc, iommu);
> > > +}
> > > +
> > >  /*
> > >   * Disable Queued Invalidation interface.
> > >   */
> > > diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > > index 10c7856afc6b..9f6d07410722 100644
> > > --- a/drivers/iommu/intel-pasid.c
> > > +++ b/drivers/iommu/intel-pasid.c
> > > @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> > >  {
> > >  	struct qi_desc desc;
> > >  
> > > -	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL |
> > > QI_PC_PASID(pasid);
> > > +	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
> > > +		QI_PC_PASID(pasid) | QI_PC_TYPE;    
> > Just a nit, this fix is not documented in the commit message.
> >   
> Thanks, I just sent out this fix separately. Will remove this from the
> set.
> https://lkml.org/lkml/2020/3/30/1065
> 
I just realized this is not a fix, since I redefined the macros below such
that I could use them for granularity lookup without QI_PC_TYPE.

 -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
 -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
 +/* PASID cache invalidation granu */
 +#define QI_PC_ALL_PASIDS	0
 +#define QI_PC_PASID_SEL		1

Thanks,

Jacob

> > Besides
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > 
> > Thanks
> > 
> > Eric
> >   
> > >  	desc.qw1 = 0;
> > >  	desc.qw2 = 0;
> > >  	desc.qw3 = 0;
> > > diff --git a/include/linux/intel-iommu.h
> > > b/include/linux/intel-iommu.h index 85b05120940e..43539713b3b3
> > > 100644 --- a/include/linux/intel-iommu.h
> > > +++ b/include/linux/intel-iommu.h
> > > @@ -334,7 +334,7 @@ enum {
> > >  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
> > >  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
> > >  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
> > > -#define QI_IOTLB_AM(am)		(((u8)am))
> > > +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
> > >  #define QI_CC_FM(fm)		(((u64)fm) << 48)
> > >  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> > > @@ -353,16 +353,21 @@ enum {
> > >  #define QI_PC_DID(did)		(((u64)did) << 16)
> > >  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
> > >  
> > > -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> > > -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> > > +/* PASID cache invalidation granu */
> > > +#define QI_PC_ALL_PASIDS	0
> > > +#define QI_PC_PASID_SEL		1
> > >  
> > >  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> > >  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> > > -#define QI_EIOTLB_AM(am)	(((u64)am))
> > > +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
> > >  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
> > >  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
> > >  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
> > >  
> > > +/* QI Dev-IOTLB inv granu */
> > > +#define QI_DEV_IOTLB_GRAN_ALL		1
> > > +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> > > +
> > >  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
> > >  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
> > >  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> > > @@ -662,8 +667,15 @@ extern void qi_flush_iotlb(struct intel_iommu
> > > *iommu, u16 did, u64 addr, unsigned int size_order, u64 type);
> > >  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16
> > > sid, u16 pfsid, u16 qdep, u64 addr, unsigned mask);
> > > +
> > >  void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32
> > > pasid, u64 addr, unsigned long npages, bool ih);
> > > +
> > > +extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> > > +			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
> > > +
> > > +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
> > > +
> > >  extern int qi_submit_sync(struct qi_desc *desc, struct
> > > intel_iommu *iommu); 
> > >  extern int dmar_ir_support(void);
> > >     
> >   
> 
> [Jacob Pan]

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types
  2020-03-31 16:13       ` Jacob Pan
@ 2020-03-31 16:15         ` Auger Eric
  0 siblings, 0 replies; 67+ messages in thread
From: Auger Eric @ 2020-03-31 16:15 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

Hi Jacob,

On 3/31/20 6:13 PM, Jacob Pan wrote:
> On Mon, 30 Mar 2020 16:28:34 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
>> On Fri, 27 Mar 2020 15:46:23 +0100
>> Auger Eric <eric.auger@redhat.com> wrote:
>>
>>> Hi Jacob,
>>>
>>> On 3/21/20 12:27 AM, Jacob Pan wrote:  
>>>> When Shared Virtual Memory is exposed to a guest via vIOMMU,
>>>> scalable IOTLB invalidation may be passed down from outside IOMMU
>>>> subsystems. This patch adds invalidation functions that can be
>>>> used for additional translation cache types.
>>>>
>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>
>>>> ---
>>>> v9 -> v10:
>>>> Fix off by 1 in pasid device iotlb flush
>>>>
>>>> Address Eric's review comments missed in v7
>>>>
>>>> ---
>>>> ---
>>>>  drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
>>>>  drivers/iommu/intel-pasid.c |  3 ++-
>>>>  include/linux/intel-iommu.h | 20 ++++++++++++++++----
>>>>  3 files changed, 54 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
>>>> index f77dae7ba7d4..4d6b7b5b37ee 100644
>>>> --- a/drivers/iommu/dmar.c
>>>> +++ b/drivers/iommu/dmar.c
>>>> @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>>>>  	qi_submit_sync(&desc, iommu);
>>>>  }
>>>>  
>>>> +/* PASID-based device IOTLB Invalidate */
>>>> +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>>>> +		u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu)
>>>> +{
>>>> +	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
>>>> +	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
>>>> +
>>>> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
>>>> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
>>>> +		QI_DEV_IOTLB_PFSID(pfsid);
>>>> +	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
>>>> +
>>>> +	/*
>>>> +	 * If S bit is 0, we only flush a single page. If S bit is set,
>>>> +	 * the least significant zero bit indicates the invalidation address
>>>> +	 * range. VT-d spec 6.5.2.6.
>>>> +	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
>>>> +	 * size order = 0 is PAGE_SIZE 4KB
>>>> +	 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
>>>> +	 * ECAP.
>>>> +	 */
>>>> +	desc.qw1 |= addr & ~mask;
>>>> +	if (size_order)
>>>> +		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
>>>> +
>>>> +	qi_submit_sync(&desc, iommu);
>>>> +}
>>>> +
>>>> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
>>>> +{
>>>> +	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
>>>> +
>>>> +	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
>>>> +	qi_submit_sync(&desc, iommu);
>>>> +}
>>>> +
>>>>  /*
>>>>   * Disable Queued Invalidation interface.
>>>>   */
>>>> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
>>>> index 10c7856afc6b..9f6d07410722 100644
>>>> --- a/drivers/iommu/intel-pasid.c
>>>> +++ b/drivers/iommu/intel-pasid.c
>>>> @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>>>>  {
>>>>  	struct qi_desc desc;
>>>> 
>>>> -	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
>>>> +	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
>>>> +		QI_PC_PASID(pasid) | QI_PC_TYPE;
>>> Just a nit, this fix is not documented in the commit message.
>>>   
>> Thanks, I just sent out this fix separately. Will remove this from the
>> set.
>> https://lkml.org/lkml/2020/3/30/1065
>>
> I just realized this is not a fix, since I redefined the macros below
> so that I can use them for granularity lookup without the QI_PC_TYPE.
> 
>  -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
>  -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
>  +/* PASID cache invalidation granu */
>  +#define QI_PC_ALL_PASIDS	0
>  +#define QI_PC_PASID_SEL		1
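> 
> With that, callers compose the type bit explicitly, e.g. in
> qi_flush_pasid_cache() from patch 07/11:
> 
> 	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
> 		   QI_PC_GRAN(granu) | QI_PC_TYPE;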
Wow, OK. So it improves code readability/consistency at least ;-)

Thanks

Eric
> 
> Thanks,
> 
> Jacob
> 
>>> Besides
>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>>
>>> Thanks
>>>
>>> Eric
>>>   
>>>>  	desc.qw1 = 0;
>>>>  	desc.qw2 = 0;
>>>>  	desc.qw3 = 0;
>>>> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
>>>> index 85b05120940e..43539713b3b3 100644
>>>> --- a/include/linux/intel-iommu.h
>>>> +++ b/include/linux/intel-iommu.h
>>>> @@ -334,7 +334,7 @@ enum {
>>>>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
>>>>  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
>>>>  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
>>>> -#define QI_IOTLB_AM(am)		(((u8)am))
>>>> +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
>>>> 
>>>>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>>>>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
>>>> @@ -353,16 +353,21 @@ enum {
>>>>  #define QI_PC_DID(did)		(((u64)did) << 16)
>>>>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>>>> 
>>>> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
>>>> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
>>>> +/* PASID cache invalidation granu */
>>>> +#define QI_PC_ALL_PASIDS	0
>>>> +#define QI_PC_PASID_SEL		1
>>>> 
>>>>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>>>>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
>>>> -#define QI_EIOTLB_AM(am)	(((u64)am))
>>>> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>>>>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>>>>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>>>>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>>>> 
>>>> +/* QI Dev-IOTLB inv granu */
>>>> +#define QI_DEV_IOTLB_GRAN_ALL		1
>>>> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
>>>> +
>>>>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>>>>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>>>>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
>>>> @@ -662,8 +667,15 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>>>>  			   unsigned int size_order, u64 type);
>>>>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>>>>  			       u16 qdep, u64 addr, unsigned mask);
>>>> +
>>>>  void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>>>>  		     unsigned long npages, bool ih);
>>>> +
>>>> +extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>>>> +			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
>>>> +
>>>> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
>>>> +
>>>>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>>>> 
>>>>  extern int dmar_ir_support(void);
>>>>     
>>>   
>>
>> [Jacob Pan]
> 
> [Jacob Pan]
> 


* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-28 10:01   ` Tian, Kevin
  2020-03-29 15:34     ` Auger Eric
  2020-03-29 16:05     ` Auger Eric
@ 2020-03-31 18:13     ` Jacob Pan
  2020-04-01  6:24       ` Tian, Kevin
  2 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 18:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Sat, 28 Mar 2020 10:01:42 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> > 
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue  
> 
> emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> interface as well.
> 
True, but it does not invalidate this statement about emulated IOMMU. I
will add another statement saying "the same interface can be used for
virtio-IOMMU as well". OK?

> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> > 
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> > 
> > ---
> > v7 review fixed in v10
> > ---
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 182 insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index b1477cd423dd..a76afb0fd51a 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >  }
> > 
> > +/*
> > + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> > + * VT-d granularity. Invalidation is typically included in the unmap operation
> > + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > + * owns the first level page tables. Invalidations of translation caches in the
> > + * guest are trapped and passed down to the host.
> > + *
> > + * vIOMMU in the guest will only expose first level page tables, therefore
> > + * we do not include IOTLB granularity for request without PASID (second level).
> 
> I would revise above as "We do not support IOTLB granularity for
> request without PASID (second level), therefore any vIOMMU
> implementation that exposes the SVA capability to the guest should
> only expose the first level page tables, implying all invalidation
> requests from the guest will include a valid PASID"
> 
Sounds good.

> > + *
> > + * For example, to find the VT-d granularity encoding for IOTLB
> > + * type and page selective granularity within PASID:
> > + * X: indexed by iommu cache type
> > + * Y: indexed by enum iommu_inv_granularity
> > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > + *
> > + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> > + *
> > + */
> > +const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > +	/*
> > +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > +	 * page selective (address granularity)
> > +	 */
> > +	{0, 1, 1},
> > +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > +	{1, 1, 0},
> 
> Is this combination correct? when single PASID is being specified, it
> is essentially a page-selective invalidation since you need provide
> Address and Size. 
> 
This is for translation from the generic UAPI granu to the VT-d granu;
it has nothing to do with address and size.
e.g. if the user passes IOMMU_INV_GRANU_PASID for the single-PASID case
you mentioned, this map table shows it is valid.

Then the lookup result will get VT-d granu:
QI_DEV_IOTLB_GRAN_PASID_SEL, which means G=0.
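
To make the two-step lookup concrete, here is an abbreviated sketch
(cache_type is written as the raw bit index that for_each_set_bit()
hands to to_vtd_granularity(); values are taken from the two arrays in
this patch):

	int granu = 0;

	/* row 1 = PASID based dev TLB, column = IOMMU_INV_GRANU_PASID */
	if (!to_vtd_granularity(1, IOMMU_INV_GRANU_PASID, &granu)) {
		/* inv_type_granu_map[1][1] == 1, so the combination is valid;
		 * granu == inv_type_granu_table[1][1]
		 *       == QI_DEV_IOTLB_GRAN_PASID_SEL == 0, i.e. G=0.
		 */
	}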


> > +	/* PASID cache */  
> 
> PASID cache is fully managed by the host. Guest PASID cache
> invalidation is interpreted by vIOMMU for bind and unbind operations.
> I don't think we should accept any PASID cache invalidation from
> userspace or guest.
> 

True for vIOMMU; this is here for completeness. It can be used by
virtio-iommu, and since a PASID cache flush is inclusive (IOTLB and
devTLB), it is more efficient.

> > +	{1, 1, 0}
> > +};
> > +
> > +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > +	/* PASID based IOTLB */
> > +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > +	/* PASID based dev TLBs */
> > +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > +	/* PASID cache */
> > +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > +};
> > +
> > +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> > +{
> > +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> > +		!inv_type_granu_map[type][granu])
> > +		return -EINVAL;
> > +
> > +	*vtd_granu = inv_type_granu_table[type][granu];
> > +
> 
> btw do we really need both map and table here? Can't we just
> use one table with unsupported granularity marked as a special
> value?
> 
Yes, we need both: a map value of 1 marks the entry valid even when the
table value is 0, e.g. a granu that encodes as G=0 is still valid.

> > +	return 0;
> > +}
> > +
> > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > +{
> > +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> > +
> > +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> > +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> > +	 * granu size in contiguous memory.
> > +	 */
> > +	return order_base_2(nr_pages);
> > +}
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> > +{
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct intel_iommu *iommu;
> > +	unsigned long flags;
> > +	int cache_type;
> > +	u8 bus, devfn;
> > +	u16 did, sid;
> > +	int ret = 0;
> > +	u64 size = 0;
> > +
> > +	if (!inv_info || !dmar_domain ||
> > +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > +		return -EINVAL;
> > +
> > +	if (!dev || !dev_is_pci(dev))
> > +		return -ENODEV;
> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +
> > +	spin_lock_irqsave(&device_domain_lock, flags);
> > +	spin_lock(&iommu->lock);
> > +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> > +	if (!info) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> 
> -ENOTSUPP?
> 
I guess it could go either way, since the error stems from invalid
inputs.

> > +	}
> > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > +	sid = PCI_DEVID(bus, devfn);
> > +
> > +	/* Size is only valid in non-PASID selective invalidation */
> > +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > +		size = to_vtd_size(inv_info->addr_info.granule_size,
> > +				   inv_info->addr_info.nb_granules);
> > +
> > +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
> > +			 IOMMU_CACHE_INV_TYPE_NR) {
> > +		int granu = 0;
> > +		u64 pasid = 0;
> > +
> > +		ret = to_vtd_granularity(cache_type, inv_info->granularity,
> > +					 &granu);
> > +		if (ret) {
> > +			pr_err("Invalid cache type and granu combination %d/%d\n",
> > +				cache_type, inv_info->granularity);
> > +			break;
> > +		}
> > +
> > +		/* PASID is stored in different locations based on granularity */
> > +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> > +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> > +			pasid = inv_info->pasid_info.pasid;
> > +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> > +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> > +			pasid = inv_info->addr_info.pasid;
> > +		else {
> > +			pr_err("Cannot find PASID for given cache type and granularity\n");
> > +			break;
> > +		}
> > +
> > +		switch (BIT(cache_type)) {
> > +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > +			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&
> 
> granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> why IOMMU_INV_GRANU_DOMAIN also needs size check.
> 
Good point! will fix.

> > +				size && (inv_info->addr_info.addr &
> > +				((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> > +					inv_info->addr_info.addr, size);
> > +				ret = -ERANGE;
> > +				goto out_unlock;
> > +			}
> > +
> > +			qi_flush_piotlb(iommu, did, pasid,
> > +					mm_to_dma_pfn(inv_info->addr_info.addr),
> > +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> > +
> > +			/*
> > +			 * Always flush device IOTLB if ATS is enabled since guest
> > +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> > +			 * down.
> > +			 */
> 
> Does VT-d spec mention that no device IOTLB flush is required when
> CM=1?
> 
Not explicitly; we are just following the guideline in CH6.1 for
efficient virtualization. Early on, we also had discussions about
supporting virtio, where the IOTLB flush is inclusive.
Let me rephrase the comment:
/*
 * Always flush device IOTLB if ATS is enabled. vIOMMU
 * in the guest may assume IOTLB flush is inclusive,
 * which is more efficient.
 */


> > +			if (info->ats_enabled) {
> > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > +						pasid, info->ats_qdep,
> > +						inv_info->addr_info.addr, size,
> > +						granu);
> > +			}
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > +			if (info->ats_enabled) {
> > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > +						inv_info->addr_info.pasid, info->ats_qdep,
> > +						inv_info->addr_info.addr, size,
> > +						granu);
> 
> I'm confused here. There are two granularities allowed for devtlb,
> but here you only handle one of them?
> 
granu is passed into the flush function; it can be either 1 or 0, so
both granularities are handled.

> > +			} else
> > +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> > +
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_PASID:
> > +			qi_flush_pasid_cache(iommu, did, granu,
> > +					     inv_info->pasid_info.pasid);
> > +
> 
> as earlier comment, we shouldn't allow userspace or guest to
> invalidate PASID cache
> 
same explanation :)

> > +			break;
> > +		default:
> > +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> > +				cache_type);
> > +			ret = -EINVAL;
> > +		}
> > +	}
> > +out_unlock:
> > +	spin_unlock(&iommu->lock);
> > +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > +
> > +	return ret;
> > +}
> > +#endif
> > +
> >  static int intel_iommu_map(struct iommu_domain *domain,
> >  			   unsigned long iova, phys_addr_t hpa,
> >  			   size_t size, int iommu_prot, gfp_t gfp)
> > @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >  #endif
> > --
> > 2.7.4
> 

[Jacob Pan]

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-31  2:49       ` Tian, Kevin
@ 2020-03-31 20:58         ` Jacob Pan
  2020-04-01  6:29           ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 20:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, LKML, iommu,
	David Woodhouse, Jonathan Cameron

On Tue, 31 Mar 2020 02:49:21 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Auger Eric <eric.auger@redhat.com>
> > Sent: Sunday, March 29, 2020 11:34 PM
> > 
> > Hi,
> > 
> > On 3/28/20 11:01 AM, Tian, Kevin wrote:  
> > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > >> Sent: Saturday, March 21, 2020 7:28 AM
> > >>
> > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > >> and  
> > driver  
> > >> level. This patch adds Intel VT-d specific function to implement
> > >> iommu passdown invalidate API for shared virtual address.
> > >>
> > >> The use case is for supporting caching structure invalidation
> > >> of assigned SVM capable devices. Emulated IOMMU exposes queue  
>  [...]  
>  [...]  
> > to  
> > >> + * VT-d granularity. Invalidation is typically included in the
> > >> unmap  
> > operation  
> > >> + * as a result of DMA or VFIO unmap. However, for assigned
> > >> devices  
> > guest  
> > >> + * owns the first level page tables. Invalidations of
> > >> translation caches in  
> > the  
>  [...]  
>  [...]  
>  [...]  
> > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > >> +	/*
> > >> +	 * PASID based IOTLB invalidation: PASID selective (per
> > >> PASID),
> > >> +	 * page selective (address granularity)
> > >> +	 */
> > >> +	{0, 1, 1},
> > >> +	/* PASID based dev TLBs, only support all PASIDs or
> > >> single PASID */
> > >> +	{1, 1, 0},  
> > >
> > > Is this combination correct? when single PASID is being
> > > specified, it is essentially a page-selective invalidation since
> > > you need provide Address and Size.  
> > Isn't it the same when G=1? Still the addr/size is used. Doesn't
> > it  
> 
> I thought addr/size is not used when G=1, but it might be wrong. I'm
> checking with our vt-d spec owner.
> 

> > correspond to IOMMU_INV_GRANU_ADDR with
> > IOMMU_INV_ADDR_FLAGS_PASID flag
> > unset?
> > 
> > so {0, 0, 1}?  
> 
I am not sure I got your logic. The three fields correspond to 
	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */

For devTLB, we treat the domain slot as global since devTLB has no
domain-selective invalidation. That is how I came up with {1, 1, 0},
which means we can have global and PASID granu invalidation for
PASID-based devTLB.

If the caller also provides addr and the S bit, the flush routine will
put them into the QI descriptor. I know this is a little odd, but from
the granu translation p.o.v., the VT-d spec has no G bit for
page-selective invalidation.
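
In other words, here is a condensed view of what
qi_flush_dev_iotlb_pasid() in patch 07/11 does with these inputs (see
that patch for the full function):

	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);	/* G=1: all PASIDs, G=0: single PASID */
	desc.qw1 |= addr & ~mask;		/* address is encoded either way */
	if (size_order)
		desc.qw1 |= QI_DEV_EIOTLB_SIZE;	/* S bit selects a range larger than 4KB */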

> I have one more open:
> 
> How does userspace know which invalidation type/gran is supported?
> I didn't see such capability reporting in Yi's VFIO vSVA patch set.
> Do we want the user/kernel to assume the same capability set if they are
> architectural? However the kernel could also do some optimization
> e.g. hide devtlb invalidation capability given that the kernel
> already invalidate devtlb automatically when serving iotlb
> invalidation...
> 
In general, we are trending toward using the VFIO capability chain to
expose IOMMU capabilities.

But for architectural features such as type/granu, we have to assume
the same capability between host & guest. Granu and types are not
enumerated on the host IOMMU either.

For the devTLB optimization, I agree we need to expose a capability to
the guest stating that implicit devTLB invalidation is supported.
Otherwise, a Linux guest running on other OSes/VMMs may find that they
do not support implicit devTLB invalidation.

Right Yi?

> Thanks
> Kevin
> 
> > 
> > Thanks
> > 
> > Eric
> >   
> > >  
> > >> +	/* PASID cache */  
> > >
> > > PASID cache is fully managed by the host. Guest PASID cache
> > > invalidation is interpreted by vIOMMU for bind and unbind
> > > operations. I don't think we should accept any PASID cache
> > > invalidation from userspace or guest. 
>  [...]  
> > inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU  
>  [...]  
> > >
> > > btw do we really need both map and table here? Can't we just
> > > use one table with unsupported granularity marked as a special
> > > value?
> > >  
>  [...]  
> > >
> > > -ENOTSUPP?
> > >  
>  [...]  
> > >
> > > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> > >  
>  [...]  
> > >>> addr_info.addr),  
>  [...]  
>  [...]  
> > >> +			if (info->ats_enabled) {
> > >> +				qi_flush_dev_iotlb_pasid(iommu,
> > >> sid, info-  
> > >>> pfsid,  
>  [...]  
> > >>> pfsid,  
> > >> +
> > >> inv_info->addr_info.pasid, info->ats_qdep,
> > >> +
> > >> inv_info->addr_info.addr, size,
> > >> +						granu);  
>  [...]  
>  [...]  
> > >>> pasid_info.pasid);  
> > >> +  
> > >
> > > as earlier comment, we shouldn't allow userspace or guest to
> > > invalidate PASID cache
> > >  
>  [...]  
> > >  
> 

[Jacob Pan]

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-31  3:34       ` Tian, Kevin
@ 2020-03-31 21:07         ` Jacob Pan
  2020-04-01  6:32           ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 21:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, LKML, iommu,
	David Woodhouse, Jonathan Cameron

On Tue, 31 Mar 2020 03:34:22 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Auger Eric <eric.auger@redhat.com>
> > Sent: Monday, March 30, 2020 12:05 AM
> > 
> > On 3/28/20 11:01 AM, Tian, Kevin wrote:  
> > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > >> Sent: Saturday, March 21, 2020 7:28 AM
> > >>
> > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > >> and  
> > driver  
> > >> level. This patch adds Intel VT-d specific function to implement
> > >> iommu passdown invalidate API for shared virtual address.
> > >>
> > >> The use case is for supporting caching structure invalidation
> > >> of assigned SVM capable devices. Emulated IOMMU exposes queue  
> > >
> > > emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> > > interface as well.
> > >  
> > >> invalidation capability and passes down all descriptors from the
> > >> guest to the physical IOMMU.
> > >>
> > >> The assumption is that guest to host device ID mapping should be
> > >> resolved prior to calling IOMMU driver. Based on the device
> > >> handle, host IOMMU driver can replace certain fields before
> > >> submit to the invalidation queue.
> > >>
> > >> ---
> > >> v7 review fixed in v10
> > >> ---
> > >>
> > >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > >> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > >> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > >> ---
> > >>  drivers/iommu/intel-iommu.c | 182
> > >> ++++++++++++++++++++++++++++++++++++++++++++
> > >>  1 file changed, 182 insertions(+)
> > >>
> > >> diff --git a/drivers/iommu/intel-iommu.c
> > >> b/drivers/iommu/intel-iommu.c index b1477cd423dd..a76afb0fd51a
> > >> 100644 --- a/drivers/iommu/intel-iommu.c
> > >> +++ b/drivers/iommu/intel-iommu.c
> > >> @@ -5619,6 +5619,187 @@ static void
> > >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > >>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > >>  }
> > >>
> > >> +/*
> > >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> > >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> > >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > >> + * owns the first level page tables. Invalidations of translation caches in the
> > >> + * guest are trapped and passed down to the host.
> > >> + *
> > >> + * vIOMMU in the guest will only expose first level page
> > >> tables, therefore
> > >> + * we do not include IOTLB granularity for request without
> > >> PASID (second level).  
> > >
> > > I would revise above as "We do not support IOTLB granularity for
> > > request without PASID (second level), therefore any vIOMMU
> > > implementation that exposes the SVA capability to the guest
> > > should only expose the first level page tables, implying all
> > > invalidation requests from the guest will include a valid PASID"
> > >  
> > >> + *
> > >> + * For example, to find the VT-d granularity encoding for IOTLB
> > >> + * type and page selective granularity within PASID:
> > >> + * X: indexed by iommu cache type
> > >> + * Y: indexed by enum iommu_inv_granularity
> > >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > >> + *
> > >> + * Granu_map array indicates validity of the table. 1: valid,
> > >> 0: invalid
> > >> + *
> > >> + */
> > >> +const static int
> > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > >> +	/*
> > >> +	 * PASID based IOTLB invalidation: PASID selective (per
> > >> PASID),
> > >> +	 * page selective (address granularity)
> > >> +	 */
> > >> +	{0, 1, 1},
> > >> +	/* PASID based dev TLBs, only support all PASIDs or
> > >> single PASID */
> > >> +	{1, 1, 0},  
> > >
> > > Is this combination correct? when single PASID is being
> > > specified, it is essentially a page-selective invalidation since
> > > you need provide Address and Size.
> > >  
> > >> +	/* PASID cache */  
> > >
> > > PASID cache is fully managed by the host. Guest PASID cache
> > > invalidation is interpreted by vIOMMU for bind and unbind
> > > operations. I don't think we should accept any PASID cache
> > > invalidation from userspace or guest.  
> > I tend to agree here.  
> > >  
> > >> +	{1, 1, 0}
> > >> +};
> > >> +
> > >> +const static int
> > >> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > >> +	/* PASID based IOTLB */
> > >> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > >> +	/* PASID based dev TLBs */
> > >> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > >> +	/* PASID cache */
> > >> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > >> +};
> > >> +
> > >> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> > >> +{
> > >> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> > >> +		!inv_type_granu_map[type][granu])
> > >> +		return -EINVAL;
> > >> +
> > >> +	*vtd_granu = inv_type_granu_table[type][granu];
> > >> +  
> > >
> > > btw do we really need both map and table here? Can't we just
> > > use one table with unsupported granularity marked as a special
> > > value?  
> > I asked the same question some time ago. If I remember correctly the
> > issue is while a granu can be supported in inv_type_granu_map, the
> > associated value in inv_type_granu_table can be 0. This typically
> > matches both values of G field (0 or 1) in the invalidation cmd. See
> > other comment below.  
> 
> I didn't fully understand it. Also what does a value '0' imply? also
> it's interesting to see below in [PATCH 07/11]:
> 
A 0 in the 2D map array means invalid.
A 0 in the granu table can be either valid or invalid.
That is why we need the map table to tell the difference.
I will add the following comment since this causes lots of confusion.

 * Granu_map array indicates validity of the table. 1: valid, 0: invalid
 * This is useful when the entry in the granu table has a value of 0,
 * which can be a valid or invalid value.

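Concretely (values from patch 08/11): inv_type_granu_table[1][1] and
inv_type_granu_table[0][0] are both 0, but the map disambiguates them:

	inv_type_granu_map[1][1] == 1	/* dev TLB, PASID-sel: the table's 0 is a real granu (G=0) */
	inv_type_granu_map[0][0] == 0	/* IOTLB, domain-sel: the table's 0 means unsupported */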

> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		1
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> +
> 
Sorry, I didn't get the point. These are the valid VT-d granu values,
per spec CH 6.5.2.6.

> > >  
> > >> +	return 0;
> > >> +}
> > >> +
> > >> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > >> +{
> > >> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> > >> +
> > >> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> > >> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> > >> +	 * granu size in contiguous memory.
> > >> +	 */
> > >> +	return order_base_2(nr_pages);
> > >> +}
> > >> +
> > >> +#ifdef CONFIG_INTEL_IOMMU_SVM
> > >> +static int intel_iommu_sva_invalidate(struct iommu_domain
> > >> *domain,
> > >> +		struct device *dev, struct
> > >> iommu_cache_invalidate_info *inv_info)
> > >> +{
> > >> +	struct dmar_domain *dmar_domain =
> > >> to_dmar_domain(domain);
> > >> +	struct device_domain_info *info;
> > >> +	struct intel_iommu *iommu;
> > >> +	unsigned long flags;
> > >> +	int cache_type;
> > >> +	u8 bus, devfn;
> > >> +	u16 did, sid;
> > >> +	int ret = 0;
> > >> +	u64 size = 0;
> > >> +
> > >> +	if (!inv_info || !dmar_domain ||
> > >> +		inv_info->version !=
> > >> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > >> +		return -EINVAL;
> > >> +
> > >> +	if (!dev || !dev_is_pci(dev))
> > >> +		return -ENODEV;
> > >> +
> > >> +	iommu = device_to_iommu(dev, &bus, &devfn);
> > >> +	if (!iommu)
> > >> +		return -ENODEV;
> > >> +
> > >> +	spin_lock_irqsave(&device_domain_lock, flags);
> > >> +	spin_lock(&iommu->lock);
> > >> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> > >> devfn);
> > >> +	if (!info) {
> > >> +		ret = -EINVAL;
> > >> +		goto out_unlock;  
> > >
> > > -ENOTSUPP?
> > >  
> > >> +	}
> > >> +	did = dmar_domain->iommu_did[iommu->seq_id];
> > >> +	sid = PCI_DEVID(bus, devfn);
> > >> +
> > >> +	/* Size is only valid in non-PASID selective
> > >> invalidation */
> > >> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > >> +		size =
> > >> to_vtd_size(inv_info->addr_info.granule_size,
> > >> +
> > >> inv_info->addr_info.nb_granules); +
> > >> +	for_each_set_bit(cache_type, (unsigned long
> > >> *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> > >> +		int granu = 0;
> > >> +		u64 pasid = 0;
> > >> +
> > >> +		ret = to_vtd_granularity(cache_type,
> > >> inv_info->granularity, &granu);
> > >> +		if (ret) {
> > >> +			pr_err("Invalid cache type and granu
> > >> combination %d/%d\n", cache_type,
> > >> +				inv_info->granularity);
> > >> +			break;
> > >> +		}
> > >> +
> > >> +		/* PASID is stored in different locations based
> > >> on granularity */
> > >> +		if (inv_info->granularity ==
> > >> IOMMU_INV_GRANU_PASID &&
> > >> +			inv_info->pasid_info.flags &
> > >> IOMMU_INV_PASID_FLAGS_PASID)
> > >> +			pasid = inv_info->pasid_info.pasid;
> > >> +		else if (inv_info->granularity ==
> > >> IOMMU_INV_GRANU_ADDR &&
> > >> +			inv_info->addr_info.flags &
> > >> IOMMU_INV_ADDR_FLAGS_PASID)
> > >> +			pasid = inv_info->addr_info.pasid;
> > >> +		else {
> > >> +			pr_err("Cannot find PASID for given
> > >> cache type and granularity\n");
> > >> +			break;
> > >> +		}
> > >> +
> > >> +		switch (BIT(cache_type)) {
> > >> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > >> +			if ((inv_info->granularity !=
> > >> IOMMU_INV_GRANU_PASID) &&  
> > >
> > > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> > >  
> > >> +				size &&
> > >> (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1)))
> > >> {
> > >> +				pr_err("Address out of range,
> > >> 0x%llx, size order %llu\n",
> > >> +
> > >> inv_info->addr_info.addr, size);
> > >> +				ret = -ERANGE;
> > >> +				goto out_unlock;
> > >> +			}
> > >> +
> > >> +			qi_flush_piotlb(iommu, did, pasid,
> > >> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> > >> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > >> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> > >> +
> > >> +			/*
> > >> +			 * Always flush device IOTLB if ATS is
> > >> enabled since guest
> > >> +			 * vIOMMU exposes CM = 1, no device
> > >> IOTLB flush will be passed
> > >> +			 * down.
> > >> +			 */  
> > >
> > > Does VT-d spec mention that no device IOTLB flush is required
> > > when CM=1? 
> > >> +			if (info->ats_enabled) {
> > >> +				qi_flush_dev_iotlb_pasid(iommu,
> > >> sid, info-  
> > >>> pfsid,  
> > >> +						pasid,
> > >> info->ats_qdep,
> > >> +
> > >> inv_info->addr_info.addr, size,
> > >> +						granu);
> > >> +			}
> > >> +			break;
> > >> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > >> +			if (info->ats_enabled) {
> > >> +				qi_flush_dev_iotlb_pasid(iommu,
> > >> sid, info-  
> > >>> pfsid,  
> > >> +
> > >> inv_info->addr_info.pasid, info->ats_qdep,
> > >> +
> > >> inv_info->addr_info.addr, size,
> > >> +						granu);  
> > >
> > > I'm confused here. There are two granularities allowed for
> > > devtlb, but here you only handle one of them?  
> > granu is the result of to_vtd_granularity() so it can take either
> > of the 2 values.  
> 
> yes, you're right. 
> 
> > 
> > Thanks
> > 
> > Eric  
> > >  
> > >> +			} else
> > >> +				pr_warn("Passdown device IOTLB
> > >> flush w/o ATS!\n");
> > >> +
> > >> +			break;
> > >> +		case IOMMU_CACHE_INV_TYPE_PASID:
> > >> +			qi_flush_pasid_cache(iommu, did, granu,
> > >> inv_info-  
> > >>> pasid_info.pasid);  
> > >> +  
> > >
> > > as earlier comment, we shouldn't allow userspace or guest to
> > > invalidate PASID cache
> > >  
> > >> +			break;
> > >> +		default:
> > >> +			dev_err(dev, "Unsupported IOMMU
> > >> invalidation type %d\n",
> > >> +				cache_type);
> > >> +			ret = -EINVAL;
> > >> +		}
> > >> +	}
> > >> +out_unlock:
> > >> +	spin_unlock(&iommu->lock);
> > >> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > >> +
> > >> +	return ret;
> > >> +}
> > >> +#endif
> > >> +
> > >>  static int intel_iommu_map(struct iommu_domain *domain,
> > >>  			   unsigned long iova, phys_addr_t hpa,
> > >>  			   size_t size, int iommu_prot, gfp_t
> > >> gfp) @@ -6204,6 +6385,7 @@ const struct iommu_ops
> > >> intel_iommu_ops = { .is_attach_deferred	=
> > >> intel_iommu_is_attach_deferred, .pgsize_bitmap		=
> > >> INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
> > >> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> > >>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > >>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > >>  #endif
> > >> --
> > >> 2.7.4  
> > >  
> 

[Jacob Pan]

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-29 16:05   ` Auger Eric
@ 2020-03-31 22:28     ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 22:28 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Sun, 29 Mar 2020 18:05:47 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> > 
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> > 
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> > 
> > ---
> > v7 review fixed in v10
> > ---
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 182 insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index b1477cd423dd..a76afb0fd51a 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >  }
> >  
> > +/*
> > + * 2D array for converting and sanitizing IOMMU generic TLB
> > granularity to
> > + * VT-d granularity. Invalidation is typically included in the
> > unmap operation
> > + * as a result of DMA or VFIO unmap. However, for assigned devices
> > guest
> > + * owns the first level page tables. Invalidations of translation
> > caches in the
> > + * guest are trapped and passed down to the host.
> > + *
> > + * vIOMMU in the guest will only expose first level page tables,
> > therefore
> > + * we do not include IOTLB granularity for request without PASID
> > (second level).
> > + *
> > + * For example, to find the VT-d granularity encoding for IOTLB
> > + * type and page selective granularity within PASID:
> > + * X: indexed by iommu cache type
> > + * Y: indexed by enum iommu_inv_granularity
> > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > + *
> > + * Granu_map array indicates validity of the table. 1: valid, 0:
> > invalid
> > + *
> > + */
> > +const static int
> > inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > +	/*
> > +	 * PASID based IOTLB invalidation: PASID selective (per
> > PASID),
> > +	 * page selective (address granularity)
> > +	 */
> > +	{0, 1, 1},
> > +	/* PASID based dev TLBs, only support all PASIDs or single
> > PASID */
> > +	{1, 1, 0},
> > +	/* PASID cache */
> > +	{1, 1, 0}
> > +};
> > +
> > +const static int
> > inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] =
> > {
> > +	/* PASID based IOTLB */
> > +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > +	/* PASID based dev TLBs */
> > +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > +	/* PASID cache */
> > +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > +};
> > +
> > +static inline int to_vtd_granularity(int type, int granu, int
> > *vtd_granu) +{
> > +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
> > IOMMU_INV_GRANU_NR ||
> > +		!inv_type_granu_map[type][granu])
> > +		return -EINVAL;
> > +
> > +	*vtd_granu = inv_type_granu_table[type][granu];
> > +
> > +	return 0;
> > +}
> > +
> > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > +{
> > +	u64 nr_pages = (granu_size * nr_granules) >>
> > VTD_PAGE_SHIFT; +
> > +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
> > for 2MB, etc.
> > +	 * IOMMU cache invalidate API passes granu_size in bytes,
> > and number of
> > +	 * granu size in contiguous memory.
> > +	 */
> > +	return order_base_2(nr_pages);
> > +}
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct
> > iommu_cache_invalidate_info *inv_info) +{
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct intel_iommu *iommu;
> > +	unsigned long flags;
> > +	int cache_type;
> > +	u8 bus, devfn;
> > +	u16 did, sid;
> > +	int ret = 0;
> > +	u64 size = 0;
> > +
> > +	if (!inv_info || !dmar_domain ||
> > +		inv_info->version !=
> > IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > +		return -EINVAL;
> > +
> > +	if (!dev || !dev_is_pci(dev))
> > +		return -ENODEV;
> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +
> > +	spin_lock_irqsave(&device_domain_lock, flags);
> > +	spin_lock(&iommu->lock);
> > +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> > devfn);
> > +	if (!info) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > +	sid = PCI_DEVID(bus, devfn);
> > +
> > +	/* Size is only valid in non-PASID selective invalidation
> > */
> > +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > +		size =
> > to_vtd_size(inv_info->addr_info.granule_size,
> > +
> > inv_info->addr_info.nb_granules); +
> > +	for_each_set_bit(cache_type, (unsigned long
> > *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> > +		int granu = 0;
> > +		u64 pasid = 0;
> > +
> > +		ret = to_vtd_granularity(cache_type,
> > inv_info->granularity, &granu);
> > +		if (ret) {
> > +			pr_err("Invalid cache type and granu
> > combination %d/%d\n", cache_type,
> > +				inv_info->granularity);
> > +			break;
> > +		}
> > +
> > +		/* PASID is stored in different locations based on
> > granularity */
> > +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID
> > &&
> > +			inv_info->pasid_info.flags &
> > IOMMU_INV_PASID_FLAGS_PASID)
> > +			pasid = inv_info->pasid_info.pasid;
> > +		else if (inv_info->granularity ==
> > IOMMU_INV_GRANU_ADDR &&
> > +			inv_info->addr_info.flags &
> > IOMMU_INV_ADDR_FLAGS_PASID)
> > +			pasid = inv_info->addr_info.pasid;
> > +		else {
> > +			pr_err("Cannot find PASID for given cache
> > type and granularity\n");  
> I don't get this error msg. In case of domain-selective invalidation,
> PASID is not used, so if I am not wrong you will end up here even
> though there is no issue.
Right, I will remove the else.

> > +			break;
> > +		}
> > +
> > +		switch (BIT(cache_type)) {
> > +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > +			if ((inv_info->granularity !=
> > IOMMU_INV_GRANU_PASID) &&
> > +				size && (inv_info->addr_info.addr
> > & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > +				pr_err("Address out of range,
> > 0x%llx, size order %llu\n",
> > +					inv_info->addr_info.addr,
> > size);
> > +				ret = -ERANGE;
> > +				goto out_unlock;
> > +			}
> > +
> > +			qi_flush_piotlb(iommu, did, pasid,
> > +					mm_to_dma_pfn(inv_info->addr_info.addr),
> > +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> > +
> > +			/*
> > +			 * Always flush device IOTLB if ATS is
> > enabled since guest
> > +			 * vIOMMU exposes CM = 1, no device IOTLB
> > flush will be passed
> > +			 * down.
> > +			 */
> > +			if (info->ats_enabled) {  
> nit {} not requested
Will fix. same for the remaining.

Thanks!

Jacob

> > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > +						pasid, info->ats_qdep,
> > +						inv_info->addr_info.addr, size,
> > +						granu);
> > +			}
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > +			if (info->ats_enabled) {  
> nit {} not requested
> > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > +						inv_info->addr_info.pasid, info->ats_qdep,
> > +						inv_info->addr_info.addr, size,
> > +						granu);
> > +			} else
> > +				pr_warn("Passdown device IOTLB
> > flush w/o ATS!\n");
> > +  
> nit: extra line
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_PASID:
> > +			qi_flush_pasid_cache(iommu, did, granu,
> > inv_info->pasid_info.pasid);
> > +  
> nit: extra line
> > +			break;
> > +		default:
> > +			dev_err(dev, "Unsupported IOMMU
> > invalidation type %d\n",
> > +				cache_type);
> > +			ret = -EINVAL;
> > +		}
> > +	}
> > +out_unlock:
> > +	spin_unlock(&iommu->lock);
> > +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > +
> > +	return ret;
> > +}
> > +#endif
> > +
> >  static int intel_iommu_map(struct iommu_domain *domain,
> >  			   unsigned long iova, phys_addr_t hpa,
> >  			   size_t size, int iommu_prot, gfp_t gfp)
> > @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.is_attach_deferred	=
> > intel_iommu_is_attach_deferred, .pgsize_bitmap		=
> > INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >  #endif
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]

* Re: [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register
  2020-03-28 10:04   ` Tian, Kevin
@ 2020-03-31 22:33     ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 22:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Sat, 28 Mar 2020 10:04:38 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > Virtual command registers are used in the guest only. To avoid
> > vmexit cost, we cache the capability register value during
> > initialization.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> > 
> > ---
> > v7 Reviewed by Eric & Baolu
> > ---
> > ---
> >  drivers/iommu/dmar.c        | 1 +
> >  include/linux/intel-iommu.h | 5 +++++
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index 4d6b7b5b37ee..3b36491c8bbb 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
> >  		warn_invalid_dmar(phys_addr, " returns all ones");
> >  		goto unmap;
> >  	}
> > +	iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> > 
> >  	/* the registers might be more than one page */
> >  	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 43539713b3b3..ccbf164fb711 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -194,6 +194,9 @@
> >  #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
> >  #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
> > 
> > +/* Virtual command interface capabilities */  
> 
> capabilities -> capability
Will do; I was thinking of the future :)
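
FWIW, the benefit of caching DMAR_VCCAP_REG is that a later capability
check is a plain memory read rather than an MMIO access (i.e. no vmexit
in the guest). A minimal caller sketch, with the surrounding context
assumed:

	if (vccap_pasid(iommu->vccap))
		/* virtual command PASID allocation is usable */
		ret = vcmd_alloc_pasid(iommu, &pasid);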

Thanks,

Jacob
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> > +#define vccap_pasid(v)		((v & DMA_VCS_PAS)) /* PASID allocation */
> > +
> >  /* IOTLB_REG */
> >  #define DMA_TLB_FLUSH_GRANU_OFFSET  60
> >  #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
> > @@ -287,6 +290,7 @@
> > 
> >  /* PRS_REG */
> >  #define DMA_PRS_PPR	((u32)1)
> > +#define DMA_VCS_PAS	((u64)1)
> > 
> >  #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)	\
> >  do {							\
> > @@ -537,6 +541,7 @@ struct intel_iommu {
> >  	u64		reg_size; /* size of hw register set */
> >  	u64		cap;
> >  	u64		ecap;
> > +	u64		vccap;
> >  	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
> >  	raw_spinlock_t	register_lock; /* protect register handling */
> >  	int		seq_id;	/* sequence id of the iommu */
> > --
> > 2.7.4
> 

[Jacob Pan]

* Re: [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation
  2020-03-28 10:08   ` Tian, Kevin
@ 2020-03-31 22:37     ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-03-31 22:37 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Sat, 28 Mar 2020 10:08:52 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> > 
> > Enabling IOMMU in a guest requires communication with the host
> > driver for certain aspects. Use of a PASID to enable Shared Virtual
> > Addressing (SVA) requires managing PASIDs in the host. The VT-d 3.0
> > spec provides a Virtual Command Register (VCMD) to facilitate this.
> > Writes to this register in the guest are trapped by QEMU, which
> > proxies the call to the host driver.
> 
> Qemu -> vIOMMU
> 
Sounds good. Thanks!
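
For anyone following along, the register protocol below is: write the
command (plus operand) to DMAR_VCMD_REG, poll DMAR_VCRSP_REG until the
In Progress (IP) bit clears, then decode the status code and result.
Condensed from vcmd_alloc_pasid() below:

	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
	/* spin until hardware clears IP in the response register */
	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
		      !(res & VCMD_VRSP_IP), res);
	if (VCMD_VRSP_SC(res) == VCMD_VRSP_SC_SUCCESS)
		*pasid = VCMD_VRSP_RESULT_PASID(res);	/* 20-bit PASID */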
> > 
> > This virtual command interface consists of a capability register,
> > a virtual command register, and a virtual response register. Refer
> > to section 10.4.42, 10.4.43, 10.4.44 for more information.
> > 
> > This patch adds the enlightened PASID allocation/free interfaces
> > via the virtual command interface.
> > 
> > Cc: Ashok Raj <ashok.raj@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > ---
> >  drivers/iommu/intel-pasid.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
> >  drivers/iommu/intel-pasid.h | 13 ++++++++++-
> >  include/linux/intel-iommu.h |  1 +
> >  3 files changed, 70 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > index 9f6d07410722..e87ad67aad36 100644
> > --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -27,6 +27,63 @@
> >  static DEFINE_SPINLOCK(pasid_lock);
> >  u32 intel_pasid_max_id = PASID_MAX;
> > 
> > +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
> > +{
> > +	unsigned long flags;
> > +	u8 status_code;
> > +	int ret = 0;
> > +	u64 res;
> > +
> > +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> > +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> > +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> > +		      !(res & VCMD_VRSP_IP), res);
> > +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> > +
> > +	status_code = VCMD_VRSP_SC(res);
> > +	switch (status_code) {
> > +	case VCMD_VRSP_SC_SUCCESS:
> > +		*pasid = VCMD_VRSP_RESULT_PASID(res);
> > +		break;
> > +	case VCMD_VRSP_SC_NO_PASID_AVAIL:
> > +		pr_info("IOMMU: %s: No PASID available\n",
> > iommu->name);
> > +		ret = -ENOSPC;
> > +		break;
> > +	default:
> > +		ret = -ENODEV;
> > +		pr_warn("IOMMU: %s: Unexpected error code %d\n",
> > +			iommu->name, status_code);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> > +{
> > +	unsigned long flags;
> > +	u8 status_code;
> > +	u64 res;
> > +
> > +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> > +	dmar_writeq(iommu->reg + DMAR_VCMD_REG,
> > +		    VCMD_CMD_OPERAND(pasid) | VCMD_CMD_FREE);
> > +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> > +		      !(res & VCMD_VRSP_IP), res);
> > +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> > +
> > +	status_code = VCMD_VRSP_SC(res);
> > +	switch (status_code) {
> > +	case VCMD_VRSP_SC_SUCCESS:
> > +		break;
> > +	case VCMD_VRSP_SC_INVALID_PASID:
> > +		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> > +		break;
> > +	default:
> > +		pr_warn("IOMMU: %s: Unexpected error code %d\n",
> > +			iommu->name, status_code);
> > +	}
> > +}
> > +
> >  /*
> >   * Per device pasid table management:
> >   */
> > diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> > index 698015ee3f04..cd3d63f3e936 100644
> > --- a/drivers/iommu/intel-pasid.h
> > +++ b/drivers/iommu/intel-pasid.h
> > @@ -23,6 +23,16 @@
> >  #define is_pasid_enabled(entry)		(((entry)->lo >> 3) & 0x1)
> >  #define get_pasid_dir_size(entry)	(1 << ((((entry)->lo >> 9) & 0x7) + 7))
> > 
> > +/* Virtual command interface for enlightened pasid management. */
> > +#define VCMD_CMD_ALLOC			0x1
> > +#define VCMD_CMD_FREE			0x2
> > +#define VCMD_VRSP_IP			0x1
> > +#define VCMD_VRSP_SC(e)			(((e) >> 1) & 0x3)
> > +#define VCMD_VRSP_SC_SUCCESS		0
> > +#define VCMD_VRSP_SC_NO_PASID_AVAIL	1
> > +#define VCMD_VRSP_SC_INVALID_PASID	1
> > +#define VCMD_VRSP_RESULT_PASID(e)	(((e) >> 8) & 0xfffff)
> > +#define VCMD_CMD_OPERAND(e)		((e) << 8)
> >  /*
> >   * Domain ID reserved for pasid entries programmed for first-level
> >   * only and pass-through transfer modes.
> > @@ -113,5 +123,6 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu,
> >  			int addr_width);
> >  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> >  				 struct device *dev, int pasid);
> > -
> > +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
> > +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
> >  #endif /* __INTEL_PASID_H */
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index ccbf164fb711..9cbf5357138b 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -169,6 +169,7 @@
> >  #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
> >  #define ecap_flts(e)		(((e) >> 47) & 0x1)
> >  #define ecap_slts(e)		(((e) >> 46) & 0x1)
> > +#define ecap_vcs(e)		(((e) >> 44) & 0x1)
> >  #define ecap_smts(e)		(((e) >> 43) & 0x1)
> >  #define ecap_dit(e)		((e >> 41) & 0x1)
> >  #define ecap_pasid(e)		((e >> 40) & 0x1)
> > --
> > 2.7.4  
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 

[Jacob Pan]

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-31 18:13     ` Jacob Pan
@ 2020-04-01  6:24       ` Tian, Kevin
  2020-04-01  6:57         ` Liu, Yi L
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-04-01  6:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Wednesday, April 1, 2020 2:14 AM
> 
> On Sat, 28 Mar 2020 10:01:42 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > vIOMMU, we need to provide invalidation support at IOMMU API and
> > > driver level. This patch adds Intel VT-d specific function to
> > > implement iommu passdown invalidate API for shared virtual address.
> > >
> > > The use case is for supporting caching structure invalidation
> > > of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> True, but it does not invalidate this statement about emulated IOMMU. I
> will add another statement saying "the same interface can be used for
> virtio-IOMMU as well". OK?

sure

> 
> > > invalidation capability and passes down all descriptors from the
> > > guest to the physical IOMMU.
> > >
> > > The assumption is that guest to host device ID mapping should be
> > > resolved prior to calling IOMMU driver. Based on the device handle,
> > > host IOMMU driver can replace certain fields before submit to the
> > > invalidation queue.
> > >
> > > ---
> > > v7 review fixed in v10
> > > ---
> > >
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > ---
> > >  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 182 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > index b1477cd423dd..a76afb0fd51a 100644
> > > --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -5619,6 +5619,187 @@ static void
> > > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > >  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > >  }
> > >
> > > +/*
> > > + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> > > + * VT-d granularity. Invalidation is typically included in the unmap operation
> > > + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > > + * owns the first level page tables. Invalidations of translation caches in the
> > > + * guest are trapped and passed down to the host.
> > > + *
> > > + * vIOMMU in the guest will only expose first level page tables, therefore
> > > + * we do not include IOTLB granularity for request without PASID (second level).
> >
> > I would revise above as "We do not support IOTLB granularity for
> > request without PASID (second level), therefore any vIOMMU
> > implementation that exposes the SVA capability to the guest should
> > only expose the first level page tables, implying all invalidation
> > requests from the guest will include a valid PASID"
> >
> Sounds good.
> 
> > > + *
> > > + * For example, to find the VT-d granularity encoding for IOTLB
> > > + * type and page selective granularity within PASID:
> > > + * X: indexed by iommu cache type
> > > + * Y: indexed by enum iommu_inv_granularity
> > > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > + *
> > > + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> > > + *
> > > + */
> > > +const static int
> > > inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > +	/*
> > > +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > > +	 * page selective (address granularity)
> > > +	 */
> > > +	{0, 1, 1},
> > > +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > > +	{1, 1, 0},
> >
> > Is this combination correct? When a single PASID is being specified, it
> > is essentially a page-selective invalidation, since you need to provide
> > an Address and Size.
> >
> This is for translation from the generic UAPI granu to the VT-d granu;
> it has nothing to do with address and size.

Generic UAPI defines three granularities: domain, pasid and addr. From
the definition, domain applies to all entries related to the did, pasid
applies to all entries related to the pasid, while addr is specific to
a range.

From what we just confirmed internally with the VT-d spec owner, our
PASID based dev TLB invalidation always requires addr and size, while
the current uAPI doesn't support range invalidation based on multiple
PASIDs. It sounds to me that you want to use domain to cover the
multiple-PASIDs case (G=1), but that changes the meaning of the domain
granularity and easily leads to confusion.

I feel Eric's proposal makes more sense. Here we'd better use {0, 0, 1}
to indicate that only addr range invalidation is allowed, matching the
spec definition. We may use a special flag in iommu_inv_addr_info to
indicate the G=1 case, if necessary.
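To make the {0, 0, 1} suggestion concrete, the dev TLB row of
inv_type_granu_map would become (a sketch of the proposal, not the
posted patch):

	/* PASID based dev TLBs: VT-d always takes addr/size, so only the
	 * IOMMU_INV_GRANU_ADDR column is valid; G=1 would be signaled by
	 * a flag in iommu_inv_addr_info instead */
	{0, 0, 1},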

> e.g.
> If user passes IOMMU_INV_GRANU_PASID for the single PASID case as you
> mentioned, this map table shows it is valid.
> 
> Then the lookup result will get VT-d granu:
> QI_DEV_IOTLB_GRAN_PASID_SEL, which means G=0.
> 
> 
> > > +	/* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache
> > invalidation is interpreted by vIOMMU for bind and unbind operations.
> > I don't think we should accept any PASID cache invalidation from
> > userspace or guest.
> >
> 
> True for vIOMMU; this is here for completeness. It can be used by
> virtio-IOMMU: since the PASID cache flush is inclusive (IOTLB, devTLB),
> it is more efficient.

I think it is conceptually incorrect. We should not allow the userspace
or guest to request an operation which is beyond its privilege (just
because doing so may bring some performance benefit). You can always
introduce a new cmd for such a purpose.

> 
> > > +	{1, 1, 0}
> > > +};
> > > +
> > > +const static int
> > > inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > +	/* PASID based IOTLB */
> > > +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > > +	/* PASID based dev TLBs */
> > > +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > > +	/* PASID cache */
> > > +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > > +};
> > > +
> > > +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> > > +{
> > > +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> > > +		!inv_type_granu_map[type][granu])
> > > +		return -EINVAL;
> > > +
> > > +	*vtd_granu = inv_type_granu_table[type][granu];
> > > +
> >
> > btw do we really need both map and table here? Can't we just
> > use one table with unsupported granularity marked as a special
> > value?
> >
> Yes: for the case where the map value is 1 but the table value is 0,
> e.g. G=0 but still valid.
> 
> > > +	return 0;
> > > +}
> > > +
> > > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > > +{
> > > +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> > > +
> > > +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> > > +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> > > +	 * granu size in contiguous memory.
> > > +	 */
> > > +	return order_base_2(nr_pages);
> > > +}
> > > +
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > > +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> > > +{
> > > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > +	struct device_domain_info *info;
> > > +	struct intel_iommu *iommu;
> > > +	unsigned long flags;
> > > +	int cache_type;
> > > +	u8 bus, devfn;
> > > +	u16 did, sid;
> > > +	int ret = 0;
> > > +	u64 size = 0;
> > > +
> > > +	if (!inv_info || !dmar_domain ||
> > > +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > > +		return -EINVAL;
> > > +
> > > +	if (!dev || !dev_is_pci(dev))
> > > +		return -ENODEV;
> > > +
> > > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > > +	if (!iommu)
> > > +		return -ENODEV;
> > > +
> > > +	spin_lock_irqsave(&device_domain_lock, flags);
> > > +	spin_lock(&iommu->lock);
> > > +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> > > +	if (!info) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> >
> > -ENOTSUPP?
> >
> I guess it can go either way in that the error is based on invalid
> inputs.
> 
> > > +	}
> > > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > > +	sid = PCI_DEVID(bus, devfn);
> > > +
> > > +	/* Size is only valid in non-PASID selective invalidation */
> > > +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > > +		size = to_vtd_size(inv_info->addr_info.granule_size,
> > > +				   inv_info->addr_info.nb_granules);
> > > +
> > > +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
> > > +			 IOMMU_CACHE_INV_TYPE_NR) {
> > > +		int granu = 0;
> > > +		u64 pasid = 0;
> > > +
> > > +		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
> > > +		if (ret) {
> > > +			pr_err("Invalid cache type and granu combination %d/%d\n",
> > > +				cache_type, inv_info->granularity);
> > > +			break;
> > > +		}
> > > +
> > > +		/* PASID is stored in different locations based on granularity */
> > > +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> > > +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> > > +			pasid = inv_info->pasid_info.pasid;
> > > +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> > > +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> > > +			pasid = inv_info->addr_info.pasid;
> > > +		else {
> > > +			pr_err("Cannot find PASID for given cache type and granularity\n");
> > > +			break;
> > > +		}
> > > +
> > > +		switch (BIT(cache_type)) {
> > > +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > > +			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&
> >
> > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> >
> Good point! will fix.
> 
> > > +				size && (inv_info->addr_info.addr &
> > > +				((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > > +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> > > +					inv_info->addr_info.addr, size);
> > > +				ret = -ERANGE;
> > > +				goto out_unlock;
> > > +			}
> > > +
> > > +			qi_flush_piotlb(iommu, did, pasid,
> > > +					mm_to_dma_pfn(inv_info->addr_info.addr),
> > > +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > > +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> > > +
> > > +			/*
> > > +			 * Always flush device IOTLB if ATS is enabled since guest
> > > +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be
> > > +			 * passed down.
> > > +			 */
> >
> > Does VT-d spec mention that no device IOTLB flush is required when
> > CM=1?
> >
> Not explicitly. Just following the guideline in CH6.1 for efficient
> virtualization. Early on, we also had a discussion about supporting
> virtio, where the IOTLB flush is inclusive.
> Let me rephrase the comment:
> /*
>  * Always flush device IOTLB if ATS is enabled. vIOMMU
>  * in the guest may assume IOTLB flush is inclusive,
>  * which is more efficient.
>  */

this looks better.

> 
> 
> > > +			if (info->ats_enabled) {
> > > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > > +						pasid, info->ats_qdep,
> > > +						inv_info->addr_info.addr, size,
> > > +						granu);
> > > +			}
> > > +			break;
> > > +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > > +			if (info->ats_enabled) {
> > > +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > > +						inv_info->addr_info.pasid, info->ats_qdep,
> > > +						inv_info->addr_info.addr, size,
> > > +						granu);
> >
> > I'm confused here. There are two granularities allowed for devtlb,
> > but here you only handle one of them?
> >
> granu is passed into the flush function, which can be 1 or 0.
> 
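Spelling that out against the two tables in the patch (illustrative
only; type 1 is the bit position of IOMMU_CACHE_INV_TYPE_DEV_IOTLB):

	int granu;

	to_vtd_granularity(1, IOMMU_INV_GRANU_DOMAIN, &granu);
	/* granu == QI_DEV_IOTLB_GRAN_ALL (1): G=1, all PASIDs */

	to_vtd_granularity(1, IOMMU_INV_GRANU_PASID, &granu);
	/* granu == QI_DEV_IOTLB_GRAN_PASID_SEL (0): G=0, single PASID */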
> > > +			} else
> > > +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> > > +
> > > +			break;
> > > +		case IOMMU_CACHE_INV_TYPE_PASID:
> > > +			qi_flush_pasid_cache(iommu, did, granu,
> > > +					     inv_info->pasid_info.pasid);
> > > +
> >
> > as earlier comment, we shouldn't allow userspace or guest to
> > invalidate PASID cache
> >
> same explanation :)
> 
> > > +			break;
> > > +		default:
> > > +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> > > +				cache_type);
> > > +			ret = -EINVAL;
> > > +		}
> > > +	}
> > > +out_unlock:
> > > +	spin_unlock(&iommu->lock);
> > > +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > > +
> > > +	return ret;
> > > +}
> > > +#endif
> > > +
> > >  static int intel_iommu_map(struct iommu_domain *domain,
> > >  			   unsigned long iova, phys_addr_t hpa,
> > >  			   size_t size, int iommu_prot, gfp_t gfp)
> > > @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> > >  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> > >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > > +	.cache_invalidate	= intel_iommu_sva_invalidate,
> > >  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > >  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > >  #endif
> > > --
> > > 2.7.4
> >
> 
> [Jacob Pan]
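As a worked example of the size encoding discussed in this message
(values chosen for illustration; to_vtd_size() is from the patch):

	/* 512 contiguous 4KB granules = 2MB:
	 * nr_pages = (4096 * 512) >> VTD_PAGE_SHIFT = 512
	 * order_base_2(512) = 9, the VT-d encoding for 2MB
	 */
	u64 size = to_vtd_size(4096, 512);	/* returns 9 */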

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-31 20:58         ` Jacob Pan
@ 2020-04-01  6:29           ` Tian, Kevin
  2020-04-01  7:13             ` Liu, Yi L
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-04-01  6:29 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, LKML, iommu,
	David Woodhouse, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Wednesday, April 1, 2020 4:58 AM
> 
> On Tue, 31 Mar 2020 02:49:21 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Auger Eric <eric.auger@redhat.com>
> > > Sent: Sunday, March 29, 2020 11:34 PM
> > >
> > > Hi,
> > >
> > > On 3/28/20 11:01 AM, Tian, Kevin wrote:
> > > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > >> Sent: Saturday, March 21, 2020 7:28 AM
> > > >>
> > > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > > >> and
> > > driver
> > > >> level. This patch adds Intel VT-d specific function to implement
> > > >> iommu passdown invalidate API for shared virtual address.
> > > >>
> > > >> The use case is for supporting caching structure invalidation
> > > >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >  [...]
> >  [...]
> > > >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> > > >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > > >> + * owns the first level page tables. Invalidations of translation caches in the
> >  [...]
> >  [...]
> >  [...]
> > > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > >> +	/*
> > > >> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > > >> +	 * page selective (address granularity)
> > > >> +	 */
> > > >> +	{0, 1, 1},
> > > >> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > > >> +	{1, 1, 0},
> > > >
> > > > Is this combination correct? when single PASID is being
> > > > specified, it is essentially a page-selective invalidation since
> > > > you need provide Address and Size.
> > > Isn't it the same when G=1? Still the addr/size is used. Doesn't
> > > it
> >
> > I thought addr/size is not used when G=1, but it might be wrong. I'm
> > checking with our vt-d spec owner.
> >
> 
> > > correspond to IOMMU_INV_GRANU_ADDR with
> > > IOMMU_INV_ADDR_FLAGS_PASID flag
> > > unset?
> > >
> > > so {0, 0, 1}?
> >
> I am not sure I got your logic. The three fields correspond to
> > 	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
> > 	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
> > 	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
> 
> For devTLB, we use domain as global since there is no domain. Then I
> came up with {1, 1, 0}, which means we could have global and pasid
> granu invalidation for PASID based devTLB.
> 
> If the caller also provides addr and the S bit, the flush routine will put

"also" -> "must", because VT-d requires that addr/size be provided in
the devtlb descriptor; that is why Eric suggests {0, 0, 1}.

> that into QI descriptor. I know this is a little odd, but from the
> granu translation p.o.v. VT-d spec has no G bit for page selective
> invalidation.

We don't need such an odd way if we can do it properly. 😊

> 
> > I have one more open:
> >
> > How does userspace know which invalidation type/gran is supported?
> > I didn't see such capability reporting in Yi's VFIO vSVA patch set.
> > Do we want the user/kernel to assume the same capability set if they
> > are architectural? However, the kernel could also do some optimization,
> > e.g. hide the devtlb invalidation capability given that the kernel
> > already invalidates devtlb automatically when serving iotlb
> > invalidation...
> >
> In general, we are trending toward using the VFIO capability chain to
> expose iommu capabilities.
> 
> But for architectural features such as type/granu, we have to assume
> the same capability between host & guest. Granu and types are not
> enumerated on the host IOMMU either.
> 
> For the devTLB optimization, I agree we need to expose a capability to
> the guest stating that implicit devtlb invalidation is supported.
> Otherwise, a Linux guest running on other host OSes may not get
> implicit devtlb invalidation.
> 
> Right Yi?

Thanks for the explanation. So we are assumed to support all operations
defined in the spec, hence no need to expose them one-by-one. For the
optimization, I'm fine with doing it later. 

> 
> > Thanks
> > Kevin
> >
> > >
> > > Thanks
> > >
> > > Eric
> > >
> > > >
> > > >> +	/* PASID cache */
> > > >
> > > > PASID cache is fully managed by the host. Guest PASID cache
> > > > invalidation is interpreted by vIOMMU for bind and unbind
> > > > operations. I don't think we should accept any PASID cache
> > > > invalidation from userspace or guest.
> >  [...]
> > >
> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU
> >  [...]
> > > >
> > > > btw do we really need both map and table here? Can't we just
> > > > use one table with unsupported granularity marked as a special
> > > > value?
> > > >
> >  [...]
> > > >
> > > > -ENOTSUPP?
> > > >
> >  [...]
> > > >
> > > > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > > > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> > > >
> >  [...]
> > > >>> addr_info.addr),
> >  [...]
> >  [...]
> > > >> +			if (info->ats_enabled) {
> > > >> +				qi_flush_dev_iotlb_pasid(iommu,
> > > >> sid, info-
> > > >>> pfsid,
> >  [...]
> > > >>> pfsid,
> > > >> +
> > > >> inv_info->addr_info.pasid, info->ats_qdep,
> > > >> +
> > > >> inv_info->addr_info.addr, size,
> > > >> +						granu);
> >  [...]
> >  [...]
> > > >>> pasid_info.pasid);
> > > >> +
> > > >
> > > > as earlier comment, we shouldn't allow userspace or guest to
> > > > invalidate PASID cache
> > > >
> >  [...]
> > > >
> >
> 
> [Jacob Pan]

Thanks
Kevin

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-03-31 21:07         ` Jacob Pan
@ 2020-04-01  6:32           ` Tian, Kevin
  0 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2020-04-01  6:32 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, LKML, iommu,
	David Woodhouse, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Wednesday, April 1, 2020 5:08 AM
> 
> On Tue, 31 Mar 2020 03:34:22 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Auger Eric <eric.auger@redhat.com>
> > > Sent: Monday, March 30, 2020 12:05 AM
> > >
> > > On 3/28/20 11:01 AM, Tian, Kevin wrote:
> > > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > >> Sent: Saturday, March 21, 2020 7:28 AM
> > > >>
> > > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > > >> and
> > > driver
> > > >> level. This patch adds Intel VT-d specific function to implement
> > > >> iommu passdown invalidate API for shared virtual address.
> > > >>
> > > >> The use case is for supporting caching structure invalidation
> > > >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> > > >
> > > > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > > > interface as well.
> > > >
> > > >> invalidation capability and passes down all descriptors from the
> > > >> guest to the physical IOMMU.
> > > >>
> > > >> The assumption is that guest to host device ID mapping should be
> > > >> resolved prior to calling IOMMU driver. Based on the device
> > > >> handle, host IOMMU driver can replace certain fields before
> > > >> submit to the invalidation queue.
> > > >>
> > > >> ---
> > > >> v7 review fixed in v10
> > > >> ---
> > > >>
> > > >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > >> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > >> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > >> ---
> > > >>  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> > > >>  1 file changed, 182 insertions(+)
> > > >>
> > > >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > >> index b1477cd423dd..a76afb0fd51a 100644
> > > >> --- a/drivers/iommu/intel-iommu.c
> > > >> +++ b/drivers/iommu/intel-iommu.c
> > > >> @@ -5619,6 +5619,187 @@ static void
> > > >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > > >>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > > >>  }
> > > >>
> > > >> +/*
> > > >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> > > >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> > > >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > > >> + * owns the first level page tables. Invalidations of translation caches in the
> > > >> + * guest are trapped and passed down to the host.
> > > >> + *
> > > >> + * vIOMMU in the guest will only expose first level page tables, therefore
> > > >> + * we do not include IOTLB granularity for request without PASID (second level).
> > > >
> > > > I would revise above as "We do not support IOTLB granularity for
> > > > request without PASID (second level), therefore any vIOMMU
> > > > implementation that exposes the SVA capability to the guest
> > > > should only expose the first level page tables, implying all
> > > > invalidation requests from the guest will include a valid PASID"
> > > >
> > > >> + *
> > > >> + * For example, to find the VT-d granularity encoding for IOTLB
> > > >> + * type and page selective granularity within PASID:
> > > >> + * X: indexed by iommu cache type
> > > >> + * Y: indexed by enum iommu_inv_granularity
> > > >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > >> + *
> > > >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> > > >> + *
> > > >> + */
> > > >> +const static int
> > > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > >> +	/*
> > > >> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > > >> +	 * page selective (address granularity)
> > > >> +	 */
> > > >> +	{0, 1, 1},
> > > >> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > > >> +	{1, 1, 0},
> > > >
> > > > Is this combination correct? when single PASID is being
> > > > specified, it is essentially a page-selective invalidation since
> > > > you need provide Address and Size.
> > > >
> > > >> +	/* PASID cache */
> > > >
> > > > PASID cache is fully managed by the host. Guest PASID cache
> > > > invalidation is interpreted by vIOMMU for bind and unbind
> > > > operations. I don't think we should accept any PASID cache
> > > > invalidation from userspace or guest.
> > > I tend to agree here.
> > > >
> > > >> +	{1, 1, 0}
> > > >> +};
> > > >> +
> > > >> +const static int
> > > >> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > >> +	/* PASID based IOTLB */
> > > >> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > > >> +	/* PASID based dev TLBs */
> > > >> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > > >> +	/* PASID cache */
> > > >> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > > >> +};
> > > >> +
> > > >> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> > > >> +{
> > > >> +	if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
> > > >> +		!inv_type_granu_map[type][granu])
> > > >> +		!inv_type_granu_map[type][granu])
> > > >> +		return -EINVAL;
> > > >> +
> > > >> +	*vtd_granu = inv_type_granu_table[type][granu];
> > > >> +
> > > >
> > > > btw do we really need both map and table here? Can't we just
> > > > use one table with unsupported granularity marked as a special
> > > > value?
> > > I asked the same question some time ago. If I remember correctly the
> > > issue is while a granu can be supported in inv_type_granu_map, the
> > > associated value in inv_type_granu_table can be 0. This typically
> > > matches both values of G field (0 or 1) in the invalidation cmd. See
> > > other comment below.
> >
> > I didn't fully understand it. Also, what does a value of '0' imply?
> > It's also interesting to see below in [PATCH 07/11]:
> >
> A 0 in the 2D map array means invalid; a 0 in the granu table can be
> either valid or invalid. That is why we need the map table to tell the
> difference. I will add the following comment since this causes lots of
> confusion.
> 
>  * Granu_map array indicates validity of the table. 1: valid, 0: invalid
>  * This is useful when the entry in the granu table has a value of 0,
>  * which can be a valid or invalid value.
> 
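A concrete case behind that comment, using the values quoted in this
thread: for the dev TLB row (row 1) and IOMMU_INV_GRANU_PASID, the map
says valid (1) while the table yields QI_DEV_IOTLB_GRAN_PASID_SEL,
which is defined as 0 (G=0). Without the map, that legitimate 0 would
be indistinguishable from an unsupported combination:

	/* a valid lookup that nevertheless returns 0 */
	int granu = inv_type_granu_table[1][IOMMU_INV_GRANU_PASID];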
> 
> > +/* QI Dev-IOTLB inv granu */
> > +#define QI_DEV_IOTLB_GRAN_ALL		1
> > +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> > +
> >
> Sorry, I didn't get the point. These are the valid VT-d granu values,
> per spec CH 6.5.2.6.

Well, I thought '0' meant invalid, so I wondered why we would define a
valid granu as invalid. But with your latest explanation I think I get
the rationale now.

Just one more thought: since the element type in the granu table is
int, why not use -1 to represent invalid? Then you can still use one
table.
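A sketch of what that single-table variant might look like (a
hypothetical rewrite, not part of any posted patch; all names come from
the quoted code):

	/* -1 marks an unsupported type/granu combination; any value >= 0,
	 * including 0 (e.g. QI_DEV_IOTLB_GRAN_PASID_SEL), is a valid granu. */
	const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
		{-1, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
		{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, -1},
		{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, -1},
	};

	static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
	{
		if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >= IOMMU_INV_GRANU_NR ||
		    inv_type_granu_table[type][granu] < 0)
			return -EINVAL;

		*vtd_granu = inv_type_granu_table[type][granu];
		return 0;
	}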

> 
> > > >
> > > >> +	return 0;
> > > >> +}
> > > >> +
> > > >> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > > >> +{
> > > >> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> > > >> +
> > > >> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> > > >> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> > > >> +	 * granu size in contiguous memory.
> > > >> +	 */
> > > >> +	return order_base_2(nr_pages);
> > > >> +}
> > > >> +
> > > >> +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > >> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > > >> +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> > > >> +{
> > > >> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > >> +	struct device_domain_info *info;
> > > >> +	struct intel_iommu *iommu;
> > > >> +	unsigned long flags;
> > > >> +	int cache_type;
> > > >> +	u8 bus, devfn;
> > > >> +	u16 did, sid;
> > > >> +	int ret = 0;
> > > >> +	u64 size = 0;
> > > >> +
> > > >> +	if (!inv_info || !dmar_domain ||
> > > >> +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > > >> +		return -EINVAL;
> > > >> +
> > > >> +	if (!dev || !dev_is_pci(dev))
> > > >> +		return -ENODEV;
> > > >> +
> > > >> +	iommu = device_to_iommu(dev, &bus, &devfn);
> > > >> +	if (!iommu)
> > > >> +		return -ENODEV;
> > > >> +
> > > >> +	spin_lock_irqsave(&device_domain_lock, flags);
> > > >> +	spin_lock(&iommu->lock);
> > > >> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> > > >> +	if (!info) {
> > > >> +		ret = -EINVAL;
> > > >> +		goto out_unlock;
> > > >
> > > > -ENOTSUPP?
> > > >
> > > >> +	}
> > > >> +	did = dmar_domain->iommu_did[iommu->seq_id];
> > > >> +	sid = PCI_DEVID(bus, devfn);
> > > >> +
> > > >> +	/* Size is only valid in non-PASID selective invalidation */
> > > >> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > > >> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> > > >> +				   inv_info->addr_info.nb_granules);
> > > >> +
> > > >> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
> > > >> +			 IOMMU_CACHE_INV_TYPE_NR) {
> > > >> +		int granu = 0;
> > > >> +		u64 pasid = 0;
> > > >> +
> > > >> +		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
> > > >> +		if (ret) {
> > > >> +			pr_err("Invalid cache type and granu combination %d/%d\n",
> > > >> +				cache_type, inv_info->granularity);
> > > >> +			break;
> > > >> +		}
> > > >> +
> > > >> +		/* PASID is stored in different locations based on granularity */
> > > >> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> > > >> +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> > > >> +			pasid = inv_info->pasid_info.pasid;
> > > >> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> > > >> +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> > > >> +			pasid = inv_info->addr_info.pasid;
> > > >> +		else {
> > > >> +			pr_err("Cannot find PASID for given cache type and granularity\n");
> > > >> +			break;
> > > >> +		}
> > > >> +
> > > >> +		switch (BIT(cache_type)) {
> > > >> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > > >> +			if ((inv_info->granularity != IOMMU_INV_GRANU_PASID) &&
> > > >
> > > > granularity == IOMMU_INV_GRANU_ADDR? otherwise it's unclear
> > > > why IOMMU_INV_GRANU_DOMAIN also needs size check.
> > > >
> > > >> +				size && (inv_info->addr_info.addr &
> > > >> +				((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > > >> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> > > >> +					inv_info->addr_info.addr, size);
> > > >> +				ret = -ERANGE;
> > > >> +				goto out_unlock;
> > > >> +			}
> > > >> +
> > > >> +			qi_flush_piotlb(iommu, did, pasid,
> > > >> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> > > >> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > > >> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> > > >> +
> > > >> +			/*
> > > >> +			 * Always flush device IOTLB if ATS is enabled since guest
> > > >> +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be
> > > >> +			 * passed down.
> > > >> +			 */
> > > >
> > > > Does VT-d spec mention that no device IOTLB flush is required
> > > > when CM=1?
> > > >> +			if (info->ats_enabled) {
> > > >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > > >> +						pasid, info->ats_qdep,
> > > >> +						inv_info->addr_info.addr, size,
> > > >> +						granu);
> > > >> +			}
> > > >> +			break;
> > > >> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > > >> +			if (info->ats_enabled) {
> > > >> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> > > >> +						inv_info->addr_info.pasid, info->ats_qdep,
> > > >> +						inv_info->addr_info.addr, size,
> > > >> +						granu);
> > > >
> > > > I'm confused here. There are two granularities allowed for
> > > > devtlb, but here you only handle one of them?
> > > granu is the result of to_vtd_granularity() so it can take either
> > > of the 2 values.
> >
> > yes, you're right.
> >
> > >
> > > Thanks
> > >
> > > Eric
> > > >
> > > >> +			} else
> > > >> +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> > > >> +
> > > >> +			break;
> > > >> +		case IOMMU_CACHE_INV_TYPE_PASID:
> > > >> +			qi_flush_pasid_cache(iommu, did, granu,
> > > >> +					     inv_info->pasid_info.pasid);
> > > >> +
> > > >
> > > > as earlier comment, we shouldn't allow userspace or guest to
> > > > invalidate PASID cache
> > > >
> > > >> +			break;
> > > >> +		default:
> > > >> +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> > > >> +				cache_type);
> > > >> +			ret = -EINVAL;
> > > >> +		}
> > > >> +	}
> > > >> +out_unlock:
> > > >> +	spin_unlock(&iommu->lock);
> > > >> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > > >> +
> > > >> +	return ret;
> > > >> +}
> > > >> +#endif
> > > >> +
> > > >>  static int intel_iommu_map(struct iommu_domain *domain,
> > > >>  			   unsigned long iova, phys_addr_t hpa,
> > > >>  			   size_t size, int iommu_prot, gfp_t gfp)
> > > >> @@ -6204,6 +6385,7 @@ const struct iommu_ops intel_iommu_ops = {
> > > >>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> > > >>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > > >>  #ifdef CONFIG_INTEL_IOMMU_SVM
> > > >> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> > > >>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > > >>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > > >>  #endif
> > > >> --
> > > >> 2.7.4
> > > >
> >
> 
> [Jacob Pan]

Thanks
Kevin

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  6:24       ` Tian, Kevin
@ 2020-04-01  6:57         ` Liu, Yi L
  2020-04-01 16:03           ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Liu, Yi L @ 2020-04-01  6:57 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, April 1, 2020 2:24 PM
> To: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Subject: RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
> 
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Wednesday, April 1, 2020 2:14 AM
> >
> > On Sat, 28 Mar 2020 10:01:42 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Sent: Saturday, March 21, 2020 7:28 AM
> > > >
> > > > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > > vIOMMU, we need to provide invalidation support at IOMMU API and
> > > > driver level. This patch adds Intel VT-d specific function to
> > > > implement iommu passdown invalidate API for shared virtual address.
> > > >
> > > > The use case is for supporting caching structure invalidation
> > > > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > >
> > > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > > interface as well.
> > >
> > True, but it does not invalidate this statement about emulated IOMMU. I
> > will add another statement saying "the same interface can be used for
> > virtio-IOMMU as well". OK?
> 
> sure
> 
> >
> > > > invalidation capability and passes down all descriptors from the
> > > > guest to the physical IOMMU.
> > > >
> > > > The assumption is that guest to host device ID mapping should be
> > > > resolved prior to calling IOMMU driver. Based on the device handle,
> > > > host IOMMU driver can replace certain fields before submit to the
> > > > invalidation queue.
> > > >
> > > > ---
> > > > v7 review fixed in v10
> > > > ---
> > > >
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  drivers/iommu/intel-iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 182 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > > index b1477cd423dd..a76afb0fd51a 100644
> > > > --- a/drivers/iommu/intel-iommu.c
> > > > +++ b/drivers/iommu/intel-iommu.c
> > > > @@ -5619,6 +5619,187 @@ static void
> > > > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > > >  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > > >  }
> > > >
> > > > +/*
> > > > + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> > > > + * VT-d granularity. Invalidation is typically included in the unmap operation
> > > > + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > > > + * owns the first level page tables. Invalidations of translation caches in the
> > > > + * guest are trapped and passed down to the host.
> > > > + *
> > > > + * vIOMMU in the guest will only expose first level page tables, therefore
> > > > + * we do not include IOTLB granularity for request without PASID (second level).
> > >
> > > I would revise above as "We do not support IOTLB granularity for
> > > request without PASID (second level), therefore any vIOMMU
> > > implementation that exposes the SVA capability to the guest should
> > > only expose the first level page tables, implying all invalidation
> > > requests from the guest will include a valid PASID"
> > >
> > Sounds good.
> >
> > > > + *
> > > > + * For example, to find the VT-d granularity encoding for IOTLB
> > > > + * type and page selective granularity within PASID:
> > > > + * X: indexed by iommu cache type
> > > > + * Y: indexed by enum iommu_inv_granularity
> > > > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > > + *
> > > > + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> > > > + *
> > > > + */
> > > > +const static int
> > > > inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > > +	/*
> > > > +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > > > +	 * page selective (address granularity)
> > > > +	 */
> > > > +	{0, 1, 1},
> > > > +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > > > +	{1, 1, 0},
> > >
> > > Is this combination correct? when single PASID is being specified, it
> > > is essentially a page-selective invalidation since you need provide
> > > Address and Size.
> > >
> > This is for translation between generic UAPI granu to VT-d granu, it
> > has nothing to do with address and size.
> 
> Generic UAPI defines three granularities: domain, pasid and addr.
> From the definition, domain applies to all entries related to the did,
> pasid applies to all entries related to the pasid, while addr is
> specific to a range.
> 
> From what we just confirmed internally with the VT-d spec owner, our
> PASID based dev TLB invalidation always requires addr and size,
> while the current uAPI doesn't support range invalidation based on
> multiple PASIDs. It sounds to me that you want to use domain to cover
> the multiple-PASIDs case (G=1), but that changes the meaning of the
> domain granularity and easily leads to confusion.
>
> I feel Eric's proposal makes more sense. Here we'd better use {0, 0, 1}
> to indicate that only addr range invalidation is allowed, matching the
> spec definition. We may use a special flag in iommu_inv_addr_info
> to indicate the G=1 case, if necessary.

I agree. The G=1 case should be supported. I think we had a flag for
global, as there was a GL bit in p_iotlb_inv_dsc (a.k.a.
ext_iotlb_inv_dsc), but it was dropped when the 3.0 spec dropped the GL
bit. Let's add it back for the DevTLB flush case.
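For illustration, such a flag might look like this in the uAPI (a
hypothetical name and bit position; nothing like this exists in the
current header):

	/* hypothetical: request G=1 (all PASIDs) in a dev TLB range invalidation */
	#define IOMMU_INV_ADDR_FLAGS_GLOBAL	(1 << 3)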

> > e.g.
> > If user passes IOMMU_INV_GRANU_PASID for the single PASID case as you
> > mentioned, this map table shows it is valid.
> >
> > Then the lookup result will get VT-d granu:
> > QI_DEV_IOTLB_GRAN_PASID_SEL, which means G=0.
> >
> >
> > > > +	/* PASID cache */
> > >
> > > PASID cache is fully managed by the host. Guest PASID cache
> > > invalidation is interpreted by vIOMMU for bind and unbind operations.
> > > I don't think we should accept any PASID cache invalidation from
> > > userspace or guest.
> > >
> >
> > True for vIOMMU, this is here for completeness. Can be used by virtio
> > IOMMU, since PC flush is inclusive (IOTLB, devTLB), it is more
> > efficient.
> 
> I think it is conceptually incorrect. We should not allow the userspace or
> guest to request an operation which is beyond its privilege (just because
> doing so may bring some performance benefit). You can always introduce
> a new cmd for such a purpose.

I guess it was added for the pasid table binding case? Our platform
doesn't support that now, so I guess we can just mark it as unsupported
in the 2D table.
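i.e. under that proposal the PASID cache row of inv_type_granu_map
would simply become (sketch):

	/* PASID cache: managed by the host, not accepted from guest/userspace */
	{0, 0, 0}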

Regards,
Yi Liu

* RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  6:29           ` Tian, Kevin
@ 2020-04-01  7:13             ` Liu, Yi L
  2020-04-01  7:32               ` Auger Eric
  0 siblings, 1 reply; 67+ messages in thread
From: Liu, Yi L @ 2020-04-01  7:13 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, LKML, iommu,
	David Woodhouse, Jonathan Cameron

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, April 1, 2020 2:30 PM
> To: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Subject: RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
> 
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Wednesday, April 1, 2020 4:58 AM
> >
> > On Tue, 31 Mar 2020 02:49:21 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Auger Eric <eric.auger@redhat.com>
> > > > Sent: Sunday, March 29, 2020 11:34 PM
> > > >
> > > > Hi,
> > > >
> > > > On 3/28/20 11:01 AM, Tian, Kevin wrote:
> > > > >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > >> Sent: Saturday, March 21, 2020 7:28 AM
> > > > >>
> > > > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > > > >> and
> > > > driver
> > > > >> level. This patch adds Intel VT-d specific function to
> > > > >> implement iommu passdown invalidate API for shared virtual address.
> > > > >>
> > > > >> The use case is for supporting caching structure invalidation
> > > > >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> > >  [...]
> > >  [...]
> > > > >> + * VT-d granularity. Invalidation is typically included in the unmap operation
> > > > >> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> > > > >> + * owns the first level page tables. Invalidations of translation caches in the
> > >  [...]
> > >  [...]
> > >  [...]
> > > > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > > >> +	/*
> > > > >> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> > > > >> +	 * page selective (address granularity)
> > > > >> +	 */
> > > > >> +	{0, 1, 1},
> > > > >> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> > > > >> +	{1, 1, 0},
> > > > >
> > > > > Is this combination correct? when single PASID is being
> > > > > specified, it is essentially a page-selective invalidation since
> > > > > you need provide Address and Size.
> > > > Isn't it the same when G=1? Still the addr/size is used. Doesn't
> > > > it
> > >
> > > I thought addr/size is not used when G=1, but it might be wrong. I'm
> > > checking with our vt-d spec owner.
> > >
> >
> > > > correspond to IOMMU_INV_GRANU_ADDR with IOMMU_INV_ADDR_FLAGS_PASID
> > > > flag unset?
> > > >
> > > > so {0, 0, 1}?
> > >
> > I am not sure I got your logic. The three fields correspond to
> > 	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
> > 	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
> > 	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
> >
> > For devTLB, we use domain as global since there is no domain. Then I
> > came up with {1, 1, 0}, which means we could have global and pasid
> > granu invalidation for PASID based devTLB.
> >
> > If the caller also provide addr and S bit, the flush routine will put
> 
> "also" -> "must", because vt-d requires addr/size must be provided in
> devtlb
> descriptor, that is why Eric suggests {0, 0, 1}.

I think it should be {0, 0, 1} :-) The addr and S fields are mandatory;
the pasid field depends on the G bit.

I didn't read through all the comments, but here is a concern with this
2-D table: the iommu cache type is defined as below, and I suppose
there is a problem here. If I'm using IOMMU_CACHE_INV_TYPE_PASID, it
will index beyond the 2-D table.

/* IOMMU paging structure cache */
#define IOMMU_CACHE_INV_TYPE_IOTLB      (1 << 0) /* IOMMU IOTLB */
#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
#define IOMMU_CACHE_INV_TYPE_NR         (3)
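For reference, the loop in the patch iterates bit positions rather than
the flag values themselves, so the index passed to to_vtd_granularity()
stays within 0..2; the flag value is only reconstructed for the switch
(lines quoted from the patch; whether that fully settles the concern is
left to the thread):

	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
			 IOMMU_CACHE_INV_TYPE_NR) {
		/* cache_type is a bit position: 0, 1 or 2 */
		...
		switch (BIT(cache_type)) {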

> >
> > > I have one more open:
> > >
> > > How does userspace know which invalidation type/gran is supported?
> > > I didn't see such capability reporting in Yi's VFIO vSVA patch set.
> > > Do we want the user/kernel assume the same capability set if they
> > > are architectural? However the kernel could also do some
> > > optimization e.g. hide devtlb invalidation capability given that the
> > > kernel already invalidate devtlb automatically when serving iotlb
> > > invalidation...
> > >
> > In general, we are trending to use VFIO capability chain to expose
> > iommu capabilities.
> >
> > But for architectural features such as type/granu, we have to assume
> > the same capability between host & guest. Granu and types are not
> > enumerated on the host IOMMU either.
> >
> > For devTLB optimization, I agree we need to expose a capability to the
> > guest stating that implicit devtlb invalidation is supported.
> > Otherwise, a Linux guest running on other host OSes may not get
> > implicit devtlb invalidation.
> >
> > Right Yi?
> 
> Thanks for explanation. So we are assumed to support all operations
> defined in spec, so no need to expose them one-by-one. For optimization,
> I'm fine to do it later.

yes. :-)

Regards,
Yi Liu


* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  7:13             ` Liu, Yi L
@ 2020-04-01  7:32               ` Auger Eric
  2020-04-01 16:05                 ` Jacob Pan
  2020-04-02 15:54                 ` Jacob Pan
  0 siblings, 2 replies; 67+ messages in thread
From: Auger Eric @ 2020-04-01  7:32 UTC (permalink / raw)
  To: Liu, Yi L, Tian, Kevin, Jacob Pan
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

Hi,

On 4/1/20 9:13 AM, Liu, Yi L wrote:
>> From: Tian, Kevin <kevin.tian@intel.com>
>> Sent: Wednesday, April 1, 2020 2:30 PM
>> To: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Subject: RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
>>
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Sent: Wednesday, April 1, 2020 4:58 AM
>>>
>>> On Tue, 31 Mar 2020 02:49:21 +0000
>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>
>>>>> From: Auger Eric <eric.auger@redhat.com>
>>>>> Sent: Sunday, March 29, 2020 11:34 PM
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 3/28/20 11:01 AM, Tian, Kevin wrote:
>>>>>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>>>> Sent: Saturday, March 21, 2020 7:28 AM
>>>>>>>
>>>>>>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>>>>>>> vIOMMU, we need to provide invalidation support at IOMMU API
>>>>>>> and
>>>>> driver
>>>>>>> level. This patch adds Intel VT-d specific function to
>>>>>>> implement iommu passdown invalidate API for shared virtual address.
>>>>>>>
>>>>>>> The use case is for supporting caching structure invalidation
>>>>>>> of assigned SVM capable devices. Emulated IOMMU exposes queue
>>>>  [...]
>>>>  [...]
>>>>>>> + * VT-d granularity. Invalidation is typically included in the unmap operation
>>>>>>> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
>>>>>>> + * owns the first level page tables. Invalidations of translation caches in the
>>>>  [...]
>>>>  [...]
>>>>  [...]
>>>>>
>>>>>>> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
>>>>>>> +	/*
>>>>>>> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
>>>>>>> +	 * page selective (address granularity)
>>>>>>> +	 */
>>>>>>> +	{0, 1, 1},
>>>>>>> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
>>>>>>> +	{1, 1, 0},
>>>>>>
>>>>>> Is this combination correct? when single PASID is being
>>>>>> specified, it is essentially a page-selective invalidation since
>>>>>> you need provide Address and Size.
>>>>> Isn't it the same when G=1? Still the addr/size is used. Doesn't
>>>>> it
>>>>
>>>> I thought addr/size is not used when G=1, but it might be wrong. I'm
>>>> checking with our vt-d spec owner.
>>>>
>>>
>>>>> correspond to IOMMU_INV_GRANU_ADDR with IOMMU_INV_ADDR_FLAGS_PASID
>>>>> flag unset?
>>>>>
>>>>> so {0, 0, 1}?
>>>>
>>> I am not sure I got your logic. The three fields correspond to
>>> 	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
>>> 	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
>>> 	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
>>>
>>> For devTLB, we use domain as global since there is no domain. Then I
>>> came up with {1, 1, 0}, which means we could have global and pasid
>>> granu invalidation for PASID based devTLB.
>>>
>>> If the caller also provide addr and S bit, the flush routine will put
>>
>> "also" -> "must", because vt-d requires addr/size must be provided in
>> devtlb
>> descriptor, that is why Eric suggests {0, 0, 1}.
> 
> I think it should be {0, 0, 1} :-) addr field and S field are must, pasid
> field depends on G bit.

On my side, I understood from the spec that addr/S are always used
whatever the granularity, hence the above suggestion.

As a comparison, for PASID based IOTLB invalidation, it is clearly
stated that if G matches PASID selective invalidation, address field is
ignored. This is not written that way for PASID-based device TLB inv.
> 
> I didn’t read through all comments. Here is a concern with this 2-D table,
> the iommu cache type is defined as below. I suppose there is a problem here.
> If I'm using IOMMU_CACHE_INV_TYPE_PASID, it will beyond the 2-D table.
> 
> /* IOMMU paging structure cache */
> #define IOMMU_CACHE_INV_TYPE_IOTLB      (1 << 0) /* IOMMU IOTLB */
> #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
> #define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
> #define IOMMU_CACHE_INV_TYPE_NR         (3)
oops indeed

Thanks

Eric
> 
>>>
>>>> I have one more open:
>>>>
>>>> How does userspace know which invalidation type/gran is supported?
>>>> I didn't see such capability reporting in Yi's VFIO vSVA patch set.
>>>> Do we want the user/kernel assume the same capability set if they
>>>> are architectural? However the kernel could also do some
>>>> optimization e.g. hide devtlb invalidation capability given that the
>>>> kernel already invalidate devtlb automatically when serving iotlb
>>>> invalidation...
>>>>
>>> In general, we are trending to use VFIO capability chain to expose
>>> iommu capabilities.
>>>
>>> But for architectural features such as type/granu, we have to assume
>>> the same capability between host & guest. Granu and types are not
>>> enumerated on the host IOMMU either.
>>>
>>> For devTLB optimization, I agree we need to expose a capability to the
>>> guest stating that implicit devtlb invalidation is supported.
>>> Otherwise, a Linux guest running on other host OSes may not get
>>> implicit devtlb invalidation.
>>>
>>> Right Yi?
>>
>> Thanks for explanation. So we are assumed to support all operations
>> defined in spec, so no need to expose them one-by-one. For optimization,
>> I'm fine to do it later.
> 
> yes. :-)
> 
> Regards,
> Yi Liu
> 


* Re: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID
  2020-03-28 10:22   ` Tian, Kevin
@ 2020-04-01 15:47     ` Jacob Pan
  2020-04-02  2:18       ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jacob Pan @ 2020-04-01 15:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Sat, 28 Mar 2020 10:22:41 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > When VT-d driver runs in the guest, PASID allocation must be
> > performed via virtual command interface. This patch registers a
> > custom IOASID allocator which takes precedence over the default
> > XArray based allocator. The resulting IOASID allocation will always
> > come from the host. This ensures that PASID namespace is system-
> > wide.
> > 
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 84 +++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/intel-iommu.h |  2 ++
> >  2 files changed, 86 insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index a76afb0fd51a..c1c0b0fb93c3 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
> >  		if (ecap_prs(iommu->ecap))
> >  			intel_svm_finish_prq(iommu);
> >  	}
> > +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > +		ioasid_unregister_allocator(&iommu->pasid_allocator);
> > +
> >  #endif
> >  }
> > 
> > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct intel_iommu *iommu)
> >  	return ret;
> >  }
> > 
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
> 
> the name is too generic... can we add vcmd in the name to clarify
> its purpose, e.g. intel_vcmd_ioasid_alloc?
> 
I feel the intel_ prefix is a natural extension of a generic API; we do
that for other IOMMU APIs, right?

> > +{
> > +	struct intel_iommu *iommu = data;
> > +	ioasid_t ioasid;
> > +
> > +	if (!iommu)
> > +		return INVALID_IOASID;
> > +	/*
> > +	 * VT-d virtual command interface always uses the full 20 bit
> > +	 * PASID range. Host can partition guest PASID range based on
> > +	 * policies but it is out of guest's control.
> > +	 */
> > +	if (min < PASID_MIN || max > intel_pasid_max_id)
> > +		return INVALID_IOASID;
> > +
> > +	if (vcmd_alloc_pasid(iommu, &ioasid))
> > +		return INVALID_IOASID;
> > +
> > +	return ioasid;
> > +}
> > +
> > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > +{
> > +	struct intel_iommu *iommu = data;
> > +
> > +	if (!iommu)
> > +		return;
> > +	/*
> > +	 * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> > +	 * We can only free the PASID when all the devices are unbound.
> > +	 */
> > +	if (ioasid_find(NULL, ioasid, NULL)) {
> > +		pr_alert("Cannot free active IOASID %d\n", ioasid);
> > +		return;
> > +	}  
> 
> However the sanity check is not done in default_free. Is there a
> reason why using vcmd adds such  new requirement?
> 
Since we don't support nested guests, this vcmd allocator is only used
by the guest IOMMU driver, not VFIO. We expect the IOMMU driver to have
control of the free()/unbind() ordering.

For default_free, the call can come from user space or host VFIO, which
can be out of order. But we will solve that issue with a blocking
notifier.
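
As a rough sketch of that plan (the event code and function names here
are hypothetical; only the blocking notifier API itself is the real
kernel interface):

/* Notify consumers before a PASID is actually freed, so that
 * out-of-order free()/unbind() can be reconciled in the callbacks.
 */
static BLOCKING_NOTIFIER_HEAD(ioasid_notifier);

#define IOASID_NOTIFY_FREE	1	/* hypothetical event code */

static int ioasid_notify_free(ioasid_t ioasid)
{
	return blocking_notifier_call_chain(&ioasid_notifier,
					    IOASID_NOTIFY_FREE,
					    (void *)(unsigned long)ioasid);
}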

> > +	vcmd_free_pasid(iommu, ioasid);
> > +}
> > +
> > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > +{
> > +	/*
> > +	 * If we are running in the host, no need for custom
> > allocator
> > +	 * in that PASIDs are allocated from the host system-wide.
> > +	 */
> > +	if (!cap_caching_mode(iommu->cap))
> > +		return;  
> 
> is it more accurate to check against vcmd capability?
> 
I think this is sufficient. The spec says if vcmd is present we must
use it, but not the other way around.

> > +
> > +	if (!sm_supported(iommu)) {
> > +		pr_warn("VT-d Scalable Mode not enabled, no PASID
> > allocation\n");
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * Register a custom PASID allocator if we are running in
> > a guest,
> > +	 * guest PASID must be obtained via virtual command
> > interface.
> > +	 * There can be multiple vIOMMUs in each guest but only one
> > allocator
> > +	 * is active. All vIOMMU allocators will eventually be
> > calling the same  
> 
> which one? the first or last?
> 
All allocators share the same ops, so first = last. The IOASID code
inspects the ops functions and, if they are shared with an existing
allocator, uses the same ops.
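
Roughly, as a sketch of the matching logic (not the exact ioasid.c
code):

/*
 * Sketch: two registered allocators are treated as the same backend
 * when their callbacks match, so only one of them is ever active.
 */
static bool same_allocator(struct ioasid_allocator_ops *a,
			   struct ioasid_allocator_ops *b)
{
	return a->alloc == b->alloc && a->free == b->free;
}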

> > +	 * host allocator.
> > +	 */
> > +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> > +		pr_info("Register custom PASID allocator\n");
> > +		iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> > +		iommu->pasid_allocator.free = intel_ioasid_free;
> > +		iommu->pasid_allocator.pdata = (void *)iommu;
> > +		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
> > +			pr_warn("Custom PASID allocator failed,
> > scalable mode disabled\n");
> > +			/*
> > +			 * Disable scalable mode on this IOMMU if
> > there
> > +			 * is no custom allocator. Mixing SM
> > capable vIOMMU
> > +			 * and non-SM vIOMMU are not supported.
> > +			 */
> > +			intel_iommu_sm = 0;  
> 
> since you register an allocator for every vIOMMU, means previously
> registered allocators should also be unregistered here?
> 
True, but it is not necessary, for two reasons:
1. This should not happen unless something went seriously wrong.
All vIOMMUs share the same alloc/free functions, so they are put under
the same bucket by the IOASID code. So the case where the first vIOMMU
succeeds but a later vIOMMU registration fails should not happen,
unless the kernel runs out of memory, etc.

2. Once SM is disabled, there is no user of the IOASID allocator.

> > +		}
> > +	}
> > +}
> > +#endif
> > +
> >  static int __init init_dmars(void)
> >  {
> >  	struct dmar_drhd_unit *drhd;
> > @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
> >  	 */
> >  	for_each_active_iommu(iommu, drhd) {
> >  		iommu_flush_write_buffer(iommu);
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +		register_pasid_allocator(iommu);
> > +#endif
> >  		iommu_set_root_entry(iommu);
> >  		iommu->flush.flush_context(iommu, 0, 0, 0,
> > DMA_CCMD_GLOBAL_INVL);
> >  		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> > DMA_TLB_GLOBAL_FLUSH);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 9cbf5357138b..9c357a325c72
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -19,6 +19,7 @@
> >  #include <linux/iommu.h>
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/dmar.h>
> > +#include <linux/ioasid.h>
> > 
> >  #include <asm/cacheflush.h>
> >  #include <asm/iommu.h>
> > @@ -563,6 +564,7 @@ struct intel_iommu {
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  	struct page_req_dsc *prq;
> >  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> > +	struct ioasid_allocator_ops pasid_allocator; /* Custom
> > allocator for PASIDs */
> >  #endif
> >  	struct q_inval  *qi;            /* Queued invalidation
> > info */ u32 *iommu_state; /* Store iommu states between suspend and
> > resume.*/
> > --
> > 2.7.4  
> 

[Jacob Pan]

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  6:57         ` Liu, Yi L
@ 2020-04-01 16:03           ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-01 16:03 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jean-Philippe Brucker,
	LKML, iommu, David Woodhouse, Jonathan Cameron

On Wed, 1 Apr 2020 06:57:42 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Wednesday, April 1, 2020 2:24 PM
> > To: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Subject: RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate
> > function 
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Wednesday, April 1, 2020 2:14 AM
> > >
> > > On Sat, 28 Mar 2020 10:01:42 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >  
> > > > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Sent: Saturday, March 21, 2020 7:28 AM
> > > > >
> > > > > When Shared Virtual Address (SVA) is enabled for a guest OS
> > > > > via vIOMMU, we need to provide invalidation support at IOMMU
> > > > > API and driver level. This patch adds Intel VT-d specific
> > > > > function to implement iommu passdown invalidate API for
> > > > > shared virtual address.
> > > > >
> > > > > The use case is for supporting caching structure invalidation
> > > > > of assigned SVM capable devices. Emulated IOMMU exposes
> > > > > queue  
> > > >
> > > > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > > > interface as well.
> > > >  
> > > True, but it does not invalidate this statement about emulated
> > > IOMMU. I will add another statement saying "the same interface
> > > can be used for virtio-IOMMU as well". OK?  
> > 
> > sure
> >   
> > >  
> > > > > invalidation capability and passes down all descriptors from
> > > > > the guest to the physical IOMMU.
> > > > >
> > > > > The assumption is that guest to host device ID mapping should
> > > > > be resolved prior to calling IOMMU driver. Based on the
> > > > > device handle, host IOMMU driver can replace certain fields
> > > > > before submitting to the invalidation queue.
> > > > >
> > > > > ---
> > > > > v7 review fixed in v10
> > > > > ---
> > > > >
> > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > > > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > >  drivers/iommu/intel-iommu.c | 182
> > > > > ++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 182 insertions(+)
> > > > >
> > > > > diff --git a/drivers/iommu/intel-iommu.c
> > > > > b/drivers/iommu/intel-iommu.c index b1477cd423dd..a76afb0fd51a
> > > > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > > > +++ b/drivers/iommu/intel-iommu.c
> > > > > @@ -5619,6 +5619,187 @@ static void
> > > > > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > > > >  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * 2D array for converting and sanitizing IOMMU generic TLB
> > > > > granularity to
> > > > > + * VT-d granularity. Invalidation is typically included in
> > > > > the unmap operation
> > > > > + * as a result of DMA or VFIO unmap. However, for assigned
> > > > > devices guest
> > > > > + * owns the first level page tables. Invalidations of
> > > > > translation caches in the
> > > > > + * guest are trapped and passed down to the host.
> > > > > + *
> > > > > + * vIOMMU in the guest will only expose first level page
> > > > > tables, therefore
> > > > > + * we do not include IOTLB granularity for request without
> > > > > PASID (second level).  
> > > >
> > > > I would revise above as "We do not support IOTLB granularity for
> > > > request without PASID (second level), therefore any vIOMMU
> > > > implementation that exposes the SVA capability to the guest
> > > > should only expose the first level page tables, implying all
> > > > invalidation requests from the guest will include a valid PASID"
> > > >  
> > > Sounds good.
> > >  
> > > > > + *
> > > > > + * For example, to find the VT-d granularity encoding for
> > > > > IOTLB
> > > > > + * type and page selective granularity within PASID:
> > > > > + * X: indexed by iommu cache type
> > > > > + * Y: indexed by enum iommu_inv_granularity
> > > > > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > > > + *
> > > > > + * Granu_map array indicates validity of the table. 1:
> > > > > valid, 0: invalid
> > > > > + *
> > > > > + */
> > > > > +const static int
> > > > > inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > > > +	/*
> > > > > +	 * PASID based IOTLB invalidation: PASID selective
> > > > > (per PASID),
> > > > > +	 * page selective (address granularity)
> > > > > +	 */
> > > > > +	{0, 1, 1},
> > > > > +	/* PASID based dev TLBs, only support all PASIDs or
> > > > > single PASID */
> > > > > +	{1, 1, 0},  
> > > >
> > > > Is this combination correct? When a single PASID is being
> > > > specified, it is essentially a page-selective invalidation,
> > > > since you need to provide Address and Size.
> > > >  
> > > This is for translation from generic UAPI granu to VT-d granu;
> > > it has nothing to do with address and size.
> > 
> > Generic UAPI defines three granularities: domain, pasid and addr.
> > from the definition domain applies all entries related to did, pasid
> > applies to all entries related to pasid, while addr is specific for
> > a range.
> > 
> > from what we just confirmed internally with VT-d spec owner, our
> > PASID based dev TLB invalidation always requires addr and size,
> > while the current uAPI doesn't support multiple-PASID based range
> > invalidation. It sounds to me that you want to use domain to replace
> > the multiple-PASIDs case (G=1), but that then changes the meaning of
> > the domain granularity and easily leads to confusion.
> >
> > I feel Eric's proposal makes more sense. Here we'd better use {0,
> > 0, 1} to indicate only addr range invalidation is allowed, matching
> > the spec definition. We may use a special flag in
> > iommu_inv_addr_info to indicate G=1 case, if necessary.  
> 
> I agree. G=1 case should be supported. I think we had a flag for
> global as there is GL bit in p_iotlb_inv_dsc (a.k.a
> ext_iotlb_inv_dsc), but it was dropped as 3.0 spec dropped GL bit.
> Let's add it back for the DevTLB flush case.
> 
Makes sense. I will change that to:

--- a/drivers/iommu/intel-iommu.c                                                                      
+++ b/drivers/iommu/intel-iommu.c                                                                      
@@ -5741,7 +5741,7 @@ const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] 
         */                                                                                            
        {0, 1, 1},                                                                                     
        /* PASID based dev TLBs, only support all PASIDs or single PASID */                            
-       {1, 1, 0},                                                                                     
+       {0, 0, 1},                                                                                     
        /* PASID cache */                                                                              
        {1, 1, 0}                                                                                      
 };                                                                                                    
@@ -5750,7 +5750,7 @@ const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_N 
        /* PASID based IOTLB */                                                                        
        {0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},                                                    
        /* PASID based dev TLBs */                                                                     
-       {QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},                                       
+       {0, 0, QI_DEV_IOTLB_GRAN_PASID_SEL},                                                           
        /* PASID cache */                                                                              
        {QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},                                                        
 };                                                                                                    
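
With the updated tables, a guest page-selective devTLB flush resolves
like this (illustrative only; the row index is the bit position of the
cache type, the column index is the generic granularity):

	/* row 1: DEV_IOTLB (bit 1), column 2: IOMMU_INV_GRANU_ADDR */
	inv_type_granu_map[1][2];	/* 1: combination is valid */
	inv_type_granu_table[1][2];	/* QI_DEV_IOTLB_GRAN_PASID_SEL */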

> > > e.g.
> > > If user passes IOMMU_INV_GRANU_PASID for the single PASID case as
> > > you mentioned, this map table shows it is valid.
> > >
> > > Then the lookup result will get VT-d granu:
> > > QI_DEV_IOTLB_GRAN_PASID_SEL, which means G=0.
> > >
> > >  
> > > > > +	/* PASID cache */  
> > > >
> > > > PASID cache is fully managed by the host. Guest PASID cache
> > > > invalidation is interpreted by vIOMMU for bind and unbind
> > > > operations. I don't think we should accept any PASID cache
> > > > invalidation from userspace or guest.
> > > >  
> > >
> > > True for vIOMMU; this is here for completeness. It can be used by
> > > virtio-IOMMU, and since a PASID cache flush is inclusive (IOTLB,
> > > devTLB), it is more efficient.
> > 
> > I think it is not correct in concept. We should not allow the
> > userspace or guest to request an operation which is beyond its
> > privilege (just because doing so may bring some performance
> > benefit). You can always introduce new cmd for such purpose.  
> 
> I guess it was added for the PASID table binding case? Our platform
> doesn't support it now, so I guess we can just mark it as unsupported
> in the 2D table.
Sounds good.
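
Presumably the PASID cache rows in both tables then become all-invalid,
something like:

-	/* PASID cache */
-	{1, 1, 0}
+	/* PASID cache: not accepted from userspace or guest */
+	{0, 0, 0}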

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  7:32               ` Auger Eric
@ 2020-04-01 16:05                 ` Jacob Pan
  2020-04-02 15:54                 ` Jacob Pan
  1 sibling, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-01 16:05 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jean-Philippe Brucker,
	LKML, iommu, David Woodhouse, Jonathan Cameron

On Wed, 1 Apr 2020 09:32:37 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> >> devtlb
> >> descriptor, that is why Eric suggests {0, 0, 1}.  
> > 
> > I think it should be {0, 0, 1} :-) The addr field and S field are a
> > must; the pasid field depends on the G bit.
> 
> On my side, I understood from the spec that addr/S are always used
> whatever the granularity, hence the above suggestion.
> 
> As a comparison, for PASID based IOTLB invalidation, it is clearly
> stated that if G matches PASID selective invalidation, address field
> is ignored. This is not written that way for PASID-based device TLB
> inv.
> > 
I misread the S bit. It all makes sense now. Thanks for the proposal
and explanation.
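
For the record, the size-selective encoding folds the range size into
the low address bits. A sketch of the usual pattern (mirroring what
qi_flush_dev_iotlb() does; 4K VTD_PAGE_SHIFT assumed):

	/* Encode a 2^order page range: set all address bits below the
	 * size boundary; order == 0 means a single 4K page (S = 0).
	 */
	if (order)
		addr |= (1ULL << (VTD_PAGE_SHIFT + order - 1)) - 1;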

Jacob

* Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support
  2020-03-31  3:43       ` Tian, Kevin
@ 2020-04-01 17:13         ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-01 17:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Tue, 31 Mar 2020 03:43:39 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > > >  struct intel_svm_dev {
> > > > @@ -698,9 +700,13 @@ struct intel_svm_dev {
> > > >  struct intel_svm {
> > > >  	struct mmu_notifier notifier;
> > > >  	struct mm_struct *mm;
> > > > +
> > > >  	struct intel_iommu *iommu;
> > > >  	int flags;
> > > >  	int pasid;
> > > > +	int gpasid; /* Guest PASID in case of vSVA bind with
> > > > non-identity host
> > > > +		     * to guest PASID mapping.
> > > > +		     */  
> > >
> > > we don't need to highlight identity or non-identity thing, since
> > > either way shares the same infrastructure here and it is not the
> > > knowledge that the kernel driver should assume
> > >  
> > Sorry, I don't get your point.
> > 
> > What I meant was that this field "gpasid" is only used for
> > non-identity case. For identity case, we don't have
> > SVM_FLAG_GUEST_PASID.  
> 
> What's the problem if a guest tries to set gpasid even in the identity
> case? Do you want to add a check to reject it? Also, I remember we
> discussed before that we want to provide a consistent interface
> to other consumers, e.g. KVM, to set up the VMCS PASID translation
> table. In that case, regardless of identity or non-identity, we need
> to provide such mapping info.
The solution is still somewhat in flux. For KVM to set up the VMCS, we
are planning to use the IOASID set private ID as the guest PASID. So
this part of the code will go away, i.e. the G-H PASID mapping will no
longer be stored in the IOMMU driver. Perhaps we can address this after
the transition?

* Re: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function
  2020-03-29 11:35   ` Auger Eric
@ 2020-04-01 20:06     ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-01 20:06 UTC (permalink / raw)
  To: Auger Eric
  Cc: Yi L, Tian, Kevin, Raj Ashok, Jean-Philippe Brucker, iommu, LKML,
	Alex Williamson, David Woodhouse, Jonathan Cameron

On Sun, 29 Mar 2020 13:35:15 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> > With PASID granular translation type set to 0x11b, translation
> > result from the first level(FL) also subject to a second level(SL)
> > page table translation. This mode is used for SVA virtualization,
> > where FL performs guest virtual to guest physical translation and
> > SL performs guest physical to host physical translation.
> > 
> > This patch adds a helper function for setting up nested translation
> > where second level comes from a domain and first level comes from
> > a guest PGD.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-pasid.c | 240
> > +++++++++++++++++++++++++++++++++++++++++++-
> > drivers/iommu/intel-pasid.h |  12 +++ include/linux/intel-iommu.h
> > |   3 + 3 files changed, 252 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/iommu/intel-pasid.c
> > b/drivers/iommu/intel-pasid.c index 9bdb7ee228b6..10c7856afc6b
> > 100644 --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64
> > value) pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> >  }
> >  
> > +/*
> > + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_emt(struct pasid_entry *pe, u64 value)
> > +{
> > +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value <<
> > 27); +}
> > +
> > +/*
> > + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_pat(struct pasid_entry *pe, u64 value)
> > +{
> > +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value <<
> > 32); +}
> > +
> > +/*
> > + * Setup the Cache Disable (CD) field (Bit 89)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_cd(struct pasid_entry *pe)
> > +{
> > +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> > +}
> > +
> > +/*
> > + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_emte(struct pasid_entry *pe)
> > +{
> > +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> > +}
> > +
> > +/*
> > + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_eafe(struct pasid_entry *pe)
> > +{
> > +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> > +}
> > +
> > +/*
> > + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_pcd(struct pasid_entry *pe)
> > +{
> > +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> > +}
> > +
> > +/*
> > + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> > + * of a scalable mode PASID entry.
> > + */
> > +static inline void
> > +pasid_set_pwt(struct pasid_entry *pe)
> > +{
> > +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> > +}
> > +
> >  static void
> >  pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> >  				    u16 did, int pasid)
> > @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct
> > intel_iommu *iommu,
> > pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); 
> >  	/* Setup Present and PASID Granular Transfer Type: */
> > -	pasid_set_translation_type(pte, 1);
> > +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
> >  	pasid_set_present(pte);
> >  	pasid_flush_caches(iommu, pte, pasid, did);
> >  
> > @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
> > intel_iommu *iommu, pasid_set_domain_id(pte, did);
> >  	pasid_set_slptr(pte, pgd_val);
> >  	pasid_set_address_width(pte, agaw);
> > -	pasid_set_translation_type(pte, 2);
> > +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
> >  	pasid_set_fault_enable(pte);
> >  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >  
> > @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
> > intel_iommu *iommu, pasid_clear_entry(pte);
> >  	pasid_set_domain_id(pte, did);
> >  	pasid_set_address_width(pte, iommu->agaw);
> > -	pasid_set_translation_type(pte, 4);
> > +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
> >  	pasid_set_fault_enable(pte);
> >  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));  
> 
> All above looks good to me
> >  
> > @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct
> > intel_iommu *iommu, 
> >  	return 0;
> >  }
> > +
> > +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> > +				struct pasid_entry *pte,
> > +				struct iommu_gpasid_bind_data_vtd
> > *pasid_data) +{
> > +	/*
> > +	 * Not all guest PASID table entry fields are passed down
> > during bind,
> > +	 * here we only set up the ones that are dependent on
> > guest settings.
> > +	 * Execution related bits such as NXE, SMEP are not
> > meaningful to IOMMU,
> > +	 * therefore not set. Other fields, such as snoop related,
> > are set based
> > +	 * on host needs regardless of guest settings.
> > +	 */
> > +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> > +		if (!ecap_srs(iommu->ecap)) {
> > +			pr_err("No supervisor request support on
> > %s\n",
> > +			       iommu->name);
> > +			return -EINVAL;
> > +		}
> > +		pasid_set_sre(pte);
> > +	}
> > +
> > +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> > +		if (!ecap_eafs(iommu->ecap)) {
> > +			pr_err("No extended access flag support on
> > %s\n",
> > +				iommu->name);
> > +			return -EINVAL;
> > +		}
> > +		pasid_set_eafe(pte);
> > +	}
> > +
> > +	/*
> > +	 * Memory type is only applicable to devices inside
> > processor coherent
> > +	 * domain. PCIe devices are not included. We can skip the
> > rest of the
> > +	 * flags if IOMMU does not support MTS.
> > +	 */  
> nit:
> 	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
> 		return 0;
> 
> 	if (!ecap_mts(iommu->ecap)) {
> 		pr_err("No memory type support for bind guest PASID
> on %s\n", iommu->name);
> 		return -EINVAL;
> 	}
> 
> 	settings ../..
Agreed, that flows better. Will change to:
	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
		return 0;

	if (!ecap_mts(iommu->ecap)) {
		pr_err("No memory type support for bind guest PASID on %s\n", iommu->name);
		return -EINVAL;
	}

	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
		pasid_set_emte(pte);
		pasid_set_emt(pte, pasid_data->emt);
	}
	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
		pasid_set_pcd(pte);
	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
		pasid_set_pwt(pte);
	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
		pasid_set_cd(pte);
	pasid_set_pat(pte, pasid_data->pat);


> > +	if (ecap_mts(iommu->ecap)) {
> > +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE)
> > {
> > +			pasid_set_emte(pte);
> > +			pasid_set_emt(pte, pasid_data->emt);
> > +		}
> > +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> > +			pasid_set_pcd(pte);
> > +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> > +			pasid_set_pwt(pte);
> > +		if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> > +			pasid_set_cd(pte);
> > +		pasid_set_pat(pte, pasid_data->pat);
> > +	} else if (pasid_data->flags &
> > IOMMU_SVA_VTD_GPASID_MTS_MASK) {
> > +		pr_err("No memory type support for bind guest
> > PASID on %s\n",
> > +			iommu->name);
> > +		return -EINVAL;
> > +	}
> > +
> > +	return 0;
> > +
> > +}
> > +
> > +/**
> > + * intel_pasid_setup_nested() - Set up PASID entry for nested
> > translation.
> > + * This could be used for guest shared virtual address. In this
> > case, the
> > + * first level page tables are used for GVA-GPA translation in the
> > guest,
> > + * second level page tables are used for GPA-HPA translation.
> > + *
> > + * @iommu:      IOMMU which the device belong to
> > + * @dev:        Device to be set up for translation
> > + * @gpgd:       FLPTPTR: First Level Page translation pointer in
> > GPA
> > + * @pasid:      PASID to be programmed in the device PASID table
> > + * @pasid_data: Additional PASID info from the guest bind request
> > + * @domain:     Domain info for setting up second level page tables
> > + * @addr_width: Address width of the first level (guest)
> > + */
> > +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > +			struct device *dev, pgd_t *gpgd,
> > +			int pasid, struct
> > iommu_gpasid_bind_data_vtd *pasid_data,
> > +			struct dmar_domain *domain,
> > +			int addr_width)
> > +{
> > +	struct pasid_entry *pte;
> > +	struct dma_pte *pgd;
> > +	int ret = 0;
> > +	u64 pgd_val;
> > +	int agaw;
> > +	u16 did;
> > +
> > +	if (!ecap_nest(iommu->ecap)) {
> > +		pr_err("IOMMU: %s: No nested translation
> > support\n",
> > +		       iommu->name);
> > +		return -EINVAL;
> > +	}  
> I am surprised you don't check that the dmar_domain has the
> DOMAIN_FLAG_NESTED_MODE flag (or did I miss it?). Don't you have any
> risk that userspace overwrites the PTE of a device attached to a usual
> domain, i.e. one fully handled by the host?
Good point, this domain attribute was recently added to VT-d. So I will
add the check and move the flags to intel-iommu.h:
	if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
		pr_err("Domain is not in nesting mode, %x\n", domain->flags);
		return -EINVAL;
	}



> > +
> > +	pte = intel_pasid_get_entry(dev, pasid);
> > +	if (WARN_ON(!pte))
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Caller must ensure PASID entry is not in use, i.e. not
> > bind the
> > +	 * same PASID to the same device twice.
> > +	 */
> > +	if (pasid_pte_is_present(pte))
> > +		return -EBUSY;  
> Here you check that the PTE is not valid; is that sufficient to
> guarantee the above?
The caller has already checked this condition; here it is just
additional sanity checking.

> Also referring to the potential race issue pointed out by Kevin.
Will add spin_lock(&iommu->lock) around the nested setup.
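
i.e. something like this sketch inside intel_pasid_setup_nested():

	spin_lock(&iommu->lock);
	pte = intel_pasid_get_entry(dev, pasid);
	if (WARN_ON(!pte)) {
		spin_unlock(&iommu->lock);
		return -EINVAL;
	}
	if (pasid_pte_is_present(pte)) {
		spin_unlock(&iommu->lock);
		return -EBUSY;
	}
	/* ... program and flush the entry under the lock ... */
	spin_unlock(&iommu->lock);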


> > +
> > +	pasid_clear_entry(pte);
> > +
> > +	/* Sanity checking performed by caller to make sure address
> > +	 * width matching in two dimensions:
> > +	 * 1. CPU vs. IOMMU
> > +	 * 2. Guest vs. Host.
> > +	 */
> > +	switch (addr_width) {
> > +	case ADDR_WIDTH_5LEVEL:
> > +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> > +			cap_5lp_support(iommu->cap)) {
> > +			pasid_set_flpm(pte, 1);
> > +		} else {
> > +			dev_err(dev, "5-level paging not
> > supported\n");
> > +			return -EINVAL;
> > +		}
> > +		break;
> > +	case ADDR_WIDTH_4LEVEL:
> > +		pasid_set_flpm(pte, 0);
> > +		break;
> > +	default:
> > +		dev_err(dev, "Invalid guest address width %d\n",
> > addr_width);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* First level PGD is in GPA, must be supported by the
> > second level */
> > +	if ((u64)gpgd > domain->max_addr) {
> > +		dev_err(dev, "Guest PGD %llx not supported, max
> > %llx\n",
> > +			(u64)gpgd, domain->max_addr);
> > +		return -EINVAL;
> > +	}
> > +	pasid_set_flptr(pte, (u64)gpgd);
> > +
> > +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> > +	if (ret) {
> > +		dev_err(dev, "Guest PASID bind data not
> > supported\n");  
> Shall we output all those traces without limit? They are triggered by
> userspace, meaning the latter can trigger a storm of them.
Good point, I will change all guest-invoked calls to the _ratelimited
versions.
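
e.g. (illustrative, reusing the existing message text):

-		dev_err(dev, "Guest PASID bind data not supported\n");
+		dev_err_ratelimited(dev, "Guest PASID bind data not supported\n");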

Thanks!

> > +		return ret;
> > +	}
> > +
> > +	/* Setup the second level based on the given domain */
> > +	pgd = domain->pgd;
> > +
> > +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> > +	if (agaw < 0) {
> > +		dev_err(dev, "Invalid domain page table\n");
> > +		return -EINVAL;
> > +	}
> > +	pgd_val = virt_to_phys(pgd);
> > +	pasid_set_slptr(pte, pgd_val);
> > +	pasid_set_fault_enable(pte);
> > +
> > +	did = domain->iommu_did[iommu->seq_id];
> > +	pasid_set_domain_id(pte, did);
> > +
> > +	pasid_set_address_width(pte, agaw);
> > +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > +
> > +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> > +	pasid_set_present(pte);
> > +	pasid_flush_caches(iommu, pte, pasid, did);
> > +
> > +	return ret;
> > +}
> > diff --git a/drivers/iommu/intel-pasid.h
> > b/drivers/iommu/intel-pasid.h index 92de6df24ccb..698015ee3f04
> > 100644 --- a/drivers/iommu/intel-pasid.h
> > +++ b/drivers/iommu/intel-pasid.h
> > @@ -36,6 +36,7 @@
> >   * to vmalloc or even module mappings.
> >   */
> >  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> > +#define PASID_FLAG_NESTED		BIT(1)
> >  
> >  /*
> >   * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for
> > first- @@ -51,6 +52,11 @@ struct pasid_entry {
> >  	u64 val[8];
> >  };
> >  
> > +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> > +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> > +#define PASID_ENTRY_PGTT_NESTED		(3)
> > +#define PASID_ENTRY_PGTT_PT		(4)
> > +
> >  /* The representative of a PASID table */
> >  struct pasid_table {
> >  	void			*table;		/*
> > pasid table pointer */ @@ -99,6 +105,12 @@ int
> > intel_pasid_setup_second_level(struct intel_iommu *iommu, int
> > intel_pasid_setup_pass_through(struct intel_iommu *iommu, struct
> > dmar_domain *domain, struct device *dev, int pasid);
> > +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > +			struct device *dev, pgd_t *pgd,
> > +			int pasid,
> > +			struct iommu_gpasid_bind_data_vtd
> > *pasid_data,
> > +			struct dmar_domain *domain,
> > +			int addr_width);
> >  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> >  				 struct device *dev, int pasid);
> >  
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index ed7171d2ae1f..eda1d6687144
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -42,6 +42,9 @@
> >  #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
> >  #define DMA_FL_PTE_XD		BIT_ULL(63)
> >  
> > +#define ADDR_WIDTH_5LEVEL	(57)
> > +#define ADDR_WIDTH_4LEVEL	(48)
> > +
> >  #define CONTEXT_TT_MULTI_LEVEL	0
> >  #define CONTEXT_TT_DEV_IOTLB	1
> >  #define CONTEXT_TT_PASS_THROUGH 2
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]

* RE: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID
  2020-04-01 15:47     ` Jacob Pan
@ 2020-04-02  2:18       ` Tian, Kevin
  2020-04-02 20:28         ` Jacob Pan
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2020-04-02  2:18 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Wednesday, April 1, 2020 11:48 PM
> 
> On Sat, 28 Mar 2020 10:22:41 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When VT-d driver runs in the guest, PASID allocation must be
> > > performed via virtual command interface. This patch registers a
> > > custom IOASID allocator which takes precedence over the default
> > > XArray based allocator. The resulting IOASID allocation will always
> > > come from the host. This ensures that PASID namespace is system-
> > > wide.
> > >
> > > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/iommu/intel-iommu.c | 84
> > > +++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/intel-iommu.h |  2 ++
> > >  2 files changed, 86 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index a76afb0fd51a..c1c0b0fb93c3
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct
> intel_iommu
> > > *iommu)
> > >  		if (ecap_prs(iommu->ecap))
> > >  			intel_svm_finish_prq(iommu);
> > >  	}
> > > +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > > +		ioasid_unregister_allocator(&iommu->pasid_allocator);
> > > +
> > >  #endif
> > >  }
> > >
> > > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> > > intel_iommu *iommu)
> > >  	return ret;
> > >  }
> > >
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > > void *data)
> >
> > the name is too generic... can we add vcmd in the name to clarify
> > its purpose, e.g. intel_vcmd_ioasid_alloc?
> >
> I feel the intel_ prefix is a natural extension of a generic API; we
> do that for other IOMMU APIs, right?

Other IOMMU APIs have no difference between host and guest, but this
one only applies to a guest with the vcmd interface.

> 
> > > +{
> > > +	struct intel_iommu *iommu = data;
> > > +	ioasid_t ioasid;
> > > +
> > > +	if (!iommu)
> > > +		return INVALID_IOASID;
> > > +	/*
> > > +	 * VT-d virtual command interface always uses the full 20
> > > bit
> > > +	 * PASID range. Host can partition guest PASID range based
> > > on
> > > +	 * policies but it is out of guest's control.
> > > +	 */
> > > +	if (min < PASID_MIN || max > intel_pasid_max_id)
> > > +		return INVALID_IOASID;
> > > +
> > > +	if (vcmd_alloc_pasid(iommu, &ioasid))
> > > +		return INVALID_IOASID;
> > > +
> > > +	return ioasid;
> > > +}
> > > +
> > > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > > +{
> > > +	struct intel_iommu *iommu = data;
> > > +
> > > +	if (!iommu)
> > > +		return;
> > > +	/*
> > > +	 * Sanity check the ioasid owner is done at upper layer,
> > > e.g. VFIO
> > > +	 * We can only free the PASID when all the devices are
> > > unbound.
> > > +	 */
> > > +	if (ioasid_find(NULL, ioasid, NULL)) {
> > > +		pr_alert("Cannot free active IOASID %d\n", ioasid);
> > > +		return;
> > > +	}
> >
> > However the sanity check is not done in default_free. Is there a
> > reason why using vcmd adds such  new requirement?
> >
> Since we don't support nested guests, this vcmd allocator is only used
> by the guest IOMMU driver, not VFIO. We expect the IOMMU driver to have
> control of the free()/unbind() ordering.
> 
> For default_free, the call can come from user space or host VFIO, which
> can be out of order. But we will solve that issue with a blocking
> notifier.
> 
> > > +	vcmd_free_pasid(iommu, ioasid);
> > > +}
> > > +
> > > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > > +{
> > > +	/*
> > > +	 * If we are running in the host, no need for custom
> > > allocator
> > > +	 * in that PASIDs are allocated from the host system-wide.
> > > +	 */
> > > +	if (!cap_caching_mode(iommu->cap))
> > > +		return;
> >
> > is it more accurate to check against vcmd capability?
> >
> I think this is sufficient. The spec says if vcmd is present we must
> use it, but not the other way around.

No, what about a vIOMMU implementation that reports CM but not
VCMD? I didn't get the rationale for checking an indirect capability
when there is already one well defined for the purpose.

> 
> > > +
> > > +	if (!sm_supported(iommu)) {
> > > +		pr_warn("VT-d Scalable Mode not enabled, no PASID
> > > allocation\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Register a custom PASID allocator if we are running in
> > > a guest,
> > > +	 * guest PASID must be obtained via virtual command
> > > interface.
> > > +	 * There can be multiple vIOMMUs in each guest but only one
> > > allocator
> > > +	 * is active. All vIOMMU allocators will eventually be
> > > calling the same
> >
> > which one? the first or last?
> >
> All allocators share the same ops, so first = last. The IOASID code
> inspects the ops functions and, if they are shared with an existing
> allocator, uses the same ops.

ok, got you.

> 
> > > +	 * host allocator.
> > > +	 */
> > > +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> > > +		pr_info("Register custom PASID allocator\n");
> > > +		iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> > > +		iommu->pasid_allocator.free = intel_ioasid_free;
> > > +		iommu->pasid_allocator.pdata = (void *)iommu;
> > > +		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
> > > +			pr_warn("Custom PASID allocator failed,
> > > scalable mode disabled\n");
> > > +			/*
> > > +			 * Disable scalable mode on this IOMMU if
> > > there
> > > +			 * is no custom allocator. Mixing SM
> > > capable vIOMMU
> > > +			 * and non-SM vIOMMU are not supported.
> > > +			 */
> > > +			intel_iommu_sm = 0;
> >
> > since you register an allocator for every vIOMMU, means previously
> > registered allocators should also be unregistered here?
> >
> True, but it is not necessary, for two reasons:
> 1. This should not happen unless something went seriously wrong.
> All vIOMMUs share the same alloc/free functions, so they are put under
> the same bucket by the IOASID code. So the case where the first vIOMMU
> succeeds but a later vIOMMU registration fails should not happen,
> unless the kernel runs out of memory, etc.
> 
> 2. Once SM is disabled, there is no user of the IOASID allocator.
> 
> > > +		}
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >  static int __init init_dmars(void)
> > >  {
> > >  	struct dmar_drhd_unit *drhd;
> > > @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
> > >  	 */
> > >  	for_each_active_iommu(iommu, drhd) {
> > >  		iommu_flush_write_buffer(iommu);
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > +		register_pasid_allocator(iommu);
> > > +#endif
> > >  		iommu_set_root_entry(iommu);
> > >  		iommu->flush.flush_context(iommu, 0, 0, 0,
> > > DMA_CCMD_GLOBAL_INVL);
> > >  		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> > > DMA_TLB_GLOBAL_FLUSH);
> > > diff --git a/include/linux/intel-iommu.h
> > > b/include/linux/intel-iommu.h index 9cbf5357138b..9c357a325c72
> > > 100644 --- a/include/linux/intel-iommu.h
> > > +++ b/include/linux/intel-iommu.h
> > > @@ -19,6 +19,7 @@
> > >  #include <linux/iommu.h>
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > >  #include <linux/dmar.h>
> > > +#include <linux/ioasid.h>
> > >
> > >  #include <asm/cacheflush.h>
> > >  #include <asm/iommu.h>
> > > @@ -563,6 +564,7 @@ struct intel_iommu {
> > >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > >  	struct page_req_dsc *prq;
> > >  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> > > +	struct ioasid_allocator_ops pasid_allocator; /* Custom
> > > allocator for PASIDs */
> > >  #endif
> > >  	struct q_inval  *qi;            /* Queued invalidation
> > > info */ u32 *iommu_state; /* Store iommu states between suspend and
> > > resume.*/
> > > --
> > > 2.7.4
> >
> 
> [Jacob Pan]

* Re: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function
  2020-04-01  7:32               ` Auger Eric
  2020-04-01 16:05                 ` Jacob Pan
@ 2020-04-02 15:54                 ` Jacob Pan
  1 sibling, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-02 15:54 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jean-Philippe Brucker,
	LKML, iommu, David Woodhouse, Jonathan Cameron

On Wed, 1 Apr 2020 09:32:37 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> > I didn't read through all the comments. Here is a concern with this
> > 2-D table: the iommu cache type is defined as below, and I suppose
> > there is a problem here. If I'm using IOMMU_CACHE_INV_TYPE_PASID, it
> > will go beyond the 2-D table.
> > 
> > /* IOMMU paging structure cache */
> > #define IOMMU_CACHE_INV_TYPE_IOTLB      (1 << 0) /* IOMMU IOTLB */
> > #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
> > #define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
> > #define IOMMU_CACHE_INV_TYPE_NR         (3)  
> oups indeed

I think it is not an issue, since we use the bit position, not the raw
cache type, as the index into the 2D array. Right?

for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache,
		 IOMMU_CACHE_INV_TYPE_NR) {

	ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);

static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
{

	*vtd_granu = inv_type_granu_table[type][granu];

* Re: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID
  2020-04-02  2:18       ` Tian, Kevin
@ 2020-04-02 20:28         ` Jacob Pan
  0 siblings, 0 replies; 67+ messages in thread
From: Jacob Pan @ 2020-04-02 20:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jean-Philippe Brucker, iommu, LKML, Alex Williamson,
	David Woodhouse, Jonathan Cameron

On Thu, 2 Apr 2020 02:18:45 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Wednesday, April 1, 2020 11:48 PM
> > 
> > On Sat, 28 Mar 2020 10:22:41 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Sent: Saturday, March 21, 2020 7:28 AM
> > > >
> > > > When VT-d driver runs in the guest, PASID allocation must be
> > > > performed via virtual command interface. This patch registers a
> > > > custom IOASID allocator which takes precedence over the default
> > > > XArray based allocator. The resulting IOASID allocation will
> > > > always come from the host. This ensures that PASID namespace is
> > > > system- wide.
> > > >
> > > > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > >  drivers/iommu/intel-iommu.c | 84
> > > > +++++++++++++++++++++++++++++++++++++++++++++
> > > >  include/linux/intel-iommu.h |  2 ++
> > > >  2 files changed, 86 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/intel-iommu.c
> > > > b/drivers/iommu/intel-iommu.c index a76afb0fd51a..c1c0b0fb93c3
> > > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > > +++ b/drivers/iommu/intel-iommu.c
> > > > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct  
> > intel_iommu  
> > > > *iommu)
> > > >  		if (ecap_prs(iommu->ecap))
> > > >  			intel_svm_finish_prq(iommu);
> > > >  	}
> > > > +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > > > +		ioasid_unregister_allocator(&iommu->pasid_allocator);
> > > > +
> > > >  #endif
> > > >  }
> > > >
> > > > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> > > > intel_iommu *iommu)
> > > >  	return ret;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > > > void *data)  
> > >
> > > the name is too generic... can we add vcmd in the name to clarify
> > > its purpose, e.g. intel_vcmd_ioasid_alloc?
> > >  
> > I feel the intel_ prefix is a natural extension of a generic API;
> > we do that for other IOMMU APIs, right?
> 
> Other IOMMU APIs have no difference between host and guest, but this
> one only applies to a guest with the vcmd interface.
> 
OK, sounds good. It is more explicit and improves readability.
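
i.e. the rename would look like:

-static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
+static ioasid_t intel_vcmd_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)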

> >   
> > > > +{
> > > > +	struct intel_iommu *iommu = data;
> > > > +	ioasid_t ioasid;
> > > > +
> > > > +	if (!iommu)
> > > > +		return INVALID_IOASID;
> > > > +	/*
> > > > +	 * VT-d virtual command interface always uses the full
> > > > 20 bit
> > > > +	 * PASID range. Host can partition guest PASID range
> > > > based on
> > > > +	 * policies but it is out of guest's control.
> > > > +	 */
> > > > +	if (min < PASID_MIN || max > intel_pasid_max_id)
> > > > +		return INVALID_IOASID;
> > > > +
> > > > +	if (vcmd_alloc_pasid(iommu, &ioasid))
> > > > +		return INVALID_IOASID;
> > > > +
> > > > +	return ioasid;
> > > > +}
> > > > +
> > > > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > > > +{
> > > > +	struct intel_iommu *iommu = data;
> > > > +
> > > > +	if (!iommu)
> > > > +		return;
> > > > +	/*
> > > > +	 * Sanity check the ioasid owner is done at upper
> > > > layer, e.g. VFIO
> > > > +	 * We can only free the PASID when all the devices are
> > > > unbound.
> > > > +	 */
> > > > +	if (ioasid_find(NULL, ioasid, NULL)) {
> > > > +		pr_alert("Cannot free active IOASID %d\n",
> > > > ioasid);
> > > > +		return;
> > > > +	}  
> > >
> > > However the sanity check is not done in default_free. Is there a
> > > reason why using vcmd adds such  new requirement?
> > >  
> > Since we don't support nested guests, this vcmd allocator is only
> > used by the guest IOMMU driver, not VFIO. We expect the IOMMU driver
> > to have control of the free()/unbind() ordering.
> > 
> > For default_free, the call can come from user space or host VFIO,
> > which can be out of order. But we will solve that issue with a
> > blocking notifier.
> >   
> > > > +	vcmd_free_pasid(iommu, ioasid);
> > > > +}
> > > > +
> > > > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > > > +{
> > > > +	/*
> > > > +	 * If we are running in the host, no need for custom
> > > > allocator
> > > > +	 * in that PASIDs are allocated from the host
> > > > system-wide.
> > > > +	 */
> > > > +	if (!cap_caching_mode(iommu->cap))
> > > > +		return;  
> > >
> > > is it more accurate to check against vcmd capability?
> > >  
> > I think this is sufficient. The spec says if vcmd is present we
> > must use it, but not the other way around.
> 
> No, what about a vIOMMU implementation that reports CM but not
> VCMD? I didn't get the rationale for checking an indirect capability
> when there is already one well defined for the purpose.
> 
We _do_ check ecap_vcs() later on. It is just an ordering thing; my
thinking was to do a quick check first for whether we are running in a
host.
...
	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {

> >   
>  [...]  
> > >
> > > which one? the first or last?
> > >  
> > All allocators share the same ops, so first = last. The IOASID code
> > inspects the ops functions and, if they are shared with an existing
> > allocator, uses the same ops.
> 
> ok, got you.
> 
> >   
>  [...]  
> > >
> > > since you register an allocator for every vIOMMU, means previously
> > > registered allocators should also be unregistered here?
> > >  
>  [...]  
> > > > +		}
> > > > +	}
> > > > +}
> > > > +#endif
> > > > +
> > > >  static int __init init_dmars(void)
> > > >  {
> > > >  	struct dmar_drhd_unit *drhd;
> > > > @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
> > > >  	 */
> > > >  	for_each_active_iommu(iommu, drhd) {
> > > >  		iommu_flush_write_buffer(iommu);
> > > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > > +		register_pasid_allocator(iommu);
> > > > +#endif
> > > >  		iommu_set_root_entry(iommu);
> > > >  		iommu->flush.flush_context(iommu, 0, 0, 0,
> > > > DMA_CCMD_GLOBAL_INVL);
> > > >  		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> > > > DMA_TLB_GLOBAL_FLUSH);
> > > > diff --git a/include/linux/intel-iommu.h
> > > > b/include/linux/intel-iommu.h index 9cbf5357138b..9c357a325c72
> > > > 100644 --- a/include/linux/intel-iommu.h
> > > > +++ b/include/linux/intel-iommu.h
> > > > @@ -19,6 +19,7 @@
> > > >  #include <linux/iommu.h>
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/dmar.h>
> > > > +#include <linux/ioasid.h>
> > > >
> > > >  #include <asm/cacheflush.h>
> > > >  #include <asm/iommu.h>
> > > > @@ -563,6 +564,7 @@ struct intel_iommu {
> > > >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > > >  	struct page_req_dsc *prq;
> > > >  	unsigned char prq_name[16];    /* Name for PRQ
> > > > interrupt */
> > > > +	struct ioasid_allocator_ops pasid_allocator; /* Custom
> > > > allocator for PASIDs */
> > > >  #endif
> > > >  	struct q_inval  *qi;            /* Queued invalidation
> > > > info */ u32 *iommu_state; /* Store iommu states between suspend
> > > > and resume.*/
> > > > --
> > > > 2.7.4  
> > >  
> > 
> > [Jacob Pan]  

[Jacob Pan]
