linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support
@ 2020-04-03 18:42 Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 01/10] iommu/vt-d: Move domain helper to header Jacob Pan
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

Shared Virtual Address (SVA), a.k.a. Shared Virtual Memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.
This series is intended to enable SVA virtualization, i.e. to enable use of
SVA within a guest user application.

This is the remaining portion of the original patchset and is based on
Joerg's x86/vt-d branch. The preparatory and cleanup patches are already
merged there.
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)

Only the IOMMU portion of the changes is included in this series. Additional
support is needed in VFIO and QEMU (to be submitted separately) to complete
this functionality.

To make incremental changes and reduce the size of each patchset, this
series does not include support for page request services.

In the VT-d implementation, the PASID table is per device and maintained in
the host. The guest PASID table is shadowed in the VMM, where the virtual
IOMMU is emulated.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
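
For reference, the diagram maps onto the guest PASID bind uAPI roughly as
follows. This is an illustrative sketch only (the variable names are
hypothetical); the real caller lives in the VFIO/QEMU code submitted
separately:

    struct iommu_gpasid_bind_data data = {
    	.version    = IOMMU_GPASID_BIND_VERSION_1,
    	.format     = IOMMU_PASID_FORMAT_INTEL_VTD,
    	.flags      = IOMMU_SVA_GPASID_VAL,	/* guest PASID is valid */
    	.gpgd       = guest_cr3_gpa,		/* FL pointer: guest CR3 in GPA */
    	.hpasid     = host_pasid,		/* host PASID backing the entry */
    	.gpasid     = guest_pasid,		/* guest's view of the PASID */
    	.addr_width = 48,			/* guest uses 4-level paging */
    };

    /* A shadow PASID cache flush in the VMM ultimately lands here: */
    ret = iommu_sva_bind_gpasid(domain, dev, &data);

The SL (GPA-HPA) tables come from @domain, the device's default domain.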

This is the remaining VT-d-only portion of v5, since the uAPIs and IOASID
common code have already been applied to Joerg's IOMMU core branch.
(https://lkml.org/lkml/2019/10/2/833)

The complete set with VFIO patches is available here:
https://github.com/jacobpan/linux.git:siov_sva

The complete nested SVA upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. Page Request Services (PRS) support
    3. Mediated device assignment

With this set and the accompanying VFIO code, we achieve phase #1.

Thanks,

Jacob

ChangeLog:
	- v11 Misc fixes and improvements based on review by Kevin & Eric
	  - Fixed devTLB granularity conversion
	  - Simplified VT-d granularity lookup by replacing the 2D map array
	    with invalid entries.
	  - Fixed locking in bind guest PASID
	  - Added nesting domain attr check
	  - Squashed the agaw checking patch with its user
	  - Use rate-limited error messages for all user-originated calls
 
	- v10
	  - Addressed Eric's review of v7 and v9. Most fixes are in 3/10 and
	    6/10: extra condition checks and consolidation of duplicated code.

	- v9
	  - Addressed Baolu's comments on v8: IOTLB flush consolidation and
	    bug fixes
	  - Removed the IOASID notifier code, which will be submitted separately
	    to address PASID life cycle management with multiple users.

	- v8
	  - Extracted cleanup patches from v7; they were accepted into the
	    maintainer's tree (https://lkml.org/lkml/2019/12/2/514).
	  - Added an IOASID notifier and a VT-d handler for terminating the
	    PASID IOMMU context upon free. This ensures the VFIO IOASID free
	    API succeeds regardless of whether the PASID is in use.
	    (https://lore.kernel.org/linux-iommu/1571919983-3231-1-git-send-email-yi.l.liu@intel.com/)

	- V7
	  - Respect the vIOMMU PASID range in the virtual command PASID/IOASID
	    allocator
	  - Cache virtual command capabilities to avoid runtime checks that
	    could cause vmexits.

	- V6
	  - Rebased on top of Joerg's core branch
	  (git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
	  - Adapt to new uAPIs and IOASID allocators

	- V5
	  Rebased on v5.3-rc4, which has some of the IOMMU fault APIs merged.
	  Addressed v4 review comments from Eric Auger, Baolu Lu, and
	  Jonathan Cameron. Specific changes are as follows:
	  - Refined the custom IOASID allocator to support multiple vIOMMUs
	    and hotplug cases.
	  - Extracted vendor data from the IOMMU guest PASID bind data, so
	    that VT-d can support all necessary guest PASID entry fields for
	    PASID bind.
	  - Support non-identity host-guest PASID mapping
	  - Exception handling in various cases

	- V4
	  - Redesigned the IOASID allocator so that it can support custom
	    allocators with shared helper functions. Use a separate XArray
	    to store IOASIDs per allocator. Took Eric Auger's advice to
	    have the default allocator use the generic allocator structure.
	    Combined into one patch since the default allocator is now just
	    "another" allocator. Can be built as a module in case of
	    driver use without IOMMU.
	  - Extended bind guest PASID data to support SMMU and non-identity
	    guest-to-host PASID mapping https://lkml.org/lkml/2019/5/21/802
	  - Rebased on Jean's sva/api common tree; new patches start with
	    [PATCH v4 10/22]

	- V3
	  - Addressed thorough review comments from Eric Auger (Thank you!)
	  - Moved IOASID allocator from driver core to IOMMU code per
	    suggestion by Christoph Hellwig
	    (https://lkml.org/lkml/2019/4/26/462)
	  - Rebased on top of Jean's SVA API branch and Eric's v7[1]
	    (git://linux-arm.org/linux-jpb.git sva/api)
	  - All IOMMU APIs are unmodified (except the new bind guest PASID
	    call in patch 9/16)

	- V2
	  - Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
	  - Integrated with Eric Auger's new v7 series for common APIs
	  (https://github.com/eauger/linux/tree/v5.1-rc3-2stage-v7)
	  - Addressed review comments from Andy Shevchenko and Alex Williamson on
	    IOASID custom allocator.
	  - Support multiple custom IOASID allocators (vIOMMUs) and dynamic
	    registration.


Jacob Pan (9):
  iommu/vt-d: Move domain helper to header
  iommu/uapi: Define a mask for bind data
  iommu/vt-d: Use a helper function to skip agaw for SL
  iommu/vt-d: Add nested translation helper function
  iommu/vt-d: Add bind guest PASID support
  iommu/vt-d: Support flushing more translation cache types
  iommu/vt-d: Add svm/sva invalidate function
  iommu/vt-d: Cache virtual command capability register
  iommu/vt-d: Add custom allocator for IOASID

Lu Baolu (1):
  iommu/vt-d: Enlightened PASID allocation

 drivers/iommu/dmar.c        |  37 +++++
 drivers/iommu/intel-iommu.c | 277 ++++++++++++++++++++++++++++++++----
 drivers/iommu/intel-pasid.c | 340 ++++++++++++++++++++++++++++++++++++++++++--
 drivers/iommu/intel-pasid.h |  25 +++-
 drivers/iommu/intel-svm.c   | 206 +++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  70 ++++++++-
 include/linux/intel-svm.h   |  17 +++
 include/uapi/linux/iommu.h  |   5 +-
 8 files changed, 925 insertions(+), 52 deletions(-)

-- 
2.7.4



* [PATCH v11 01/10] iommu/vt-d: Move domain helper to header
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 02/10] iommu/uapi: Define a mask for bind data Jacob Pan
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

Move the domain helper to the header so that it can be used by SVA code.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 drivers/iommu/intel-iommu.c | 6 ------
 include/linux/intel-iommu.h | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4be549478691..e599b2537b1c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -446,12 +446,6 @@ static void init_translation_status(struct intel_iommu *iommu)
 		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
 }
 
-/* Convert generic 'struct iommu_domain to private struct dmar_domain */
-static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
-{
-	return container_of(dom, struct dmar_domain, domain);
-}
-
 static int __init intel_iommu_setup(char *str)
 {
 	if (!str)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 980234ae0312..ed7171d2ae1f 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -595,6 +595,12 @@ static inline void __iommu_flush_cache(
 		clflush_cache_range(addr, size);
 }
 
+/* Convert generic struct iommu_domain to private struct dmar_domain */
+static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
+{
+	return container_of(dom, struct dmar_domain, domain);
+}
+
 /*
  * 0: readable
  * 1: writable
-- 
2.7.4
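
For illustration, a hypothetical SVA-side caller would use the helper like
this (sketch only; the real users arrive in later patches of this series):

    static int example_sva_op(struct iommu_domain *domain)
    {
    	/* Recover the VT-d private domain from the generic one */
    	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

    	return dmar_domain->agaw;	/* e.g. inspect private state */
    }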



* [PATCH v11 02/10] iommu/uapi: Define a mask for bind data
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 01/10] iommu/vt-d: Move domain helper to header Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-08 13:07   ` Joerg Roedel
  2020-04-03 18:42 ` [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL Jacob Pan
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

Memory type related flags can be grouped together for one simple check.

---
v9 renamed from EMT to MTS since these are memory type support flags.
---

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/uapi/linux/iommu.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4ad3496e5c43..d7bcbc5f79b0 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
 	__u32 pat;
 	__u32 emt;
 };
-
+#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
 /**
  * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
  * @version:	Version of this data structure
-- 
2.7.4
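
The mask lets a driver test all four memory-type flags at once before doing
any per-flag work, e.g. (sketch; the real check appears in the nested
translation patch later in this series):

    if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
    	/* At least one memory type flag is set; check ecap MTS first */
    }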



* [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 01/10] iommu/vt-d: Move domain helper to header Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 02/10] iommu/uapi: Define a mask for bind data Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-05  6:37   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function Jacob Pan
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

An Intel iommu domain uses a 5-level page table by default. If the
iommu that the domain is attached to supports fewer page table levels,
the top level page tables should be skipped. Add a helper to do
this so that it can be used in other places.

---
v11 Added Baolu's commit message and squashed 4 & 5 from v10.
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 22b30f10b396..66c364719ad1 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -500,6 +500,26 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 }
 
 /*
+ * Skip top levels of page tables for iommu which has less agaw
+ * than default. Unnecessary for PT mode.
+ */
+static inline int iommu_skip_agaw(struct dmar_domain *domain,
+				  struct intel_iommu *iommu,
+				  struct dma_pte **pgd)
+{
+	int agaw;
+
+	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
+		*pgd = phys_to_virt(dma_pte_addr(*pgd));
+		if (!dma_pte_present(*pgd)) {
+			return -EINVAL;
+		}
+	}
+
+	return agaw;
+}
+
+/*
  * Set up the scalable mode pasid entry for second only translation type.
  */
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
@@ -522,17 +542,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 		return -EINVAL;
 	}
 
-	/*
-	 * Skip top levels of page tables for iommu which has less agaw
-	 * than default. Unnecessary for PT mode.
-	 */
 	pgd = domain->pgd;
-	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
-		pgd = phys_to_virt(dma_pte_addr(pgd));
-		if (!dma_pte_present(pgd)) {
-			dev_err(dev, "Invalid domain page table\n");
-			return -EINVAL;
-		}
+	agaw = iommu_skip_agaw(domain, iommu, &pgd);
+	if (agaw < 0) {
+		dev_err(dev, "Invalid domain page table\n");
+		return -EINVAL;
 	}
 
 	pgd_val = virt_to_phys(pgd);
-- 
2.7.4
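
A quick worked example of what the helper does (illustrative; in this driver
an agaw of 3 corresponds to a 5-level table and an agaw of 2 to a 4-level
table):

    /*
     * domain->agaw = 3 (5 levels), iommu->agaw = 2 (4 levels): the loop
     * runs once, replacing pgd with the table one level down, so the
     * iommu only walks the four levels it supports.
     */
    pgd = domain->pgd;
    agaw = iommu_skip_agaw(domain, iommu, &pgd);	/* returns 2 on success */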



* [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (2 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-08 13:33   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support Jacob Pan
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan, Liu

Nested translation mode is specified in the VT-d 3.0 spec, chapter 3.8.
With the PASID granular translation type set to 011b (nested), the
translation result from the first level (FL) is also subject to a second
level (SL) page table translation. This mode is used for SVA
virtualization, where FL performs guest virtual to guest physical
translation and SL performs guest physical to host physical translation.

This patch adds a helper function for setting up nested translation,
where the second level comes from a domain and the first level comes
from a guest PGD.

---
v11 Added check for nesting domain attr. Moved flags to header file.
    Improved flow in MTS handling suggested by Eric.
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/intel-iommu.c |  25 -----
 drivers/iommu/intel-pasid.c | 246 +++++++++++++++++++++++++++++++++++++++++++-
 drivers/iommu/intel-pasid.h |  12 +++
 include/linux/intel-iommu.h |  28 +++++
 4 files changed, 283 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e599b2537b1c..c0dadec5a6b3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -296,31 +296,6 @@ static inline void context_clear_entry(struct context_entry *context)
 static struct dmar_domain *si_domain;
 static int hw_pass_through = 1;
 
-/* si_domain contains mulitple devices */
-#define DOMAIN_FLAG_STATIC_IDENTITY		BIT(0)
-
-/*
- * This is a DMA domain allocated through the iommu domain allocation
- * interface. But one or more devices belonging to this domain have
- * been chosen to use a private domain. We should avoid to use the
- * map/unmap/iova_to_phys APIs on it.
- */
-#define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
-
-/*
- * When VT-d works in the scalable mode, it allows DMA translation to
- * happen through either first level or second level page table. This
- * bit marks that the DMA translation for the domain goes through the
- * first level page table, otherwise, it goes through the second level.
- */
-#define DOMAIN_FLAG_USE_FIRST_LEVEL		BIT(2)
-
-/*
- * Domain represents a virtual machine which demands iommu nested
- * translation mode support.
- */
-#define DOMAIN_FLAG_NESTING_MODE		BIT(3)
-
 #define for_each_domain_iommu(idx, domain)			\
 	for (idx = 0; idx < g_num_of_iommus; idx++)		\
 		if (domain->iommu_refcnt[idx])
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 66c364719ad1..fcd015644c4f 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
 	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
 }
 
+/*
+ * Setup the Extended Memory Type(EMT) field (Bits 91-93)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emt(struct pasid_entry *pe, u64 value)
+{
+	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
+}
+
+/*
+ * Setup the Page Attribute Table (PAT) field (Bits 96-127)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pat(struct pasid_entry *pe, u64 value)
+{
+	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
+}
+
+/*
+ * Setup the Cache Disable (CD) field (Bit 89)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_cd(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
+}
+
+/*
+ * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emte(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
+}
+
+/*
+ * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_eafe(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+}
+
+/*
+ * Setup the Page-level Cache Disable (PCD) field (Bit 95)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pcd(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
+}
+
+/*
+ * Setup the Page-level Write-Through (PWT)) field (Bit 94)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pwt(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
+}
+
 static void
 pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 				    u16 did, int pasid)
@@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
 	/* Setup Present and PASID Granular Transfer Type: */
-	pasid_set_translation_type(pte, 1);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
 	pasid_set_present(pte);
 	pasid_flush_caches(iommu, pte, pasid, did);
 
@@ -562,7 +632,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pasid_set_domain_id(pte, did);
 	pasid_set_slptr(pte, pgd_val);
 	pasid_set_address_width(pte, agaw);
-	pasid_set_translation_type(pte, 2);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
@@ -596,7 +666,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_address_width(pte, iommu->agaw);
-	pasid_set_translation_type(pte, 4);
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
@@ -610,3 +680,173 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 
 	return 0;
 }
+
+static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
+				struct pasid_entry *pte,
+				struct iommu_gpasid_bind_data_vtd *pasid_data)
+{
+	/*
+	 * Not all guest PASID table entry fields are passed down during bind,
+	 * here we only set up the ones that are dependent on guest settings.
+	 * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
+	 * therefore not set. Other fields, such as snoop related, are set based
+	 * on host needs regardless of guest settings.
+	 */
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
+		if (!ecap_srs(iommu->ecap)) {
+			pr_err("No supervisor request support on %s\n",
+			       iommu->name);
+			return -EINVAL;
+		}
+		pasid_set_sre(pte);
+	}
+
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
+		if (!ecap_eafs(iommu->ecap)) {
+			pr_err("No extended access flag support on %s\n",
+				iommu->name);
+			return -EINVAL;
+		}
+		pasid_set_eafe(pte);
+	}
+
+	/*
+	 * Memory type is only applicable to devices inside processor coherent
+	 * domain. PCIe devices are not included. We can skip the rest of the
+	 * flags if IOMMU does not support MTS.
+	 */
+	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
+		return 0;
+
+	if (!ecap_mts(iommu->ecap)) {
+		pr_err("No memory type support for bind guest PASID on %s\n", iommu->name);
+		return -EINVAL;
+	}
+
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
+		pasid_set_emte(pte);
+		pasid_set_emt(pte, pasid_data->emt);
+	}
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
+		pasid_set_pcd(pte);
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
+		pasid_set_pwt(pte);
+	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
+		pasid_set_cd(pte);
+	pasid_set_pat(pte, pasid_data->pat);
+
+	return 0;
+}
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
+ * This could be used for guest shared virtual address. In this case, the
+ * first level page tables are used for GVA-GPA translation in the guest,
+ * second level page tables are used for GPA-HPA translation.
+ *
+ * @iommu:      IOMMU which the device belong to
+ * @dev:        Device to be set up for translation
+ * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
+ * @pasid:      PASID to be programmed in the device PASID table
+ * @pasid_data: Additional PASID info from the guest bind request
+ * @domain:     Domain info for setting up second level page tables
+ * @addr_width: Address width of the first level (guest)
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *gpgd,
+			int pasid, struct iommu_gpasid_bind_data_vtd *pasid_data,
+			struct dmar_domain *domain,
+			int addr_width)
+{
+	struct pasid_entry *pte;
+	struct dma_pte *pgd;
+	int ret = 0;
+	u64 pgd_val;
+	int agaw;
+	u16 did;
+
+	if (!ecap_nest(iommu->ecap)) {
+		pr_err_ratelimited("IOMMU: %s: No nested translation support\n",
+		       iommu->name);
+		return -EINVAL;
+	}
+
+	if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
+		pr_err_ratelimited("Domain is not in nesting mode, %x\n", domain->flags);
+		return -EINVAL;
+	}
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	/*
+	 * Caller must ensure PASID entry is not in use, i.e. not bind the
+	 * same PASID to the same device twice.
+	 */
+	if (pasid_pte_is_present(pte))
+		return -EBUSY;
+
+	pasid_clear_entry(pte);
+
+	/* Sanity checking performed by caller to make sure address
+	 * width matching in two dimensions:
+	 * 1. CPU vs. IOMMU
+	 * 2. Guest vs. Host.
+	 */
+	switch (addr_width) {
+	case ADDR_WIDTH_5LEVEL:
+		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+			cap_5lp_support(iommu->cap)) {
+			pasid_set_flpm(pte, 1);
+		} else {
+			dev_err_ratelimited(dev, "5-level paging not supported\n");
+			return -EINVAL;
+		}
+		break;
+	case ADDR_WIDTH_4LEVEL:
+		pasid_set_flpm(pte, 0);
+		break;
+	default:
+		dev_err_ratelimited(dev, "Invalid guest address width %d\n", addr_width);
+		return -EINVAL;
+	}
+
+	/* First level PGD is in GPA, must be supported by the second level */
+	if ((u64)gpgd > domain->max_addr) {
+		dev_err_ratelimited(dev, "Guest PGD %llx not supported, max %llx\n",
+			(u64)gpgd, domain->max_addr);
+		return -EINVAL;
+	}
+	pasid_set_flptr(pte, (u64)gpgd);
+
+	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
+	if (ret) {
+		dev_err_ratelimited(dev, "Guest PASID bind data not supported\n");
+		return ret;
+	}
+
+	/* Setup the second level based on the given domain */
+	pgd = domain->pgd;
+
+	agaw = iommu_skip_agaw(domain, iommu, &pgd);
+	if (agaw < 0) {
+		dev_err_ratelimited(dev, "Invalid domain page table\n");
+		return -EINVAL;
+	}
+	pgd_val = virt_to_phys(pgd);
+	pasid_set_slptr(pte, pgd_val);
+	pasid_set_fault_enable(pte);
+
+	did = domain->iommu_did[iommu->seq_id];
+	pasid_set_domain_id(pte, did);
+
+	pasid_set_address_width(pte, agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+	pasid_flush_caches(iommu, pte, pasid, did);
+
+	return ret;
+}
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 92de6df24ccb..698015ee3f04 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -36,6 +36,7 @@
  * to vmalloc or even module mappings.
  */
 #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
+#define PASID_FLAG_NESTED		BIT(1)
 
 /*
  * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
@@ -51,6 +52,11 @@ struct pasid_entry {
 	u64 val[8];
 };
 
+#define PASID_ENTRY_PGTT_FL_ONLY	(1)
+#define PASID_ENTRY_PGTT_SL_ONLY	(2)
+#define PASID_ENTRY_PGTT_NESTED		(3)
+#define PASID_ENTRY_PGTT_PT		(4)
+
 /* The representative of a PASID table */
 struct pasid_table {
 	void			*table;		/* pasid table pointer */
@@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, int pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *pgd,
+			int pasid,
+			struct iommu_gpasid_bind_data_vtd *pasid_data,
+			struct dmar_domain *domain,
+			int addr_width);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ed7171d2ae1f..6da03f627ba3 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -42,6 +42,9 @@
 #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define ADDR_WIDTH_5LEVEL	(57)
+#define ADDR_WIDTH_4LEVEL	(48)
+
 #define CONTEXT_TT_MULTI_LEVEL	0
 #define CONTEXT_TT_DEV_IOTLB	1
 #define CONTEXT_TT_PASS_THROUGH 2
@@ -480,6 +483,31 @@ struct context_entry {
 	u64 hi;
 };
 
+/* si_domain contains mulitple devices */
+#define DOMAIN_FLAG_STATIC_IDENTITY		BIT(0)
+
+/*
+ * This is a DMA domain allocated through the iommu domain allocation
+ * interface. But one or more devices belonging to this domain have
+ * been chosen to use a private domain. We should avoid to use the
+ * map/unmap/iova_to_phys APIs on it.
+ */
+#define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
+
+/*
+ * When VT-d works in the scalable mode, it allows DMA translation to
+ * happen through either first level or second level page table. This
+ * bit marks that the DMA translation for the domain goes through the
+ * first level page table, otherwise, it goes through the second level.
+ */
+#define DOMAIN_FLAG_USE_FIRST_LEVEL		BIT(2)
+
+/*
+ * Domain represents a virtual machine which demands iommu nested
+ * translation mode support.
+ */
+#define DOMAIN_FLAG_NESTING_MODE		BIT(3)
+
 struct dmar_domain {
 	int	nid;			/* node id */
 
-- 
2.7.4
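
The pasid_set_* helpers above all build on the driver's existing
pasid_set_bits() accessor; a 512-bit PASID entry is stored as u64 val[8], so
entry bit 91 (EMT), for example, lands in val[1] bit 27 (91 - 64). For
reference, the accessor looks roughly like this (paraphrased from the
existing intel-pasid.c):

    /* Replace the bits selected by @mask in *@ptr with @bits */
    static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
    {
    	u64 old;

    	old = READ_ONCE(*ptr);
    	WRITE_ONCE(*ptr, (old & ~mask) | bits);
    }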



* [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (3 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-09  7:41   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types Jacob Pan
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan, Liu, Yi L

When supporting guest SVA with an emulated IOMMU, the guest PASID
table is shadowed in the VMM. Updates to the guest vIOMMU PASID table
result in a PASID cache flush, which is passed down to the host as
bind guest PASID calls.

The SL page tables are harvested from the device's default domain
(request w/o PASID), or from the aux domain in the case of a mediated
device.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

---
v11: Fixed locking, avoided a duplicated paging mode check, and added a
helper to free the svm if the device list is empty. Use rate-limited error
messages since the bind gpasid call comes from user space.
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c |   4 +
 drivers/iommu/intel-svm.c   | 206 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |   8 +-
 include/linux/intel-svm.h   |  17 ++++
 4 files changed, 234 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index c0dadec5a6b3..94c7993dac6a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.sva_bind_gpasid	= intel_svm_bind_gpasid,
+	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
+#endif
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d7f2a5358900..7cf711318b87 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
 	list_for_each_entry((sdev), &(svm)->devs, list)	\
 		if ((d) != (sdev)->dev) {} else
 
+
+static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
+{
+	if (list_empty(&svm->devs)) {
+		ioasid_set_data(pasid, NULL);
+		kfree(svm);
+	}
+}
+
+int intel_svm_bind_gpasid(struct iommu_domain *domain,
+			struct device *dev,
+			struct iommu_gpasid_bind_data *data)
+{
+	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+	struct dmar_domain *dmar_domain;
+	struct intel_svm_dev *sdev;
+	struct intel_svm *svm;
+	int ret = 0;
+
+	if (WARN_ON(!iommu) || !data)
+		return -EINVAL;
+
+	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
+	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
+		return -EINVAL;
+
+	if (dev_is_pci(dev)) {
+		/* VT-d supports devices with full 20 bit PASIDs only */
+		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
+			return -EINVAL;
+	} else {
+		return -ENOTSUPP;
+	}
+
+	/*
+	 * We only check host PASID range, we have no knowledge to check
+	 * guest PASID range.
+	 */
+	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
+		return -EINVAL;
+
+	dmar_domain = to_dmar_domain(domain);
+
+	mutex_lock(&pasid_mutex);
+	svm = ioasid_find(NULL, data->hpasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
+	if (svm) {
+		/*
+		 * If we found svm for the PASID, there must be at
+		 * least one device bond, otherwise svm should be freed.
+		 */
+		if (WARN_ON(list_empty(&svm->devs))) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		for_each_svm_dev(sdev, svm, dev) {
+			/* In case of multiple sub-devices of the same pdev
+			 * assigned, we should allow multiple bind calls with
+			 * the same PASID and pdev.
+			 */
+			sdev->users++;
+			goto out;
+		}
+	} else {
+		/* We come here when PASID has never been bond to a device. */
+		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+		if (!svm) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		/* REVISIT: upper layer/VFIO can track host process that bind the PASID.
+		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
+		 * ownership. We can drop the following line once VFIO and IOASID set
+		 * check is in place.
+		 */
+		svm->mm = get_task_mm(current);
+		svm->pasid = data->hpasid;
+		if (data->flags & IOMMU_SVA_GPASID_VAL) {
+			svm->gpasid = data->gpasid;
+			svm->flags |= SVM_FLAG_GUEST_PASID;
+		}
+		ioasid_set_data(data->hpasid, svm);
+		INIT_LIST_HEAD_RCU(&svm->devs);
+		mmput(svm->mm);
+	}
+	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+	if (!sdev) {
+		/*
+		 * If this is a new PASID that never bond to a device, then
+		 * the device list must be empty which indicates struct svm
+		 * was allocated in this function.
+		 */
+		intel_svm_free_if_empty(svm, data->hpasid);
+		ret = -ENOMEM;
+		goto out;
+	}
+	sdev->dev = dev;
+	sdev->users = 1;
+
+	/* Set up device context entry for PASID if not enabled already */
+	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+	if (ret) {
+		dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
+		kfree(sdev);
+		intel_svm_free_if_empty(svm, data->hpasid);
+		goto out;
+	}
+
+	/*
+	 * PASID table is per device for better security. Therefore, for
+	 * each bind of a new device even with an existing PASID, we need to
+	 * call the nested mode setup function here.
+	 */
+	spin_lock(&iommu->lock);
+	ret = intel_pasid_setup_nested(iommu,
+				       dev,
+				       (pgd_t *)data->gpgd,
+				       data->hpasid,
+				       &data->vtd,
+				       dmar_domain,
+				       data->addr_width);
+	if (ret) {
+		dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
+			data->hpasid, ret);
+		/*
+		 * PASID entry should be in cleared state if nested mode
+		 * set up failed. So we only need to clear IOASID tracking
+		 * data such that free call will succeed.
+		 */
+		kfree(sdev);
+		intel_svm_free_if_empty(svm, data->hpasid);
+		spin_unlock(&iommu->lock);
+		goto out;
+	}
+	spin_unlock(&iommu->lock);
+	svm->flags |= SVM_FLAG_GUEST_MODE;
+
+	init_rcu_head(&sdev->rcu);
+	list_add_rcu(&sdev->list, &svm->devs);
+ out:
+	mutex_unlock(&pasid_mutex);
+	return ret;
+}
+
+int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+{
+	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+	struct intel_svm_dev *sdev;
+	struct intel_svm *svm;
+	int ret = -EINVAL;
+
+	if (WARN_ON(!iommu))
+		return -EINVAL;
+
+	mutex_lock(&pasid_mutex);
+	svm = ioasid_find(NULL, pasid, NULL);
+	if (!svm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
+	for_each_svm_dev(sdev, svm, dev) {
+		ret = 0;
+		sdev->users--;
+		if (!sdev->users) {
+			list_del_rcu(&sdev->list);
+			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+			/* TODO: Drain in flight PRQ for the PASID since it
+			 * may get reused soon, we don't want to
+			 * confuse with its previous life.
+			 * intel_svm_drain_prq(dev, pasid);
+			 */
+			kfree_rcu(sdev, rcu);
+
+			if (list_empty(&svm->devs)) {
+				/*
+				 * We do not free the IOASID here in that
+				 * IOMMU driver did not allocate it.
+				 * Unlike native SVM, IOASID for guest use was
+				 * allocated prior to the bind call.
+				 * In any case, if the free call comes before
+				 * the unbind, IOMMU driver will get notified
+				 * and perform cleanup.
+				 */
+				ioasid_set_data(pasid, NULL);
+				kfree(svm);
+			}
+		}
+		break;
+	}
+out:
+	mutex_unlock(&pasid_mutex);
+
+	return ret;
+}
+
 int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 6da03f627ba3..a5bd53cf190c 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device *dev);
 extern void intel_svm_check(struct intel_iommu *iommu);
 extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
-
+extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
+		struct device *dev, struct iommu_gpasid_bind_data *data);
+extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
 struct svm_dev_ops;
 
 struct intel_svm_dev {
@@ -723,9 +725,13 @@ struct intel_svm_dev {
 struct intel_svm {
 	struct mmu_notifier notifier;
 	struct mm_struct *mm;
+
 	struct intel_iommu *iommu;
 	int flags;
 	int pasid;
+	int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
+		     * to guest PASID mapping.
+		     */
 	struct list_head devs;
 	struct list_head list;
 };
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index d7c403d0dd27..c19690937540 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -44,6 +44,23 @@ struct svm_dev_ops {
  * do such IOTLB flushes automatically.
  */
 #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
+/*
+ * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
+ * In this case the mm_struct is in the guest kernel or userspace, its life
+ * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
+ * means to bind/unbind guest CR3 with PASIDs allocated for a device.
+ */
+#define SVM_FLAG_GUEST_MODE	(1<<2)
+/*
+ * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
+ * which requires guest and host PASID translation at both directions. We keep
+ * track of guest PASID in order to provide lookup service to device drivers.
+ * One such example is a physical function (PF) driver that supports mediated
+ * device (mdev) assignment. Guest programming of mdev configuration space can
+ * only be done with guest PASID, therefore PF driver needs to find the matching
+ * host PASID to program the real hardware.
+ */
+#define SVM_FLAG_GUEST_PASID	(1<<3)
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 
-- 
2.7.4
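
The reference counting above yields the following pairing semantics; a
hypothetical sequence (sketch only) for one device and one PASID:

    ret = intel_svm_bind_gpasid(domain, dev, &data);	/* sets up nested PASID entry */
    ret = intel_svm_bind_gpasid(domain, dev, &data);	/* same dev+PASID: sdev->users++ only */
    intel_svm_unbind_gpasid(dev, data.hpasid);		/* sdev->users-- */
    intel_svm_unbind_gpasid(dev, data.hpasid);		/* last user: tear down entry, free svm */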



* [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (4 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-05  6:37   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

When Shared Virtual Memory is exposed to a guest via a vIOMMU, scalable
mode IOTLB invalidations may be passed down from outside the IOMMU
subsystem. This patch adds invalidation functions that can be used for
additional translation cache types.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v9 -> v10:
Fix off-by-1 in the PASID device IOTLB flush

Address review comments from Eric that were missed in v7

---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.c |  3 ++-
 include/linux/intel-iommu.h | 20 ++++++++++++++++----
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f77dae7ba7d4..4d6b7b5b37ee 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
+/* PASID-based device IOTLB Invalidate */
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+		u32 pasid,  u16 qdep, u64 addr, unsigned size_order, u64 granu)
+{
+	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
+	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
+
+	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
+		QI_DEV_IOTLB_PFSID(pfsid);
+	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
+
+	/*
+	 * If S bit is 0, we only flush a single page. If S bit is set,
+	 * The least significant zero bit indicates the invalidation address
+	 * range. VT-d spec 6.5.2.6.
+	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
+	 * size order = 0 is PAGE_SIZE 4KB
+	 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
+	 * ECAP.
+	 */
+	desc.qw1 |= addr & ~mask;
+	if (size_order)
+		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
+
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
+
+	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
+	qi_submit_sync(&desc, iommu);
+}
+
 /*
  * Disable Queued Invalidation interface.
  */
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index fcd015644c4f..698e661ccca7 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 {
 	struct qi_desc desc;
 
-	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
+	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
+		QI_PC_PASID(pasid) | QI_PC_TYPE;
 	desc.qw1 = 0;
 	desc.qw2 = 0;
 	desc.qw3 = 0;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index a5bd53cf190c..f775bb825343 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -334,7 +334,7 @@ enum {
 #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
 #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
 #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
-#define QI_IOTLB_AM(am)		(((u8)am))
+#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
 
 #define QI_CC_FM(fm)		(((u64)fm) << 48)
 #define QI_CC_SID(sid)		(((u64)sid) << 32)
@@ -353,16 +353,21 @@ enum {
 #define QI_PC_DID(did)		(((u64)did) << 16)
 #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
 
-#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS	0
+#define QI_PC_PASID_SEL		1
 
 #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
-#define QI_EIOTLB_AM(am)	(((u64)am))
+#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
 #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
 #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
 #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
 
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
+
 #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
 #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
 #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
@@ -687,8 +692,15 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
 extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask);
+
 void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
 		     unsigned long npages, bool ih);
+
+extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
+
+extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
+
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
 extern int dmar_ir_support(void);
-- 
2.7.4
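
With the granularity now passed explicitly, the existing internal PASID
cache flush becomes equivalent to the new exported helper, e.g.
(illustrative):

    /* Same descriptor as pasid_cache_invalidation_with_pasid() builds: */
    qi_flush_pasid_cache(iommu, did, QI_PC_PASID_SEL, pasid);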



* [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (5 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-09  8:50   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register Jacob Pan
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan, Liu

When Shared Virtual Address (SVA) is enabled for a guest OS via a
vIOMMU, we need to provide invalidation support at the IOMMU API and
driver level. This patch adds an Intel VT-d specific function implementing
the IOMMU passdown invalidate API for shared virtual addresses.

The use case is to support caching structure invalidation
for assigned SVM capable devices. The emulated IOMMU exposes queued
invalidation capability and passes down all descriptors from the guest
to the physical IOMMU.

The assumption is that the guest to host device ID mapping is
resolved prior to calling the IOMMU driver. Based on the device handle,
the host IOMMU driver can replace certain fields before submitting to the
invalidation queue.

---
v11 - Removed the 2D map array; use -EINVAL for invalid entries in the
      granularity lookup array. Fixed devTLB invalidation granularity
      mapping. Disregard the G=1 case and use address selective
      invalidation only.

v7 review fixed in v10
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/intel-iommu.c | 158 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 94c7993dac6a..045c5c08d71d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5594,6 +5594,163 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
 	aux_domain_remove_dev(to_dmar_domain(domain), dev);
 }
 
+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for assigned devices guest
+ * owns the first level page tables. Invalidations of translation caches in the
+ * guest are trapped and passed down to the host.
+ *
+ * vIOMMU in the guest will only expose first level page tables, therefore
+ * we do not support IOTLB granularity for request without PASID (second level).
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by iommu cache type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
+ */
+
+const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
+	/*
+	 * PASID based IOTLB invalidation: PASID selective (per PASID),
+	 * page selective (address granularity)
+	 */
+	{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+	/* PASID based dev TLBs */
+	{-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
+	/* PASID cache */
+	{-EINVAL, -EINVAL, -EINVAL}
+};
+
+static inline int to_vtd_granularity(int type, int granu)
+{
+	return inv_type_granu_table[type][granu];
+}
+
+static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
+{
+	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
+
+	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
+	 * granu size in contiguous memory.
+	 */
+	return order_base_2(nr_pages);
+}
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct intel_iommu *iommu;
+	unsigned long flags;
+	int cache_type;
+	u8 bus, devfn;
+	u16 did, sid;
+	int ret = 0;
+	u64 size = 0;
+
+	if (!inv_info || !dmar_domain ||
+		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+		return -EINVAL;
+
+	if (!dev || !dev_is_pci(dev))
+		return -ENODEV;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	spin_lock(&iommu->lock);
+	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+	if (!info) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	sid = PCI_DEVID(bus, devfn);
+
+	/* Size is only valid in non-PASID selective invalidation */
+	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
+		size = to_vtd_size(inv_info->addr_info.granule_size,
+				   inv_info->addr_info.nb_granules);
+
+	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
+		int granu = 0;
+		u64 pasid = 0;
+
+		granu = to_vtd_granularity(cache_type, inv_info->granularity);
+		if (granu == -EINVAL) {
+			pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
+				inv_info->granularity);
+			break;
+		}
+
+		/* PASID is stored in different locations based on granularity */
+		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
+			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
+			pasid = inv_info->pasid_info.pasid;
+		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
+			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
+			pasid = inv_info->addr_info.pasid;
+
+		switch (BIT(cache_type)) {
+		case IOMMU_CACHE_INV_TYPE_IOTLB:
+			if ((inv_info->granularity == IOMMU_INV_GRANU_ADDR) &&
+				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
+				pr_err("Address out of range, 0x%llx, size order %llu\n",
+					inv_info->addr_info.addr, size);
+				ret = -ERANGE;
+				goto out_unlock;
+			}
+
+			qi_flush_piotlb(iommu, did,
+					pasid,
+					mm_to_dma_pfn(inv_info->addr_info.addr),
+					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
+					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
+
+			/*
+			 * Always flush device IOTLB if ATS is enabled. vIOMMU
+			 * in the guest may assume IOTLB flush is inclusive,
+			 * which is more efficient.
+			 */
+			if (info->ats_enabled)
+				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+						pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			break;
+		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
+			if (info->ats_enabled)
+				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+						inv_info->addr_info.pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			else
+				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
+			break;
+		case IOMMU_CACHE_INV_TYPE_PASID:
+			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
+			break;
+		default:
+			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
+				cache_type);
+			ret = -EINVAL;
+		}
+	}
+out_unlock:
+	spin_unlock(&iommu->lock);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+#endif
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot, gfp_t gfp)
@@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
 #ifdef CONFIG_INTEL_IOMMU_SVM
+	.cache_invalidate	= intel_iommu_sva_invalidate,
 	.sva_bind_gpasid	= intel_svm_bind_gpasid,
 	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
 #endif
-- 
2.7.4
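
To make the lookup table concrete, a few illustrative conversions (the type
argument is the bit index produced by for_each_set_bit(), the granularity is
the generic uAPI enum):

    /* IOTLB + page-selective-within-PASID -> QI_GRAN_PSI_PASID */
    granu = to_vtd_granularity(0 /* IOTLB */, IOMMU_INV_GRANU_ADDR);

    /* PASID cache + domain granularity -> -EINVAL (unsupported combination) */
    granu = to_vtd_granularity(2 /* PASID cache */, IOMMU_INV_GRANU_DOMAIN);

    /* 512 contiguous 4KB granules = 2MB -> VT-d size order 9 */
    size = to_vtd_size(SZ_4K, 512);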



* [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (6 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-09 10:14   ` Auger Eric
  2020-04-03 18:42 ` [PATCH v11 09/10] iommu/vt-d: Enlightened PASID allocation Jacob Pan
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

Virtual command registers are used in the guest only. To avoid vmexit
costs, we cache the capability register during initialization.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

---
v7 Reviewed by Eric & Baolu
---

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/dmar.c        | 1 +
 include/linux/intel-iommu.h | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 4d6b7b5b37ee..3b36491c8bbb 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
 		warn_invalid_dmar(phys_addr, " returns all ones");
 		goto unmap;
 	}
+	iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
 
 	/* the registers might be more than one page */
 	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index f775bb825343..e02a96848586 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -194,6 +194,9 @@
 #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
 #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
 
+/* Virtual command interface capability */
+#define vccap_pasid(v)		((v & DMA_VCS_PAS)) /* PASID allocation */
+
 /* IOTLB_REG */
 #define DMA_TLB_FLUSH_GRANU_OFFSET  60
 #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
@@ -287,6 +290,7 @@
 
 /* PRS_REG */
 #define DMA_PRS_PPR	((u32)1)
+#define DMA_VCS_PAS	((u64)1)
 
 #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)			\
 do {									\
@@ -562,6 +566,7 @@ struct intel_iommu {
 	u64		reg_size; /* size of hw register set */
 	u64		cap;
 	u64		ecap;
+	u64		vccap;
 	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
 	raw_spinlock_t	register_lock; /* protect register handling */
 	int		seq_id;	/* sequence id of the iommu */
-- 
2.7.4



* [PATCH v11 09/10] iommu/vt-d: Enlightened PASID allocation
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (7 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-03 18:42 ` [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
  2020-04-05  8:28 ` [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Lu Baolu
  10 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan

From: Lu Baolu <baolu.lu@linux.intel.com>

Enabling an IOMMU in a guest requires communication with the host
driver for certain aspects. Use of PASIDs to enable Shared Virtual
Addressing (SVA) requires managing PASIDs in the host. The VT-d 3.0
spec provides a Virtual Command Register (VCMD) to facilitate this.
Writes to this register in the guest are trapped by the vIOMMU, which
proxies the call to the host driver.

This virtual command interface consists of a capability register,
a virtual command register, and a virtual response register. Refer
to sections 10.4.42, 10.4.43 and 10.4.44 of the spec for more information.

This patch adds the enlightened PASID allocation/free interfaces
via the virtual command interface.
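
As a concrete (illustrative) example of the encoding defined below:
freeing PASID 0x42 writes

	VCMD_CMD_OPERAND(0x42) | VCMD_CMD_FREE  ==  (0x42 << 8) | 0x2  ==  0x4202

to DMAR_VCMD_REG; completion is then polled from DMAR_VCRSP_REG until the
IP bit clears, after which VCMD_VRSP_SC() extracts the status code.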

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 drivers/iommu/intel-pasid.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.h | 13 ++++++++++-
 include/linux/intel-iommu.h |  1 +
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 698e661ccca7..834ff1d71b70 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -27,6 +27,63 @@
 static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
 
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
+{
+	unsigned long flags;
+	u8 status_code;
+	int ret = 0;
+	u64 res;
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	status_code = VCMD_VRSP_SC(res);
+	switch (status_code) {
+	case VCMD_VRSP_SC_SUCCESS:
+		*pasid = VCMD_VRSP_RESULT_PASID(res);
+		break;
+	case VCMD_VRSP_SC_NO_PASID_AVAIL:
+		pr_info("IOMMU: %s: No PASID available\n", iommu->name);
+		ret = -ENOSPC;
+		break;
+	default:
+		ret = -ENODEV;
+		pr_warn("IOMMU: %s: Unexpected error code %d\n",
+			iommu->name, status_code);
+	}
+
+	return ret;
+}
+
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
+{
+	unsigned long flags;
+	u8 status_code;
+	u64 res;
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG,
+		    VCMD_CMD_OPERAND(pasid) | VCMD_CMD_FREE);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	status_code = VCMD_VRSP_SC(res);
+	switch (status_code) {
+	case VCMD_VRSP_SC_SUCCESS:
+		break;
+	case VCMD_VRSP_SC_INVALID_PASID:
+		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
+		break;
+	default:
+		pr_warn("IOMMU: %s: Unexpected error code %d\n",
+			iommu->name, status_code);
+	}
+}
+
 /*
  * Per device pasid table management:
  */
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 698015ee3f04..cd3d63f3e936 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -23,6 +23,16 @@
 #define is_pasid_enabled(entry)		(((entry)->lo >> 3) & 0x1)
 #define get_pasid_dir_size(entry)	(1 << ((((entry)->lo >> 9) & 0x7) + 7))
 
+/* Virtual command interface for enlightened pasid management. */
+#define VCMD_CMD_ALLOC			0x1
+#define VCMD_CMD_FREE			0x2
+#define VCMD_VRSP_IP			0x1
+#define VCMD_VRSP_SC(e)			(((e) >> 1) & 0x3)
+#define VCMD_VRSP_SC_SUCCESS		0
+#define VCMD_VRSP_SC_NO_PASID_AVAIL	1
+#define VCMD_VRSP_SC_INVALID_PASID	1
+#define VCMD_VRSP_RESULT_PASID(e)	(((e) >> 8) & 0xfffff)
+#define VCMD_CMD_OPERAND(e)		((e) << 8)
 /*
  * Domain ID reserved for pasid entries programmed for first-level
  * only and pass-through transfer modes.
@@ -113,5 +123,6 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu,
 			int addr_width);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
-
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
 #endif /* __INTEL_PASID_H */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index e02a96848586..f652db3198d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -169,6 +169,7 @@
 #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
 #define ecap_flts(e)		(((e) >> 47) & 0x1)
 #define ecap_slts(e)		(((e) >> 46) & 0x1)
+#define ecap_vcs(e)		(((e) >> 44) & 0x1)
 #define ecap_smts(e)		(((e) >> 43) & 0x1)
 #define ecap_dit(e)		((e >> 41) & 0x1)
 #define ecap_pasid(e)		((e >> 40) & 0x1)
-- 
2.7.4



* [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (8 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 09/10] iommu/vt-d: Enlightened PASID allocation Jacob Pan
@ 2020-04-03 18:42 ` Jacob Pan
  2020-04-09 10:31   ` Auger Eric
  2020-04-05  8:28 ` [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Lu Baolu
  10 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Jacob Pan, Liu

When the VT-d driver runs in the guest, PASID allocation must be
performed via the virtual command interface. This patch registers a
custom IOASID allocator which takes precedence over the default
XArray based allocator. The resulting IOASID allocations will always
come from the host. This ensures that the PASID namespace is
system-wide.
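
With the allocator registered, a plain allocation through the ioasid API
is transparently serviced by the host. A minimal sketch, assuming the
ioasid API from the already-merged common code (error handling elided,
'private' is caller data):

	ioasid_t pasid;

	/* On a vIOMMU this routes to intel_vcmd_ioasid_alloc() -> VCMD. */
	pasid = ioasid_alloc(NULL, PASID_MIN, intel_pasid_max_id - 1, private);
	if (pasid == INVALID_IOASID)
		return -ENOSPC;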

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 84 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  2 ++
 2 files changed, 86 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 045c5c08d71d..ff3f0386951f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1732,6 +1732,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
 		if (ecap_prs(iommu->ecap))
 			intel_svm_finish_prq(iommu);
 	}
+	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
+		ioasid_unregister_allocator(&iommu->pasid_allocator);
+
 #endif
 }
 
@@ -3266,6 +3269,84 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 	return ret;
 }
 
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static ioasid_t intel_vcmd_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
+{
+	struct intel_iommu *iommu = data;
+	ioasid_t ioasid;
+
+	if (!iommu)
+		return INVALID_IOASID;
+	/*
+	 * The VT-d virtual command interface always uses the full 20-bit
+	 * PASID range. The host can partition the guest PASID range based
+	 * on policies, but that is out of the guest's control.
+	 */
+	if (min < PASID_MIN || max > intel_pasid_max_id)
+		return INVALID_IOASID;
+
+	if (vcmd_alloc_pasid(iommu, &ioasid))
+		return INVALID_IOASID;
+
+	return ioasid;
+}
+
+static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
+{
+	struct intel_iommu *iommu = data;
+
+	if (!iommu)
+		return;
+	/*
+	 * Sanity checking of the IOASID owner is done at the upper layer,
+	 * e.g. VFIO. We can only free the PASID when all devices are unbound.
+	 */
+	if (ioasid_find(NULL, ioasid, NULL)) {
+		pr_alert("Cannot free active IOASID %d\n", ioasid);
+		return;
+	}
+	vcmd_free_pasid(iommu, ioasid);
+}
+
+static void register_pasid_allocator(struct intel_iommu *iommu)
+{
+	/*
+	 * If we are running in the host, there is no need for a custom
+	 * allocator since PASIDs are allocated system-wide by the host.
+	 */
+	if (!cap_caching_mode(iommu->cap))
+		return;
+
+	if (!sm_supported(iommu)) {
+		pr_warn("VT-d Scalable Mode not enabled, no PASID allocation\n");
+		return;
+	}
+
+	/*
+	 * Register a custom PASID allocator if we are running in a guest;
+	 * guest PASIDs must be obtained via the virtual command interface.
+	 * There can be multiple vIOMMUs in each guest but only one allocator
+	 * is active. All vIOMMU allocators will eventually call into the same
+	 * host allocator.
+	 */
+	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
+		pr_info("Register custom PASID allocator\n");
+		iommu->pasid_allocator.alloc = intel_vcmd_ioasid_alloc;
+		iommu->pasid_allocator.free = intel_vcmd_ioasid_free;
+		iommu->pasid_allocator.pdata = (void *)iommu;
+		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
+			pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
+			/*
+			 * Disable scalable mode on this IOMMU if there
+			 * is no custom allocator. Mixing an SM-capable
+			 * vIOMMU with non-SM vIOMMUs is not supported.
+			 */
+			intel_iommu_sm = 0;
+		}
+	}
+}
+#endif
+
 static int __init init_dmars(void)
 {
 	struct dmar_drhd_unit *drhd;
@@ -3383,6 +3464,9 @@ static int __init init_dmars(void)
 	 */
 	for_each_active_iommu(iommu, drhd) {
 		iommu_flush_write_buffer(iommu);
+#ifdef CONFIG_INTEL_IOMMU_SVM
+		register_pasid_allocator(iommu);
+#endif
 		iommu_set_root_entry(iommu);
 		iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
 		iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index f652db3198d9..e122cb30388e 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/dmar.h>
+#include <linux/ioasid.h>
 
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
@@ -588,6 +589,7 @@ struct intel_iommu {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	struct page_req_dsc *prq;
 	unsigned char prq_name[16];    /* Name for PRQ interrupt */
+	struct ioasid_allocator_ops pasid_allocator; /* Custom allocator for PASIDs */
 #endif
 	struct q_inval  *qi;            /* Queued invalidation info */
 	u32 *iommu_state; /* Store iommu states between suspend and resume.*/
-- 
2.7.4



* Re: [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL
  2020-04-03 18:42 ` [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL Jacob Pan
@ 2020-04-05  6:37   ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-05  6:37 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> An Intel iommu domain uses 5-level page table by default. If the
> iommu that the domain tries to attach to supports fewer page levels,
> the top level page tables should be skipped. Add a helper to do
> this so that it could be used in other places.
> 
> ---
> v11 Added Baolu's commit message and squashed 4 & 5 from v10.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  drivers/iommu/intel-pasid.c | 34 ++++++++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 22b30f10b396..66c364719ad1 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -500,6 +500,26 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  }
>  
>  /*
> + * Skip top levels of page tables for an iommu which has a smaller agaw
> + * than the default. Unnecessary for PT mode.
> + */
> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> +				  struct intel_iommu *iommu,
> +				  struct dma_pte **pgd)
> +{
> +	int agaw;
> +
> +	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> +		*pgd = phys_to_virt(dma_pte_addr(*pgd));
> +		if (!dma_pte_present(*pgd))
> +			return -EINVAL;
> +	}
> +
> +	return agaw;
> +}
> +
> +/*
>   * Set up the scalable mode pasid entry for second only translation type.
>   */
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> @@ -522,17 +542,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  		return -EINVAL;
>  	}
>  
> -	/*
> -	 * Skip top levels of page tables for iommu which has less agaw
> -	 * than default. Unnecessary for PT mode.
> -	 */
>  	pgd = domain->pgd;
> -	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> -		pgd = phys_to_virt(dma_pte_addr(pgd));
> -		if (!dma_pte_present(pgd)) {
> -			dev_err(dev, "Invalid domain page table\n");
> -			return -EINVAL;
> -		}
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err(dev, "Invalid domain page table\n");
> +		return -EINVAL;
>  	}
>  
>  	pgd_val = virt_to_phys(pgd);
> 



* Re: [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types
  2020-04-03 18:42 ` [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types Jacob Pan
@ 2020-04-05  6:37   ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-05  6:37 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> When Shared Virtual Memory is exposed to a guest via a vIOMMU, scalable
> IOTLB invalidation may be passed down from outside the IOMMU subsystem.
> This patch adds invalidation functions that can be used for additional
> translation cache types.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> 
> ---
> v9 -> v10:
> Fix off by 1 in pasid device iotlb flush
> 
> Address v7 missed review from Eric
> 
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/dmar.c        | 36 ++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel-pasid.c |  3 ++-
>  include/linux/intel-iommu.h | 20 ++++++++++++++++----
>  3 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index f77dae7ba7d4..4d6b7b5b37ee 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
> +/* PASID-based device IOTLB Invalidate */
> +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +		u32 pasid,  u16 qdep, u64 addr, unsigned size_order, u64 granu)
> +{
> +	unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
> +	struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
> +
> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> +		QI_DEV_IOTLB_PFSID(pfsid);
> +	desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
> +
> +	/*
> +	 * If the S bit is 0, we only flush a single page. If the S bit is
> +	 * set, the least significant zero bit indicates the invalidation
> +	 * address range. VT-d spec 6.5.2.6.
> +	 * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
> +	 * size order = 0 is PAGE_SIZE 4KB
> +	 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
> +	 * ECAP.
> +	 */
> +	desc.qw1 |= addr & ~mask;
> +	if (size_order)
> +		desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> +
> +	qi_submit_sync(&desc, iommu);
> +}
> +
> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> +{
> +	struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
> +
> +	desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
> +	qi_submit_sync(&desc, iommu);
> +}
> +
>  /*
>   * Disable Queued Invalidation interface.
>   */
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index fcd015644c4f..698e661ccca7 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>  {
>  	struct qi_desc desc;
>  
> -	desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
> +	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
> +		QI_PC_PASID(pasid) | QI_PC_TYPE;
>  	desc.qw1 = 0;
>  	desc.qw2 = 0;
>  	desc.qw3 = 0;
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index a5bd53cf190c..f775bb825343 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -334,7 +334,7 @@ enum {
>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
>  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
>  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
> -#define QI_IOTLB_AM(am)		(((u8)am))
> +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
>  
>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> @@ -353,16 +353,21 @@ enum {
>  #define QI_PC_DID(did)		(((u64)did) << 16)
>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>  
> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> +/* PASID cache invalidation granu */
> +#define QI_PC_ALL_PASIDS	0
> +#define QI_PC_PASID_SEL		1
>  
>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> -#define QI_EIOTLB_AM(am)	(((u64)am))
> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>  
> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		1
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> +
>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> @@ -687,8 +692,15 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  			  unsigned int size_order, u64 type);
>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask);
> +
>  void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
>  		     unsigned long npages, bool ih);
> +
> +extern void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu);
> +
> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
> +
>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>  
>  extern int dmar_ir_support(void);
> 
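
For reference, typical call shapes for the two helpers added above (did,
sid, pfsid, qdep, pasid and addr are placeholders):

	/* Drop one PASID's entry from the PASID cache of domain 'did'. */
	qi_flush_pasid_cache(iommu, did, QI_PC_PASID_SEL, pasid);

	/* PASID-selective flush of a single 4KB page from the device IOTLB. */
	qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep,
				 addr, 0, QI_DEV_IOTLB_GRAN_PASID_SEL);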



* Re: [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support
  2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
                   ` (9 preceding siblings ...)
  2020-04-03 18:42 ` [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
@ 2020-04-05  8:28 ` Lu Baolu
  10 siblings, 0 replies; 33+ messages in thread
From: Lu Baolu @ 2020-04-05  8:28 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Eric Auger
  Cc: baolu.lu, Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

On 2020/4/4 2:42, Jacob Pan wrote:
> Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on Intel
> platforms allow address space sharing between device DMA and applications.
> SVA can reduce programming complexity and enhance security.
> This series is intended to enable SVA virtualization, i.e. enable use of SVA
> within a guest user application.
> 
> This is the remaining portion of the original patchset that is based on
> Joerg's x86/vt-d branch. The preparatory and cleanup patches are merged here.
> (git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)

With Eric's reviewed-by added, all patches in this series have been
piled up for v5.7.

Thank you all!

Best regards,
baolu


* Re: [PATCH v11 02/10] iommu/uapi: Define a mask for bind data
  2020-04-03 18:42 ` [PATCH v11 02/10] iommu/uapi: Define a mask for bind data Jacob Pan
@ 2020-04-08 13:07   ` Joerg Roedel
  2020-04-08 16:11     ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Joerg Roedel @ 2020-04-08 13:07 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Lu Baolu, iommu, LKML, David Woodhouse, Jean-Philippe Brucker,
	Eric Auger, Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

On Fri, Apr 03, 2020 at 11:42:06AM -0700, Jacob Pan wrote:
> Memory type related flags can be grouped together for one simple check.
> 
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/uapi/linux/iommu.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>  	__u32 pat;
>  	__u32 emt;
>  };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> +					 IOMMU_SVA_VTD_GPASID_PWT)

Where and how will this be used? Can you add this in the patch that
actually makes use of it?

Also please add newlines before and after that define.


Regards,

	Joerg


* Re: [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function
  2020-04-03 18:42 ` [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function Jacob Pan
@ 2020-04-08 13:33   ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-08 13:33 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> Nested translation mode is supported in the VT-d 3.0 spec, chapter 3.8.
> With the PASID granular translation type set to 0x11b, the translation
> result from the first level (FL) is also subject to second level (SL)
> page table translation. This mode is used for SVA virtualization,
> where FL performs guest virtual to guest physical translation and
> SL performs guest physical to host physical translation.
> 
> This patch adds a helper function for setting up nested translation
> where second level comes from a domain and first level comes from
> a guest PGD.
> 
> ---
> v11 Added check for nesting domain attr. Moved flags to header file.
>     Improved flow in MTS handling suggested by Eric.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/intel-iommu.c |  25 -----
>  drivers/iommu/intel-pasid.c | 246 +++++++++++++++++++++++++++++++++++++++++++-
>  drivers/iommu/intel-pasid.h |  12 +++
>  include/linux/intel-iommu.h |  28 +++++
>  4 files changed, 283 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index e599b2537b1c..c0dadec5a6b3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -296,31 +296,6 @@ static inline void context_clear_entry(struct context_entry *context)
>  static struct dmar_domain *si_domain;
>  static int hw_pass_through = 1;
>  
> -/* si_domain contains mulitple devices */
> -#define DOMAIN_FLAG_STATIC_IDENTITY		BIT(0)
> -
> -/*
> - * This is a DMA domain allocated through the iommu domain allocation
> - * interface. But one or more devices belonging to this domain have
> - * been chosen to use a private domain. We should avoid to use the
> - * map/unmap/iova_to_phys APIs on it.
> - */
> -#define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
> -
> -/*
> - * When VT-d works in the scalable mode, it allows DMA translation to
> - * happen through either first level or second level page table. This
> - * bit marks that the DMA translation for the domain goes through the
> - * first level page table, otherwise, it goes through the second level.
> - */
> -#define DOMAIN_FLAG_USE_FIRST_LEVEL		BIT(2)
> -
> -/*
> - * Domain represents a virtual machine which demands iommu nested
> - * translation mode support.
> - */
> -#define DOMAIN_FLAG_NESTING_MODE		BIT(3)
> -
>  #define for_each_domain_iommu(idx, domain)			\
>  	for (idx = 0; idx < g_num_of_iommus; idx++)		\
>  		if (domain->iommu_refcnt[idx])
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 66c364719ad1..fcd015644c4f 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
>  	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
>  }
>  
> +/*
> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> +}
> +
> +/*
> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> +{
> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> +}
> +
> +/*
> + * Setup the Cache Disable (CD) field (Bit 89)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_cd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> +}
> +
> +/*
> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emte(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> +}
> +
> +/*
> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_eafe(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> +}
> +
> +/*
> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pcd(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> +}
> +
> +/*
> + * Setup the Page-level Write-Through (PWT) field (Bit 94)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pwt(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> +}
> +
>  static void
>  pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>  				    u16 did, int pasid)
> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>  
>  	/* Setup Present and PASID Granular Transfer Type: */
> -	pasid_set_translation_type(pte, 1);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
>  	pasid_set_present(pte);
>  	pasid_flush_caches(iommu, pte, pasid, did);
>  
> @@ -562,7 +632,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_slptr(pte, pgd_val);
>  	pasid_set_address_width(pte, agaw);
> -	pasid_set_translation_type(pte, 2);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>  
> @@ -596,7 +666,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  	pasid_clear_entry(pte);
>  	pasid_set_domain_id(pte, did);
>  	pasid_set_address_width(pte, iommu->agaw);
> -	pasid_set_translation_type(pte, 4);
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
>  	pasid_set_fault_enable(pte);
>  	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>  
> @@ -610,3 +680,173 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  
>  	return 0;
>  }
> +
> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> +				struct pasid_entry *pte,
> +				struct iommu_gpasid_bind_data_vtd *pasid_data)
> +{
> +	/*
> +	 * Not all guest PASID table entry fields are passed down during bind;
> +	 * here we only set up the ones that depend on guest settings.
> +	 * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
> +	 * therefore not set. Other fields, such as snoop related, are set based
> +	 * on host needs regardless of guest settings.
> +	 */
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
> +		if (!ecap_srs(iommu->ecap)) {
> +			pr_err("No supervisor request support on %s\n",
> +			       iommu->name);
You still have a bunch of rate-unlimited traces that can be initiated by
userspace. With that fixed:

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> +			return -EINVAL;
> +		}
> +		pasid_set_sre(pte);
> +	}
> +
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> +		if (!ecap_eafs(iommu->ecap)) {
> +			pr_err("No extended access flag support on %s\n",
> +				iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_eafe(pte);
> +	}
> +
> +	/*
> +	 * Memory type is only applicable to devices inside the processor
> +	 * coherent domain. PCIe devices are not included. We can skip the rest
> +	 * of the flags if the IOMMU does not support MTS.
> +	 */
> +	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
> +		return 0;
> +
> +	if (!ecap_mts(iommu->ecap)) {
> +		pr_err("No memory type support for bind guest PASID on %s\n", iommu->name);
> +		return -EINVAL;
> +	}
> +
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
> +		pasid_set_emte(pte);
> +		pasid_set_emt(pte, pasid_data->emt);
> +	}
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
> +		pasid_set_pcd(pte);
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
> +		pasid_set_pwt(pte);
> +	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
> +		pasid_set_cd(pte);
> +	pasid_set_pat(pte, pasid_data->pat);
> +
> +	return 0;
> +}
> +
> +/**
> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
> + * This could be used for guest shared virtual address. In this case, the
> + * first level page tables are used for GVA-GPA translation in the guest,
> + * second level page tables are used for GPA-HPA translation.
> + *
> + * @iommu:      IOMMU which the device belong to
> + * @dev:        Device to be set up for translation
> + * @gpgd:       FLPTPTR: First Level Page translation pointer in GPA
> + * @pasid:      PASID to be programmed in the device PASID table
> + * @pasid_data: Additional PASID info from the guest bind request
> + * @domain:     Domain info for setting up second level page tables
> + * @addr_width: Address width of the first level (guest)
> + */
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *gpgd,
> +			int pasid, struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width)
> +{
> +	struct pasid_entry *pte;
> +	struct dma_pte *pgd;
> +	int ret = 0;
> +	u64 pgd_val;
> +	int agaw;
> +	u16 did;
> +
> +	if (!ecap_nest(iommu->ecap)) {
> +		pr_err_ratelimited("IOMMU: %s: No nested translation support\n",
> +		       iommu->name);
> +		return -EINVAL;
> +	}
> +
> +	if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> +		pr_err_ratelimited("Domain is not in nesting mode, %x\n", domain->flags);
> +		return -EINVAL;
> +	}
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	/*
> +	 * Caller must ensure the PASID entry is not in use, i.e. does not bind
> +	 * the same PASID to the same device twice.
> +	 */
> +	if (pasid_pte_is_present(pte))
> +		return -EBUSY;
> +
> +	pasid_clear_entry(pte);
> +
> +	/* Sanity checking is performed by the caller to make sure the
> +	 * address widths match in two dimensions:
> +	 * 1. CPU vs. IOMMU
> +	 * 2. Guest vs. Host.
> +	 */
> +	switch (addr_width) {
> +	case ADDR_WIDTH_5LEVEL:
> +		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> +			cap_5lp_support(iommu->cap)) {
> +			pasid_set_flpm(pte, 1);
> +		} else {
> +			dev_err_ratelimited(dev, "5-level paging not supported\n");
> +			return -EINVAL;
> +		}
> +		break;
> +	case ADDR_WIDTH_4LEVEL:
> +		pasid_set_flpm(pte, 0);
> +		break;
> +	default:
> +		dev_err_ratelimited(dev, "Invalid guest address width %d\n", addr_width);
> +		return -EINVAL;
> +	}
> +
> +	/* First level PGD is in GPA, must be supported by the second level */
> +	if ((u64)gpgd > domain->max_addr) {
> +		dev_err_ratelimited(dev, "Guest PGD %llx not supported, max %llx\n",
> +			(u64)gpgd, domain->max_addr);
> +		return -EINVAL;
> +	}
> +	pasid_set_flptr(pte, (u64)gpgd);
> +
> +	ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
> +	if (ret) {
> +		dev_err_ratelimited(dev, "Guest PASID bind data not supported\n");
> +		return ret;
> +	}
> +
> +	/* Setup the second level based on the given domain */
> +	pgd = domain->pgd;
> +
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err_ratelimited(dev, "Invalid domain page table\n");
> +		return -EINVAL;
> +	}
> +	pgd_val = virt_to_phys(pgd);
> +	pasid_set_slptr(pte, pgd_val);
> +	pasid_set_fault_enable(pte);
> +
> +	did = domain->iommu_did[iommu->seq_id];
> +	pasid_set_domain_id(pte, did);
> +
> +	pasid_set_address_width(pte, agaw);
> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> +
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> +	pasid_set_present(pte);
> +	pasid_flush_caches(iommu, pte, pasid, did);
> +
> +	return ret;
> +}
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 92de6df24ccb..698015ee3f04 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -36,6 +36,7 @@
>   * to vmalloc or even module mappings.
>   */
>  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> +#define PASID_FLAG_NESTED		BIT(1)
>  
>  /*
>   * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
> @@ -51,6 +52,11 @@ struct pasid_entry {
>  	u64 val[8];
>  };
>  
> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> +#define PASID_ENTRY_PGTT_NESTED		(3)
> +#define PASID_ENTRY_PGTT_PT		(4)
> +
>  /* The representative of a PASID table */
>  struct pasid_table {
>  	void			*table;		/* pasid table pointer */
> @@ -99,6 +105,12 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, int pasid);
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *pgd,
> +			int pasid,
> +			struct iommu_gpasid_bind_data_vtd *pasid_data,
> +			struct dmar_domain *domain,
> +			int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ed7171d2ae1f..6da03f627ba3 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -42,6 +42,9 @@
>  #define DMA_FL_PTE_PRESENT	BIT_ULL(0)
>  #define DMA_FL_PTE_XD		BIT_ULL(63)
>  
> +#define ADDR_WIDTH_5LEVEL	(57)
> +#define ADDR_WIDTH_4LEVEL	(48)
> +
>  #define CONTEXT_TT_MULTI_LEVEL	0
>  #define CONTEXT_TT_DEV_IOTLB	1
>  #define CONTEXT_TT_PASS_THROUGH 2
> @@ -480,6 +483,31 @@ struct context_entry {
>  	u64 hi;
>  };
>  
> +/* si_domain contains multiple devices */
> +#define DOMAIN_FLAG_STATIC_IDENTITY		BIT(0)
> +
> +/*
> + * This is a DMA domain allocated through the iommu domain allocation
> + * interface. But one or more devices belonging to this domain have
> + * been chosen to use a private domain. We should avoid using the
> + * map/unmap/iova_to_phys APIs on it.
> + */
> +#define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
> +
> +/*
> + * When VT-d works in the scalable mode, it allows DMA translation to
> + * happen through either first level or second level page table. This
> + * bit marks that the DMA translation for the domain goes through the
> + * first level page table, otherwise, it goes through the second level.
> + */
> +#define DOMAIN_FLAG_USE_FIRST_LEVEL		BIT(2)
> +
> +/*
> + * Domain represents a virtual machine which demands iommu nested
> + * translation mode support.
> + */
> +#define DOMAIN_FLAG_NESTING_MODE		BIT(3)
> +
>  struct dmar_domain {
>  	int	nid;			/* node id */
>  
> 
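
One note for readers following the new pasid_set_*() helpers above: they
all funnel into pasid_set_bits(), which is not visible in this hunk but is
roughly a masked read-modify-write on one 64-bit word of the 512-bit PASID
entry:

	static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
	{
		u64 old = READ_ONCE(*ptr);

		WRITE_ONCE(*ptr, (old & ~mask) | (bits & mask));
	}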



* Re: [PATCH v11 02/10] iommu/uapi: Define a mask for bind data
  2020-04-08 13:07   ` Joerg Roedel
@ 2020-04-08 16:11     ` Jacob Pan
  0 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-08 16:11 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Lu Baolu, iommu, LKML, David Woodhouse, Jean-Philippe Brucker,
	Eric Auger, Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, jacob.jun.pan

On Wed, 8 Apr 2020 15:07:22 +0200
Joerg Roedel <joro@8bytes.org> wrote:

> On Fri, Apr 03, 2020 at 11:42:06AM -0700, Jacob Pan wrote:
> > Memory type related flags can be grouped together for one simple
> > check.
> > 
> > ---
> > v9 renamed from EMT to MTS since these are memory type support
> > flags. ---
> > 
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  include/uapi/linux/iommu.h | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 4ad3496e5c43..d7bcbc5f79b0 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
> >  	__u32 pat;
> >  	__u32 emt;
> >  };
> > -
> > +#define IOMMU_SVA_VTD_GPASID_MTS_MASK
> > (IOMMU_SVA_VTD_GPASID_CD | \
> > +					 IOMMU_SVA_VTD_GPASID_EMTE
> > | \
> > +					 IOMMU_SVA_VTD_GPASID_PCD
> > |  \
> > +
> > IOMMU_SVA_VTD_GPASID_PWT)  
> 
> Where and how will this be used? Can you add this in the patch that
> actually makes use of it?
> 
They are used in a later patch when setting up the PASID entry for guest SVA:
[PATCH v11 04/10] iommu/vt-d: Add nested translation helper function
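
(There it gates all of the memory-type handling with a single check,
excerpted from that patch:

	if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
		return 0;

so the individual MTS flags only need testing when at least one is set.)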

> Also please add newlines before and after that define.
> 
Will do, thanks.

> 
> Regards,
> 
> 	Joerg

[Jacob Pan]


* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-03 18:42 ` [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support Jacob Pan
@ 2020-04-09  7:41   ` Auger Eric
  2020-04-10 19:45     ` Jacob Pan
  2020-04-10 21:06     ` Jacob Pan
  0 siblings, 2 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-09  7:41 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron, Yi L

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> When supporting guest SVA with an emulated IOMMU, the guest PASID
> table is shadowed in the VMM. Updates to the guest vIOMMU PASID table
> result in PASID cache flushes, which are passed down to
> the host as bind guest PASID calls.
> 
> The SL page tables are harvested from the device's
> default domain (request w/o PASID), or from the aux domain in the
> case of a mediated device.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> ---
> v11: Fixed locking, avoid duplicated paging mode check, added helper to
> free svm if device list is empty. Use rate limited error message since
> the bind gpasid call comes from user space.
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c |   4 +
>  drivers/iommu/intel-svm.c   | 206 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |   8 +-
>  include/linux/intel-svm.h   |  17 ++++
>  4 files changed, 234 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index c0dadec5a6b3..94c7993dac6a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> +#endif
>  };
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index d7f2a5358900..7cf711318b87 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
>  	list_for_each_entry((sdev), &(svm)->devs, list)	\
>  		if ((d) != (sdev)->dev) {} else
>  
> +
> +static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
> +{
> +	if (list_empty(&svm->devs)) {
> +		ioasid_set_data(pasid, NULL);
> +		kfree(svm);
> +	}
> +}
> +
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +			struct device *dev,
> +			struct iommu_gpasid_bind_data *data)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct dmar_domain *dmar_domain;
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = 0;
> +
> +	if (WARN_ON(!iommu) || !data)
> +		return -EINVAL;
> +
> +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> +		return -EINVAL;
> +
> +	if (dev_is_pci(dev)) {
> +		/* VT-d supports devices with full 20 bit PASIDs only */
> +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> +			return -EINVAL;
> +	} else {
> +		return -ENOTSUPP;
> +	}
> +
> +	/*
> +	 * We only check the host PASID range; we have no knowledge of the
> +	 * guest PASID range.
> +	 */
> +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> +		return -EINVAL;
> +
> +	dmar_domain = to_dmar_domain(domain);
> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, data->hpasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	if (svm) {
> +		/*
> +		 * If we found an svm for the PASID, there must be at
> +		 * least one device bound; otherwise the svm should have been freed.
> +		 */
> +		if (WARN_ON(list_empty(&svm->devs))) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		for_each_svm_dev(sdev, svm, dev) {
> +			/* In case of multiple sub-devices of the same pdev
> +			 * assigned, we should allow multiple bind calls with
> +			 * the same PASID and pdev.
> +			 */
> +			sdev->users++;
What if this is not an mdev device. Is it also allowed?
> +			goto out;
> +		}
> +	} else {
> +		/* We come here when the PASID has never been bound to a device. */
> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> +		if (!svm) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		/* REVISIT: the upper layer/VFIO can track the host process that
> +		 * binds the PASID. ioasid_set = mm might be sufficient for VFIO
> +		 * to check PASID VMM ownership. We can drop the following line
> +		 * once the VFIO and IOASID set check is in place.
> +		 */
> +		svm->mm = get_task_mm(current);
> +		svm->pasid = data->hpasid;
> +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> +			svm->gpasid = data->gpasid;
> +			svm->flags |= SVM_FLAG_GUEST_PASID;
> +		}
> +		ioasid_set_data(data->hpasid, svm);
> +		INIT_LIST_HEAD_RCU(&svm->devs);
> +		mmput(svm->mm);
> +	}
> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> +	if (!sdev) {
> +		/*
> +		 * If this is a new PASID that was never bound to a device, then
> +		 * the device list must be empty, which indicates that the struct
> +		 * svm was allocated in this function.
> +		 */
> +		intel_svm_free_if_empty(svm, data->hpasid);
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sdev->dev = dev;
> +	sdev->users = 1;
> +
> +	/* Set up device context entry for PASID if not enabled already */
> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> +	if (ret) {
> +		dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
> +		kfree(sdev);
> +		intel_svm_free_if_empty(svm, data->hpasid);
> +		goto out;
> +	}
> +
> +	/*
> +	 * PASID table is per device for better security. Therefore, for
> +	 * each bind of a new device even with an existing PASID, we need to
> +	 * call the nested mode setup function here.
> +	 */
> +	spin_lock(&iommu->lock);
> +	ret = intel_pasid_setup_nested(iommu,
> +				       dev,
> +				       (pgd_t *)data->gpgd,
> +				       data->hpasid,
> +				       &data->vtd,
> +				       dmar_domain,
> +				       data->addr_width);
> +	if (ret) {
> +		dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
> +			data->hpasid, ret);
> +		/*
> +		 * PASID entry should be in cleared state if nested mode
> +		 * set up failed. So we only need to clear IOASID tracking
> +		 * data such that free call will succeed.
> +		 */
> +		kfree(sdev);
> +		intel_svm_free_if_empty(svm, data->hpasid);
> +		spin_unlock(&iommu->lock);
> +		goto out;
> +	}
> +	spin_unlock(&iommu->lock);
> +	svm->flags |= SVM_FLAG_GUEST_MODE;
> +
> +	init_rcu_head(&sdev->rcu);
> +	list_add_rcu(&sdev->list, &svm->devs);
> + out:
> +	mutex_unlock(&pasid_mutex);
> +	return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
This function's code looks very similar to intel_svm_unbind_mm(),
aside from the mmu_notifier_unregister and the ioasid_free(). Don't we
have a means to factorize the code by checking the SVM_FLAG_GUEST_MODE flag?

Please could you explain again what happens if the guest process dies or
the VM dies? How do we make sure the svm gets freed? I fail to understand
the whole picture at the moment, as you cannot fully rely on userspace to
call unbind_gpasid.
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm;
> +	int ret = -EINVAL;
> +
> +	if (WARN_ON(!iommu))
> +		return -EINVAL;
> +
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (!svm) {
> +		ret = -EINVAL;
As per the discussion on the VFIO series, shall we return an error in
that case? Same below?
> +		goto out;
> +	}
> +
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	for_each_svm_dev(sdev, svm, dev) {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {
> +			list_del_rcu(&sdev->list);
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
Don't we need to flush the (DEV-)IOTLBs as well?
> +			/* TODO: Drain in-flight PRQ for the PASID since it
> +			 * may get reused soon; we don't want to
> +			 * confuse it with its previous life.
> +			 * intel_svm_drain_prq(dev, pasid);
> +			 */
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				/*
> +				 * We do not free the IOASID here in that
> +				 * IOMMU driver did not allocate it.
s/in/as?
> +				 * Unlike native SVM, IOASID for guest use was
> +				 * allocated prior to the bind call.
> +				 * In any case, if the free call comes before
> +				 * the unbind, IOMMU driver will get notified
> +				 * and perform cleanup.
> +				 */
> +				ioasid_set_data(pasid, NULL);
> +				kfree(svm);
> +			}
nit: you may use intel_svm_free_if_empty()
> +		}
> +		break;
> +	}
> +out:
> +	mutex_unlock(&pasid_mutex);
> +
nit : spare new line
> +	return ret;
> +}
> +
>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
>  {
>  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 6da03f627ba3..a5bd53cf190c 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device *dev);
>  extern void intel_svm_check(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_gpasid_bind_data *data);
> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>  struct svm_dev_ops;
>  
>  struct intel_svm_dev {
> @@ -723,9 +725,13 @@ struct intel_svm_dev {
>  struct intel_svm {
>  	struct mmu_notifier notifier;
>  	struct mm_struct *mm;
> +
>  	struct intel_iommu *iommu;
>  	int flags;
>  	int pasid;
> +	int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
> +		     * to guest PASID mapping.
> +		     */
>  	struct list_head devs;
>  	struct list_head list;
>  };
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index d7c403d0dd27..c19690937540 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>   * do such IOTLB flushes automatically.
>   */
>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds to a device.
> + * In this case the mm_struct is in the guest kernel or userspace, and its life
> + * cycle is managed by the VMM and VFIO layer. For the IOMMU driver, this API
> + * provides the means to bind/unbind a guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE	(1<<2)
> +/*
> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
> + * which requires guest and host PASID translation in both directions. We keep
> + * track of guest PASID in order to provide lookup service to device drivers.
> + * One such example is a physical function (PF) driver that supports mediated
> + * device (mdev) assignment. Guest programming of mdev configuration space can
> + * only be done with guest PASID, therefore PF driver needs to find the matching
> + * host PASID to program the real hardware.
> + */
> +#define SVM_FLAG_GUEST_PASID	(1<<3)
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  
> 
Thanks

Eric
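
For readers trying to follow the calling convention, a VMM-side sketch of
a bind request (struct fields per the uAPI already merged to the core
branch; the values are illustrative):

	struct iommu_gpasid_bind_data data = {
		.version    = IOMMU_GPASID_BIND_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_INTEL_VTD,
		.flags      = IOMMU_SVA_GPASID_VAL,	/* gpasid field is valid */
		.gpgd       = guest_cr3_gpa,		/* FL page table root, a GPA */
		.hpasid     = hpasid,			/* host PASID from VCMD allocation */
		.gpasid     = gpasid,			/* guest's view of the PASID */
		.addr_width = ADDR_WIDTH_4LEVEL,	/* 48-bit guest VA */
	};

	ret = iommu_sva_bind_gpasid(domain, dev, &data);

Unbind mirrors it via iommu_sva_unbind_gpasid(domain, dev, data.hpasid).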



* Re: [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function
  2020-04-03 18:42 ` [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
@ 2020-04-09  8:50   ` Auger Eric
  2020-04-10 21:56     ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Auger Eric @ 2020-04-09  8:50 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via a
> vIOMMU, we need to provide invalidation support at the IOMMU API and
> driver level. This patch adds an Intel VT-d specific function to
> implement the iommu passdown invalidate API for shared virtual address.
> 
> The use case is supporting caching structure invalidation
> for assigned SVM-capable devices. The emulated IOMMU exposes the queued
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that the guest-to-host device ID mapping is
> resolved prior to calling the IOMMU driver. Based on the device handle,
> the host IOMMU driver can replace certain fields before submitting to the
> invalidation queue.
> 
> ---
> v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
>       Fixed devTLB invalidation granularity mapping. Disregard G=1 case
>       and use address selective invalidation only.
> 
> v7 review fixed in v10
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 158 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 158 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 94c7993dac6a..045c5c08d71d 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5594,6 +5594,163 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>  }
>  
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of a DMA or VFIO unmap. However, for assigned devices the guest
> + * owns the first level page tables. Invalidations of translation caches in the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not support IOTLB granularity for request without PASID (second level).
> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + */
> +
> +static const int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> +	/*
> +	 * PASID based IOTLB invalidation: PASID selective (per PASID),
> +	 * page selective (address granularity)
> +	 */
> +	{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> +	/* PASID based dev TLBs */
> +	{-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> +	/* PASID cache */
> +	{-EINVAL, -EINVAL, -EINVAL}
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu)
> +{
> +	return inv_type_granu_table[type][granu];
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> +	u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> +	/*
> +	 * VT-d size is encoded as 2^size 4KB pages; 0 for 4KB, 9 for 2MB, etc.
> +	 * The IOMMU cache invalidate API passes granu_size in bytes and the
> +	 * number of granules of contiguous memory.
> +	 */
> +	return order_base_2(nr_pages);
> +}
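As an illustration (example values, not from the patch): a guest
invalidating one 2MB range within a PASID reaches this code as

	/* IOTLB type, address-selective granularity within PASID */
	granu = to_vtd_granularity(IOMMU_CACHE_INV_TYPE_IOTLB,
				   IOMMU_INV_GRANU_ADDR); /* = QI_GRAN_PSI_PASID */
	/* granule_size = 4K, nb_granules = 512 */
	size = to_vtd_size(SZ_4K, 512); /* (4K * 512) >> 12 = 512 pages,
					 * order_base_2(512) = 9, i.e. 2MB
					 */

so the VT-d descriptor encodes the range as 2^9 contiguous 4K pages.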
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct intel_iommu *iommu;
> +	unsigned long flags;
> +	int cache_type;
> +	u8 bus, devfn;
> +	u16 did, sid;
> +	int ret = 0;
> +	u64 size = 0;
> +
> +	if (!inv_info || !dmar_domain ||
> +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> +		return -EINVAL;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;

Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	spin_lock(&iommu->lock);
> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +	if (!info) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +
> +	/* Size is only valid in non-PASID selective invalidation */
> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> +		size = to_vtd_size(inv_info->addr_info.granule_size,
> +				   inv_info->addr_info.nb_granules);
> +
> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> +		int granu = 0;
> +		u64 pasid = 0;
> +
> +		granu = to_vtd_granularity(cache_type, inv_info->granularity);
> +		if (granu == -EINVAL) {
> +			pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
> +				inv_info->granularity);
These traces are not rate limited, here and below; user-triggerable
errors should use rate-limited messages.
Also, shouldn't this set ret = -EINVAL?
This also relates to the discussion we had on the VFIO series: what is
the error policy?
> +			break;
> +		}
> +
> +		/* PASID is stored in different locations based on granularity */
not sure the above comment is really needed
> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> +			inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> +			pasid = inv_info->pasid_info.pasid;
> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> +			inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> +			pasid = inv_info->addr_info.pasid;
> +
> +		switch (BIT(cache_type)) {
> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> +			if ((inv_info->granularity == IOMMU_INV_GRANU_ADDR) &&
> +				size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> +					inv_info->addr_info.addr, size);
> +				ret = -ERANGE;
> +				goto out_unlock;
> +			}
> +
> +			qi_flush_piotlb(iommu, did,
> +					pasid,
> +					mm_to_dma_pfn(inv_info->addr_info.addr),
This does not sound correct to me:
inv_info->addr_info.addr and inv_info->addr_info.flags are not valid in
case of PASID-selective invalidation
> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> +
> +			/*
> +			 * Always flush device IOTLB if ATS is enabled. vIOMMU
> +			 * in the guest may assume IOTLB flush is inclusive,
> +			 * which is more efficient.
> +			 */
> +			if (info->ats_enabled)
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
same
> +						granu);
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> +			if (info->ats_enabled)
> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> +						inv_info->addr_info.pasid, info->ats_qdep,
nit: use pasid directly
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			else
> +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_PASID:
> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> +			break;
> +		default:
> +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> +				cache_type);
> +			ret = -EINVAL;
> +		}
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}
> +#endif
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot, gfp_t gfp)
> @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>  #endif
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register
  2020-04-03 18:42 ` [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register Jacob Pan
@ 2020-04-09 10:14   ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-09 10:14 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> Virtual command registers are used in the guest only. To avoid
> vmexit cost, we cache the capability register during initialization.
> 
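The payoff is that later feature checks read the cached field instead of
issuing an MMIO read, which would trap to the VMM when running in a
guest, e.g. the check done in patch 10/10:

	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
		/* virtual command PASID allocation is usable */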
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
nit: following Joerg's comment on 02/10, it may be worth squashing this
into the patch that uses it (10/10)

Thanks

Eric
> 
> ---
> v7 Reviewed by Eric & Baolu
> ---
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/dmar.c        | 1 +
>  include/linux/intel-iommu.h | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 4d6b7b5b37ee..3b36491c8bbb 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
>  		warn_invalid_dmar(phys_addr, " returns all ones");
>  		goto unmap;
>  	}
> +	iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
>  
>  	/* the registers might be more than one page */
>  	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index f775bb825343..e02a96848586 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -194,6 +194,9 @@
>  #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
>  #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
>  
> +/* Virtual command interface capability */
> +#define vccap_pasid(v)		((v & DMA_VCS_PAS)) /* PASID allocation */
> +
>  /* IOTLB_REG */
>  #define DMA_TLB_FLUSH_GRANU_OFFSET  60
>  #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
> @@ -287,6 +290,7 @@
>  
>  /* PRS_REG */
>  #define DMA_PRS_PPR	((u32)1)
> +#define DMA_VCS_PAS	((u64)1)
>  
>  #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)			\
>  do {									\
> @@ -562,6 +566,7 @@ struct intel_iommu {
>  	u64		reg_size; /* size of hw register set */
>  	u64		cap;
>  	u64		ecap;
> +	u64		vccap;
>  	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
>  	raw_spinlock_t	register_lock; /* protect register handling */
>  	int		seq_id;	/* sequence id of the iommu */
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID
  2020-04-03 18:42 ` [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
@ 2020-04-09 10:31   ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-09 10:31 UTC (permalink / raw)
  To: Jacob Pan, Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Christoph Hellwig, Jonathan Cameron

Hi Jacob,

On 4/3/20 8:42 PM, Jacob Pan wrote:
> When the VT-d driver runs in the guest, PASID allocation must be
> performed via the virtual command interface. This patch registers a
> custom IOASID allocator which takes precedence over the default
> XArray based allocator. The resulting IOASID allocations will always
> come from the host. This ensures that the PASID namespace is
> system-wide.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 84 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |  2 ++
>  2 files changed, 86 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 045c5c08d71d..ff3f0386951f 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1732,6 +1732,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>  		if (ecap_prs(iommu->ecap))
>  			intel_svm_finish_prq(iommu);
>  	}
> +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> +		ioasid_unregister_allocator(&iommu->pasid_allocator);
> +
>  #endif
>  }
>  
> @@ -3266,6 +3269,84 @@ static int copy_translation_tables(struct intel_iommu *iommu)
>  	return ret;
>  }
>  
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static ioasid_t intel_vcmd_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
> +{
> +	struct intel_iommu *iommu = data;
> +	ioasid_t ioasid;
> +
> +	if (!iommu)
> +		return INVALID_IOASID;
> +	/*
> +	 * VT-d virtual command interface always uses the full 20 bit
> +	 * PASID range. Host can partition guest PASID range based on
> +	 * policies but it is out of guest's control.
> +	 */
> +	if (min < PASID_MIN || max > intel_pasid_max_id)
> +		return INVALID_IOASID;
> +
> +	if (vcmd_alloc_pasid(iommu, &ioasid))
> +		return INVALID_IOASID;
> +
> +	return ioasid;
> +}
> +
> +static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
> +{
> +	struct intel_iommu *iommu = data;
> +
> +	if (!iommu)
> +		return;
> +	/*
> +	 * Sanity checking of the ioasid owner is done at the upper layer,
> +	 * e.g. VFIO. We can only free the PASID when all devices are unbound.
> +	 */
> +	if (ioasid_find(NULL, ioasid, NULL)) {
> +		pr_alert("Cannot free active IOASID %d\n", ioasid);
> +		return;
> +	}
> +	vcmd_free_pasid(iommu, ioasid);
> +}
> +
> +static void register_pasid_allocator(struct intel_iommu *iommu)
> +{
> +	/*
> +	 * If we are running in the host, there is no need for a custom
> +	 * allocator, since PASIDs are allocated by the host system-wide.
> +	 */
> +	if (!cap_caching_mode(iommu->cap))
> +		return;
> +
> +	if (!sm_supported(iommu)) {
> +		pr_warn("VT-d Scalable Mode not enabled, no PASID allocation\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Register a custom PASID allocator if we are running in a guest;
> +	 * guest PASIDs must be obtained via the virtual command interface.
> +	 * There can be multiple vIOMMUs in each guest but only one allocator
> +	 * is active. All vIOMMU allocators will eventually call the same
> +	 * host allocator.
> +	 */
nit: I prefer
if (!ecap_vcs(iommu->ecap) || !vccap_pasid(iommu->vccap))
	return;
as it removes indents.
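i.e. the function body would read (untested sketch of the same logic):

	if (!ecap_vcs(iommu->ecap) || !vccap_pasid(iommu->vccap))
		return;

	pr_info("Register custom PASID allocator\n");
	iommu->pasid_allocator.alloc = intel_vcmd_ioasid_alloc;
	iommu->pasid_allocator.free = intel_vcmd_ioasid_free;
	iommu->pasid_allocator.pdata = (void *)iommu;
	if (ioasid_register_allocator(&iommu->pasid_allocator)) {
		pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
		intel_iommu_sm = 0;
	}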
		
> +	if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> +		pr_info("Register custom PASID allocator\n");
> +		iommu->pasid_allocator.alloc = intel_vcmd_ioasid_alloc;
> +		iommu->pasid_allocator.free = intel_vcmd_ioasid_free;
> +		iommu->pasid_allocator.pdata = (void *)iommu;
> +		if (ioasid_register_allocator(&iommu->pasid_allocator)) {
> +			pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
> +			/*
> +			 * Disable scalable mode on this IOMMU if there
> +			 * is no custom allocator. Mixing SM-capable and
> +			 * non-SM vIOMMUs is not supported.
> +			 */
> +			intel_iommu_sm = 0;
> +		}
> +	}
> +}
> +#endif
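For reference, once the allocator is registered, a plain ioasid_alloc()
is routed through it (sketch, assuming the ioasid_alloc() signature from
the IOASID core; the NULL set/private arguments are illustrative):

	ioasid_t pasid = ioasid_alloc(NULL, PASID_MIN,
				      intel_pasid_max_id - 1, NULL);
	/*
	 * IOASID core -> intel_vcmd_ioasid_alloc(min, max, pdata == iommu)
	 *             -> vcmd_alloc_pasid(), a virtual command trapped by
	 * the host, so the PASID comes from the host-wide namespace.
	 */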
> +
>  static int __init init_dmars(void)
>  {
>  	struct dmar_drhd_unit *drhd;
> @@ -3383,6 +3464,9 @@ static int __init init_dmars(void)
>  	 */
>  	for_each_active_iommu(iommu, drhd) {
>  		iommu_flush_write_buffer(iommu);
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +		register_pasid_allocator(iommu);
> +#endif
>  		iommu_set_root_entry(iommu);
>  		iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
>  		iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index f652db3198d9..e122cb30388e 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -19,6 +19,7 @@
>  #include <linux/iommu.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/dmar.h>
> +#include <linux/ioasid.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/iommu.h>
> @@ -588,6 +589,7 @@ struct intel_iommu {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	struct page_req_dsc *prq;
>  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> +	struct ioasid_allocator_ops pasid_allocator; /* Custom allocator for PASIDs */
>  #endif
>  	struct q_inval  *qi;            /* Queued invalidation info */
>  	u32 *iommu_state; /* Store iommu states between suspend and resume.*/
> 
Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-09  7:41   ` Auger Eric
@ 2020-04-10 19:45     ` Jacob Pan
  2020-04-16 10:43       ` Auger Eric
  2020-04-10 21:06     ` Jacob Pan
  1 sibling, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-10 19:45 UTC (permalink / raw)
  To: Auger Eric
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron, Yi L,
	jacob.jun.pan

Hi Eric,

On Thu, 9 Apr 2020 09:41:32 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/3/20 8:42 PM, Jacob Pan wrote:
> > When supporting guest SVA with an emulated IOMMU, the guest PASID
> > table is shadowed in the VMM. Updates to the guest vIOMMU PASID
> > table result in PASID cache flushes, which are passed down to
> > the host as bind guest PASID calls.
> > 
> > The SL page tables are harvested from the device's
> > default domain (request w/o PASID), or from the aux domain in
> > case of a mediated device.
> > 
> >     .-------------.  .---------------------------.
> >     |   vIOMMU    |  | Guest process CR3, FL only|
> >     |             |  '---------------------------'
> >     .----------------/
> >     | PASID Entry |--- PASID cache flush -
> >     '-------------'                       |
> >     |             |                       V
> >     |             |                CR3 in GPA
> >     '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> >     .-------------.  .----------------------.
> >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >     |             |  '----------------------'
> >     .----------------/  |
> >     | PASID Entry |     V (Nested xlate)
> >     '----------------\.------------------------------.
> >     |             |   |SL for GPA-HPA, default domain|
> >     |             |   '------------------------------'
> >     '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > ---
> > v11: Fixed locking, avoid duplicated paging mode check, added
> > helper to free svm if device list is empty. Use rate limited error
> > message since the bind gpasid call comes from user space.
> > ---
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 206
> > ++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h |   8 +- include/linux/intel-svm.h   |
> > 17 ++++ 4 files changed, 234 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index c0dadec5a6b3..94c7993dac6a
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >  	.is_attach_deferred	=
> > intel_iommu_is_attach_deferred, .pgsize_bitmap		=
> > INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> >  
> >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..7cf711318b87 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
> >  	list_for_each_entry((sdev), &(svm)->devs, list)	\
> >  		if ((d) != (sdev)->dev) {} else
> >  
> > +
> > +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
> > u64 pasid) +{
> > +	if (list_empty(&svm->devs)) {
> > +		ioasid_set_data(pasid, NULL);
> > +		kfree(svm);
> > +	}
> > +}
> > +
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +			struct device *dev,
> > +			struct iommu_gpasid_bind_data *data)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct dmar_domain *dmar_domain;
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!iommu) || !data)
> > +		return -EINVAL;
> > +
> > +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +		return -EINVAL;
> > +
> > +	if (dev_is_pci(dev)) {
> > +		/* VT-d supports devices with full 20 bit PASIDs
> > only */
> > +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +			return -EINVAL;
> > +	} else {
> > +		return -ENOTSUPP;
> > +	}
> > +
> > +	/*
> > +	 * We only check host PASID range, we have no knowledge to
> > check
> > +	 * guest PASID range.
> > +	 */
> > +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > +		return -EINVAL;
> > +
> > +	dmar_domain = to_dmar_domain(domain);
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, data->hpasid, NULL);
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	if (svm) {
> > +		/*
> > +		 * If we found svm for the PASID, there must be at
> > +		 * least one device bond, otherwise svm should be
> > freed.
> > +		 */
> > +		if (WARN_ON(list_empty(&svm->devs))) {
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		for_each_svm_dev(sdev, svm, dev) {
> > +			/* In case of multiple sub-devices of the
> > same pdev
> > +			 * assigned, we should allow multiple bind
> > calls with
> > +			 * the same PASID and pdev.
> > +			 */
> > +			sdev->users++;  
> What if this is not an mdev device. Is it also allowed?
Yes. The IOMMU core and VT-d driver are not mdev aware; here mdev is just
an example of a normal use case. You can bind the same PCI device (PF or
SR-IOV VF) more than once to the same PASID; you just need to unbind the
same number of times.

Do you see any issues?
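In other words, binds and unbinds of the same (dev, PASID) pair are
refcounted, e.g. (illustrative sequence):

	intel_svm_bind_gpasid(domain, dev, data); /* sdev->users == 1 */
	intel_svm_bind_gpasid(domain, dev, data); /* sdev->users == 2 */
	intel_svm_unbind_gpasid(dev, pasid);      /* users == 1, entry kept */
	intel_svm_unbind_gpasid(dev, pasid);      /* users == 0, entry torn down */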

> > +			goto out;
> > +		}
> > +	} else {
> > +		/* We come here when PASID has never been bond to
> > a device. */
> > +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > +		if (!svm) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		/* REVISIT: upper layer/VFIO can track host
> > process that bind the PASID.
> > +		 * ioasid_set = mm might be sufficient for vfio to
> > check pasid VMM
> > +		 * ownership. We can drop the following line once
> > VFIO and IOASID set
> > +		 * check is in place.
> > +		 */
> > +		svm->mm = get_task_mm(current);
> > +		svm->pasid = data->hpasid;
> > +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > +			svm->gpasid = data->gpasid;
> > +			svm->flags |= SVM_FLAG_GUEST_PASID;
> > +		}
> > +		ioasid_set_data(data->hpasid, svm);
> > +		INIT_LIST_HEAD_RCU(&svm->devs);
> > +		mmput(svm->mm);
> > +	}
> > +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > +	if (!sdev) {
> > +		/*
> > +		 * If this is a new PASID that never bond to a
> > device, then
> > +		 * the device list must be empty which indicates
> > struct svm
> > +		 * was allocated in this function.
> > +		 */
> > +		intel_svm_free_if_empty(svm, data->hpasid);
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	sdev->dev = dev;
> > +	sdev->users = 1;
> > +
> > +	/* Set up device context entry for PASID if not enabled
> > already */
> > +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > +	if (ret) {
> > +		dev_err_ratelimited(dev, "Failed to enable PASID
> > capability\n");
> > +		kfree(sdev);
> > +		intel_svm_free_if_empty(svm, data->hpasid);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * PASID table is per device for better security.
> > Therefore, for
> > +	 * each bind of a new device even with an existing PASID,
> > we need to
> > +	 * call the nested mode setup function here.
> > +	 */
> > +	spin_lock(&iommu->lock);
> > +	ret = intel_pasid_setup_nested(iommu,
> > +				       dev,
> > +				       (pgd_t *)data->gpgd,
> > +				       data->hpasid,
> > +				       &data->vtd,
> > +				       dmar_domain,
> > +				       data->addr_width);
> > +	if (ret) {
> > +		dev_err_ratelimited(dev, "Failed to set up PASID
> > %llu in nested mode, Err %d\n",
> > +			data->hpasid, ret);
> > +		/*
> > +		 * PASID entry should be in cleared state if
> > nested mode
> > +		 * set up failed. So we only need to clear IOASID
> > tracking
> > +		 * data such that free call will succeed.
> > +		 */
> > +		kfree(sdev);
> > +		intel_svm_free_if_empty(svm, data->hpasid);
> > +		spin_unlock(&iommu->lock);
> > +		goto out;
> > +	}
> > +	spin_unlock(&iommu->lock);
> > +	svm->flags |= SVM_FLAG_GUEST_MODE;
> > +
> > +	init_rcu_head(&sdev->rcu);
> > +	list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > +	mutex_unlock(&pasid_mutex);
> > +	return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{  
> This function's code looks very very similar to intel_svm_unbind_mm()
> besides to the mmu_notifier_unregister and the ioasid_free(). Don't we
> have means to factorize the code by checking the SVM_FLAG_GUEST_MODE
> flag?
> 
Yes, I agree there is room to consolidate. But at the same time there
are two potential changes coming:
1. I also try to consolidate with Jean's generic sva_bind_device() for
native bind.
https://lkml.org/lkml/2020/2/24/1351
2. Support enqcmd and lazy PASID free.
https://lkml.org/lkml/2020/3/30/910

Perhaps we can revisit after those get resolved?

> Please could you explain again what does happen if the guest process
> dies/VM dies. How do we make sure the svm get freed. I fail to
> understand the whole picture at the moment as you cannot fully rely on
> the userspace to call unbind_gpasid.

The fault handling needs to be better documented. Here is the
current plan, as I proposed in the IOASID extension set with the IOASID
notifier.
https://lkml.org/lkml/2020/3/25/874
I will write a document in the next version. In short, on VT-d, IOASID has
the following subscribers:
1. IOMMU driver, when free comes before unbind, since there is no way
to hook up with guest mmu_notifier directly.
2. KVM, needs to update VMCS PASID translation table. This is needed
when we support non-identity G-H PASID mapping.
3. VDCM (Virtual device composition module), similar to SRIOV PF
driver, also needs G-H PASID translation.

So when a guest goes away, based on FD close, VFIO needs to free all
the PASIDs belonging to the guest. The proposal is to call
ioasid_free_set(), where the ioasid_set contains all the IOASIDs
for the VM.
A blocking notifier will be called; the IOMMU gets notified and performs
all the cleanup on the IOMMU side as follows:
1. Clear the PASID table entry, flush the PASID cache, IOTLB and devTLB,
and drain page requests for the PASID, i.e. the same things you would do
during unbind.
2. Mark the PASID defunct such that we can drop further DMA faults and
PRQs.

For an individual guest PASID, the normal order is:
1. alloc (via virtual cmd)
2. bind_gpasid
3. unbind_gpasid
4. free (via virtual cmd)

If 4 comes before 3, free will call the ioasid notifier; the IOMMU gets
notified and performs the cleanup as above. When the unbind later comes,
there is no PASID entry left, so it simply exits.

One challenge is that the cleanup can be lengthy and requires thread
context, while SVA code needs to call ioasid_free in atomic context
(mmu_notifier release, SRCU callback). With Jean's patch to defer
ioasid_free, this restriction may go away.
https://www.spinics.net/lists/iommu/msg43065.html

I still feel the cleanup work in the notifier handler is too heavy.
Two options I can think of:
1. Let VFIO enforce the unbind->free order.
2. Introduce a reclaim phase for the IOASID instead of putting it
back into the pool immediately.
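For option 1, a rough sketch of what VFIO could do on guest teardown
(the device list iteration and its field names are made up for
illustration):

	/* VFIO, before returning a PASID to the allocator: */
	list_for_each_entry(vdev, &vm->device_list, next)
		iommu_sva_unbind_gpasid(vdev->domain, vdev->dev, pasid);
	ioasid_free(pasid); /* now guaranteed to come after all unbinds */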

Any suggestions?

> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm;
> > +	int ret = -EINVAL;
> > +
> > +	if (WARN_ON(!iommu))
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, pasid, NULL);
> > +	if (!svm) {
> > +		ret = -EINVAL;  
> As per the discussion on the VFIO series, shall we return an error in
> that case? Same below?
You mean make unbind a void function? I agree, there is really no way
to recover. Same for invalidate, which cannot fail.
Perhaps another series?

> > +		goto out;
> > +	}
> > +
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	for_each_svm_dev(sdev, svm, dev) {
> > +		ret = 0;
> > +		sdev->users--;
> > +		if (!sdev->users) {
> > +			list_del_rcu(&sdev->list);
> > +			intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);  
> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
> > +			/* TODO: Drain in flight PRQ for the PASID
> > since it
> > +			 * may get reused soon, we don't want to
> > +			 * confuse with its previous life.
> > +			 * intel_svm_drain_prq(dev, pasid);
> > +			 */
> > +			kfree_rcu(sdev, rcu);
> > +
> > +			if (list_empty(&svm->devs)) {
> > +				/*
> > +				 * We do not free the IOASID here
> > in that
> > +				 * IOMMU driver did not allocate
> > it.  
> s/in/as?
> > +				 * Unlike native SVM, IOASID for
> > guest use was
> > +				 * allocated prior to the bind
> > call.> +				 * In any case, if the free
> > call comes before
> > +				 * the unbind, IOMMU driver will
> > get notified
> > +				 * and perform cleanup.
> > +				 */
> > +				ioasid_set_data(pasid, NULL);
> > +				kfree(svm);
> > +			}  
> nit: you may use intel_svm_free_if_empty()
> > +		}
> > +		break;
> > +	}
> > +out:
> > +	mutex_unlock(&pasid_mutex);
> > +  
> nit : spare new line
Sounds good.

> > +	return ret;
> > +}
> > +
> >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops) {
> >  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 6da03f627ba3..a5bd53cf190c
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
> > *dev); extern void intel_svm_check(struct intel_iommu *iommu);
> >  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> >  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +		struct device *dev, struct iommu_gpasid_bind_data
> > *data); +extern int intel_svm_unbind_gpasid(struct device *dev, int
> > pasid); struct svm_dev_ops;
> >  
> >  struct intel_svm_dev {
> > @@ -723,9 +725,13 @@ struct intel_svm_dev {
> >  struct intel_svm {
> >  	struct mmu_notifier notifier;
> >  	struct mm_struct *mm;
> > +
> >  	struct intel_iommu *iommu;
> >  	int flags;
> >  	int pasid;
> > +	int gpasid; /* Guest PASID in case of vSVA bind with
> > non-identity host
> > +		     * to guest PASID mapping.
> > +		     */
> >  	struct list_head devs;
> >  	struct list_head list;
> >  };
> > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > index d7c403d0dd27..c19690937540 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> >   * do such IOTLB flushes automatically.
> >   */
> >  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
> > to a device.
> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE	(1<<2)
> > +/*
> > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > PASID space,
> > + * which requires guest and host PASID translation at both
> > directions. We keep
> > + * track of guest PASID in order to provide lookup service to
> > device drivers.
> > + * One such example is a physical function (PF) driver that
> > supports mediated
> > + * device (mdev) assignment. Guest programming of mdev
> > configuration space can
> > + * only be done with guest PASID, therefore PF driver needs to
> > find the matching
> > + * host PASID to program the real hardware.
> > + */
> > +#define SVM_FLAG_GUEST_PASID	(1<<3)
> >  
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-09  7:41   ` Auger Eric
  2020-04-10 19:45     ` Jacob Pan
@ 2020-04-10 21:06     ` Jacob Pan
  2020-04-16 10:46       ` Auger Eric
  1 sibling, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-10 21:06 UTC (permalink / raw)
  To: Auger Eric
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron, Yi L,
	jacob.jun.pan

Hi Eric,

Missed a few things in the last reply.

On Thu, 9 Apr 2020 09:41:32 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> > +			intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);  
> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
Right, PASID tear-down should always include a (DEV-)IOTLB flush; I
initially thought it was taken care of by intel_pasid_tear_down_entry().

> > +			/* TODO: Drain in flight PRQ for the PASID
> > since it
> > +			 * may get reused soon, we don't want to
> > +			 * confuse with its previous life.
> > +			 * intel_svm_drain_prq(dev, pasid);
> > +			 */
> > +			kfree_rcu(sdev, rcu);
> > +
> > +			if (list_empty(&svm->devs)) {
> > +				/*
> > +				 * We do not free the IOASID here
> > in that
> > +				 * IOMMU driver did not allocate
> > it.  
> s/in/as?
I meant to say "in that" as "for the reason that"

> > +				 * Unlike native SVM, IOASID for
> > guest use was
> > +				 * allocated prior to the bind
> > call.> +				 * In any case, if the free
> > call comes before
> > +				 * the unbind, IOMMU driver will
> > get notified
> > +				 * and perform cleanup.
> > +				 */
> > +				ioasid_set_data(pasid, NULL);
> > +				kfree(svm);
> > +			}  
> nit: you may use intel_svm_free_if_empty()
True, but I meant to insert ioasid_notifier under the list_empty
condition in the following ioasid patch.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function
  2020-04-09  8:50   ` Auger Eric
@ 2020-04-10 21:56     ` Jacob Pan
  2020-04-16 10:59       ` Auger Eric
  0 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-10 21:56 UTC (permalink / raw)
  To: Auger Eric
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron,
	jacob.jun.pan

On Thu, 9 Apr 2020 10:50:34 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/3/20 8:42 PM, Jacob Pan wrote:
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> > 
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> > 
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> > 
> > ---
> > v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
> >       Fixed devTLB invalidation granularity mapping. Disregard G=1
> > case and use address selective invalidation only.
> > 
> > v7 review fixed in v10
> > ---
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 158
> > ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5594,6 +5594,163 @@ static void
> > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >  
> > +/*
> > + * 2D array for converting and sanitizing IOMMU generic TLB
> > granularity to
> > + * VT-d granularity. Invalidation is typically included in the
> > unmap operation
> > + * as a result of DMA or VFIO unmap. However, for assigned devices
> > guest
> > + * owns the first level page tables. Invalidations of translation
> > caches in the
> > + * guest are trapped and passed down to the host.
> > + *
> > + * vIOMMU in the guest will only expose first level page tables,
> > therefore
> > + * we do not support IOTLB granularity for request without PASID
> > (second level).
> > + *
> > + * For example, to find the VT-d granularity encoding for IOTLB
> > + * type and page selective granularity within PASID:
> > + * X: indexed by iommu cache type
> > + * Y: indexed by enum iommu_inv_granularity
> > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > + */
> > +
> > +const static int
> > inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] =
> > {
> > +	/*
> > +	 * PASID based IOTLB invalidation: PASID selective (per
> > PASID),
> > +	 * page selective (address granularity)
> > +	 */
> > +	{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > +	/* PASID based dev TLBs */
> > +	{-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> > +	/* PASID cache */
> > +	{-EINVAL, -EINVAL, -EINVAL}
> > +};
> > +
> > +static inline int to_vtd_granularity(int type, int granu)
> > +{
> > +	return inv_type_granu_table[type][granu];
> > +}
> > +
> > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > +{
> > +	u64 nr_pages = (granu_size * nr_granules) >>
> > VTD_PAGE_SHIFT; +
> > +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
> > for 2MB, etc.
> > +	 * IOMMU cache invalidate API passes granu_size in bytes,
> > and number of
> > +	 * granu size in contiguous memory.
> > +	 */
> > +	return order_base_2(nr_pages);
> > +}
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct
> > iommu_cache_invalidate_info *inv_info) +{
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct intel_iommu *iommu;
> > +	unsigned long flags;
> > +	int cache_type;
> > +	u8 bus, devfn;
> > +	u16 did, sid;
> > +	int ret = 0;
> > +	u64 size = 0;
> > +
> > +	if (!inv_info || !dmar_domain ||
> > +		inv_info->version !=
> > IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > +		return -EINVAL;
> > +
> > +	if (!dev || !dev_is_pci(dev))
> > +		return -ENODEV;  
> 
> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
Good point

> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +
> > +	spin_lock_irqsave(&device_domain_lock, flags);
> > +	spin_lock(&iommu->lock);
> > +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> > devfn);
> > +	if (!info) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > +	sid = PCI_DEVID(bus, devfn);
> > +
> > +	/* Size is only valid in non-PASID selective invalidation
> > */
> > +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > +		size =
> > to_vtd_size(inv_info->addr_info.granule_size,
> > +
> > inv_info->addr_info.nb_granules); +
> > +	for_each_set_bit(cache_type, (unsigned long
> > *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> > +		int granu = 0;
> > +		u64 pasid = 0;
> > +
> > +		granu = to_vtd_granularity(cache_type,
> > inv_info->granularity);
> > +		if (granu == -EINVAL) {
> > +			pr_err("Invalid cache type and granu
> > combination %d/%d\n", cache_type,
> > +				inv_info->granularity);  
> rate unlimited traces here and after.
> ret = -EINVAL?
> also related to the discussion we had on the VFIO series. What is the
> error policy?
Invalidation should not fail; there is no way to tell the guest. This
can be a void function, or VFIO can ignore the error, similar to what
is done for iommu_unmap.


> > +			break;
> > +		}
> > +
> > +		/* PASID is stored in different locations based on
> > granularity */  
> not sure the above comment really is requested
Just to explain the situation: the PASID can be in either pasid_info
or addr_info, depending on the granularity.

> > +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID
> > &&
> > +			inv_info->pasid_info.flags &
> > IOMMU_INV_PASID_FLAGS_PASID)
> > +			pasid = inv_info->pasid_info.pasid;
> > +		else if (inv_info->granularity ==
> > IOMMU_INV_GRANU_ADDR &&
> > +			inv_info->addr_info.flags &
> > IOMMU_INV_ADDR_FLAGS_PASID)
> > +			pasid = inv_info->addr_info.pasid;
> > +
> > +		switch (BIT(cache_type)) {
> > +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > +			if ((inv_info->granularity ==
> > IOMMU_INV_GRANU_ADDR) &&
> > +				size && (inv_info->addr_info.addr
> > & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > +				pr_err("Address out of range,
> > 0x%llx, size order %llu\n",
> > +					inv_info->addr_info.addr,
> > size);
> > +				ret = -ERANGE;
> > +				goto out_unlock;
> > +			}
> > +
> > +			qi_flush_piotlb(iommu, did,
> > +					pasid,
> > +
> > mm_to_dma_pfn(inv_info->addr_info.addr),  
> This does not sound correct to me:
> inv_info->addr_info.addr and inv_info->addr_info.flags are not valid
> in case of PASID-selective invalidation
If granu is PASID-selective, this address is ignored, since npages = -1.
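i.e. for PASID-selective granularity the call effectively becomes
(sketch; with npages == -1 the address and leaf arguments are not used
when building the descriptor):

	qi_flush_piotlb(iommu, did, pasid,
			0 /* addr, ignored */,
			-1 /* npages: whole PASID */,
			0 /* ih, ignored */);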

> > +					(granu ==
> > QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > +					inv_info->addr_info.flags
> > & IOMMU_INV_ADDR_FLAGS_LEAF); +
> > +			/*
> > +			 * Always flush device IOTLB if ATS is
> > enabled. vIOMMU
> > +			 * in the guest may assume IOTLB flush is
> > inclusive,
> > +			 * which is more efficient.
> > +			 */
> > +			if (info->ats_enabled)
> > +				qi_flush_dev_iotlb_pasid(iommu,
> > sid, info->pfsid,
> > +						pasid,
> > info->ats_qdep,
> > +
> > inv_info->addr_info.addr, size,  
> same
> > +						granu);
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > +			if (info->ats_enabled)
> > +				qi_flush_dev_iotlb_pasid(iommu,
> > sid, info->pfsid,
> > +
> > inv_info->addr_info.pasid, info->ats_qdep,  
> nit: use pasid directly
will do.

> > +
> > inv_info->addr_info.addr, size,
> > +						granu);
> > +			else
> > +				pr_warn("Passdown device IOTLB
> > flush w/o ATS!\n");
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_PASID:
> > +			qi_flush_pasid_cache(iommu, did, granu,
> > inv_info->pasid_info.pasid);
> > +			break;
> > +		default:
> > +			dev_err(dev, "Unsupported IOMMU
> > invalidation type %d\n",
> > +				cache_type);
> > +			ret = -EINVAL;
> > +		}
> > +	}
> > +out_unlock:
> > +	spin_unlock(&iommu->lock);
> > +	spin_unlock_irqrestore(&device_domain_lock, flags);
> > +
> > +	return ret;
> > +}
> > +#endif
> > +
> >  static int intel_iommu_map(struct iommu_domain *domain,
> >  			   unsigned long iova, phys_addr_t hpa,
> >  			   size_t size, int iommu_prot, gfp_t gfp)
> > @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.is_attach_deferred	=
> > intel_iommu_is_attach_deferred, .pgsize_bitmap		=
> > INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >  #endif
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-10 19:45     ` Jacob Pan
@ 2020-04-16 10:43       ` Auger Eric
  2020-04-17  2:45         ` Tian, Kevin
  0 siblings, 1 reply; 33+ messages in thread
From: Auger Eric @ 2020-04-16 10:43 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron, Yi L

Hi Jacob,

On 4/10/20 9:45 PM, Jacob Pan wrote:
> Hi Eric,
> 
> On Thu, 9 Apr 2020 09:41:32 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/3/20 8:42 PM, Jacob Pan wrote:
>>> When supporting guest SVA with emulated IOMMU, the guest PASID
>>> table is shadowed in VMM. Updates to guest vIOMMU PASID table
>>> will result in PASID cache flush which will be passed down to
>>> the host as bind guest PASID calls.
>>>
>>> For the SL page tables, it will be harvested from device's
>>> default domain (request w/o PASID), or aux domain in case of
>>> mediated device.
>>>
>>>     .-------------.  .---------------------------.
>>>     |   vIOMMU    |  | Guest process CR3, FL only|
>>>     |             |  '---------------------------'
>>>     .----------------/
>>>     | PASID Entry |--- PASID cache flush -
>>>     '-------------'                       |
>>>     |             |                       V
>>>     |             |                CR3 in GPA
>>>     '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|--------
>>>       v        v                          v
>>> Host
>>>     .-------------.  .----------------------.
>>>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>>>     |             |  '----------------------'
>>>     .----------------/  |
>>>     | PASID Entry |     V (Nested xlate)
>>>     '----------------\.------------------------------.
>>>     |             |   |SL for GPA-HPA, default domain|
>>>     |             |   '------------------------------'
>>>     '-------------'
>>> Where:
>>>  - FL = First level/stage one page tables
>>>  - SL = Second level/stage two page tables
>>>
>>> ---
>>> v11: Fixed locking, avoid duplicated paging mode check, added
>>> helper to free svm if device list is empty. Use rate limited error
>>> message since the bind gpasid call comes from user space.
>>> ---
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/intel-iommu.c |   4 +
>>>  drivers/iommu/intel-svm.c   | 206
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/intel-iommu.h |   8 +- include/linux/intel-svm.h   |
>>> 17 ++++ 4 files changed, 234 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index c0dadec5a6b3..94c7993dac6a
>>> 100644 --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
>>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>>>  	.is_attach_deferred	=
>>> intel_iommu_is_attach_deferred, .pgsize_bitmap		=
>>> INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>> +#endif
>>>  };
>>>  
>>>  static void quirk_iommu_igfx(struct pci_dev *dev)
>>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
>>> index d7f2a5358900..7cf711318b87 100644
>>> --- a/drivers/iommu/intel-svm.c
>>> +++ b/drivers/iommu/intel-svm.c
>>> @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
>>>  	list_for_each_entry((sdev), &(svm)->devs, list)	\
>>>  		if ((d) != (sdev)->dev) {} else
>>>  
>>> +
>>> +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
>>> u64 pasid) +{
>>> +	if (list_empty(&svm->devs)) {
>>> +		ioasid_set_data(pasid, NULL);
>>> +		kfree(svm);
>>> +	}
>>> +}
>>> +
>>> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> +			struct device *dev,
>>> +			struct iommu_gpasid_bind_data *data)
>>> +{
>>> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> +	struct dmar_domain *dmar_domain;
>>> +	struct intel_svm_dev *sdev;
>>> +	struct intel_svm *svm;
>>> +	int ret = 0;
>>> +
>>> +	if (WARN_ON(!iommu) || !data)
>>> +		return -EINVAL;
>>> +
>>> +	if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
>>> +	    data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
>>> +		return -EINVAL;
>>> +
>>> +	if (dev_is_pci(dev)) {
>>> +		/* VT-d supports devices with full 20 bit PASIDs
>>> only */
>>> +		if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
>>> +			return -EINVAL;
>>> +	} else {
>>> +		return -ENOTSUPP;
>>> +	}
>>> +
>>> +	/*
>>> +	 * We only check host PASID range, we have no knowledge to
>>> check
>>> +	 * guest PASID range.
>>> +	 */
>>> +	if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
>>> +		return -EINVAL;
>>> +
>>> +	dmar_domain = to_dmar_domain(domain);
>>> +
>>> +	mutex_lock(&pasid_mutex);
>>> +	svm = ioasid_find(NULL, data->hpasid, NULL);
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>> +		goto out;
>>> +	}
>>> +
>>> +	if (svm) {
>>> +		/*
>>> +		 * If we found svm for the PASID, there must be at
>>> +		 * least one device bond, otherwise svm should be
>>> freed.
>>> +		 */
>>> +		if (WARN_ON(list_empty(&svm->devs))) {
>>> +			ret = -EINVAL;
>>> +			goto out;
>>> +		}
>>> +
>>> +		for_each_svm_dev(sdev, svm, dev) {
>>> +			/* In case of multiple sub-devices of the
>>> same pdev
>>> +			 * assigned, we should allow multiple bind
>>> calls with
>>> +			 * the same PASID and pdev.
>>> +			 */
>>> +			sdev->users++;  
>> What if this is not an mdev device. Is it also allowed?
> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> example of normal use case. You can bind the same PCI device (PF or
> SRIOV VF) more than once to the same PASID. Just need to unbind also.

I don't get the point of binding a non-mdev device several times with
the same PASID. Do you intend to allow that at the userspace level, or
to prevent it from happening in VFIO?

Besides, the comment is a bit misleading, as it gives the impression
this is only true for mdev, and there is no associated check.
> 
> Do you see any issues?
No, I don't.
> 
>>> +			goto out;
>>> +		}
>>> +	} else {
>>> +		/* We come here when PASID has never been bond to
>>> a device. */
>>> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
>>> +		if (!svm) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +		/* REVISIT: upper layer/VFIO can track host
>>> process that bind the PASID.
>>> +		 * ioasid_set = mm might be sufficient for vfio to
>>> check pasid VMM
>>> +		 * ownership. We can drop the following line once
>>> VFIO and IOASID set
>>> +		 * check is in place.
>>> +		 */
>>> +		svm->mm = get_task_mm(current);
>>> +		svm->pasid = data->hpasid;
>>> +		if (data->flags & IOMMU_SVA_GPASID_VAL) {
>>> +			svm->gpasid = data->gpasid;
>>> +			svm->flags |= SVM_FLAG_GUEST_PASID;
>>> +		}
>>> +		ioasid_set_data(data->hpasid, svm);
>>> +		INIT_LIST_HEAD_RCU(&svm->devs);
>>> +		mmput(svm->mm);
>>> +	}
>>> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
>>> +	if (!sdev) {
>>> +		/*
>>> +		 * If this is a new PASID that never bond to a
>>> device, then
>>> +		 * the device list must be empty which indicates
>>> struct svm
>>> +		 * was allocated in this function.
>>> +		 */
>>> +		intel_svm_free_if_empty(svm, data->hpasid);
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +	sdev->dev = dev;
>>> +	sdev->users = 1;
>>> +
>>> +	/* Set up device context entry for PASID if not enabled
>>> already */
>>> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
>>> +	if (ret) {
>>> +		dev_err_ratelimited(dev, "Failed to enable PASID
>>> capability\n");
>>> +		kfree(sdev);
>>> +		intel_svm_free_if_empty(svm, data->hpasid);
>>> +		goto out;
>>> +	}
>>> +
>>> +	/*
>>> +	 * PASID table is per device for better security.
>>> Therefore, for
>>> +	 * each bind of a new device even with an existing PASID,
>>> we need to
>>> +	 * call the nested mode setup function here.
>>> +	 */
>>> +	spin_lock(&iommu->lock);
>>> +	ret = intel_pasid_setup_nested(iommu,
>>> +				       dev,
>>> +				       (pgd_t *)data->gpgd,
>>> +				       data->hpasid,
>>> +				       &data->vtd,
>>> +				       dmar_domain,
>>> +				       data->addr_width);
>>> +	if (ret) {
>>> +		dev_err_ratelimited(dev, "Failed to set up PASID
>>> %llu in nested mode, Err %d\n",
>>> +			data->hpasid, ret);
>>> +		/*
>>> +		 * PASID entry should be in cleared state if
>>> nested mode
>>> +		 * set up failed. So we only need to clear IOASID
>>> tracking
>>> +		 * data such that free call will succeed.
>>> +		 */
>>> +		kfree(sdev);
>>> +		intel_svm_free_if_empty(svm, data->hpasid);
>>> +		spin_unlock(&iommu->lock);
>>> +		goto out;
>>> +	}
>>> +	spin_unlock(&iommu->lock);
>>> +	svm->flags |= SVM_FLAG_GUEST_MODE;
>>> +
>>> +	init_rcu_head(&sdev->rcu);
>>> +	list_add_rcu(&sdev->list, &svm->devs);
>>> + out:
>>> +	mutex_unlock(&pasid_mutex);
>>> +	return ret;
>>> +}
>>> +
>>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>>> +{  
>> This function's code looks very very similar to intel_svm_unbind_mm()
>> besides to the mmu_notifier_unregister and the ioasid_free(). Don't we
>> have means to factorize the code by checking the SVM_FLAG_GUEST_MODE
>> flag?
>>
> Yes, I agree there is room to consolidate. But the same time there are
> two potential changes coming:
> 1. I also try to consolidate with Jean's generic sva_bind_device() for
> native bind.
> https://lkml.org/lkml/2020/2/24/1351
> 2. Support enqcmd and lazy PASID free.
> https://lkml.org/lkml/2020/3/30/910
> 
> Perhaps we can revisit after those get resolved?
yep
> 
>> Please could you explain again what does happen if the guest process
>> dies/VM dies. How do we make sure the svm get freed. I fail to
>> understand the whole picture at the moment as you cannot fully rely on
>> the userspace to call unbind_gpasid.
> 
> The fault handling needs to be better documented. Here is the
> current plan as I proposed in the IOASID extension set with IOASID
> notifier.
> https://lkml.org/lkml/2020/3/25/874
> I will do a document in the next version. In short On VT-d, IOASID has
> the following subscribers:
> 1. IOMMU driver, when free comes before unbind, since there is no way
> to hook up with guest mmu_notifier directly.
> 2. KVM, needs to update VMCS PASID translation table. This is needed
> when we support non-identity G-H PASID mapping.
> 3. VDCM (Virtual device composition module), similar to SRIOV PF
> driver, also needs G-H PASID translation.
> 
> So when a guest goes away, based on FD close VFIO needs to free all
> the PASIDs belong to the guest. The proposal is to call
> ioasid_free_set() where the ioasid_set contains all the IOASIDs
> for the VM.
> A blocking notifier will be called, IOMMU gets notified and performs
> all the clean up on IOMMU side as follows:
> 1.clear PASID table entry, flush PASID and IOTLB, devTLB. Drain page
> requests for the PASID. The same thing you would do during unbind.
> 2.mark PASID defunct such that we can drop further DMA faults and PRQ
> 
> For individual guest PASID, the normal order is 
> 1. alloc (via virtual cmd)
> 2. bind_gpasid
> 3. unbind_gpasid
> 4. free (via virtual cmd)
> 
> If 4 comes before 3, free will call ioasid notifier. IOMMU gets
> notified and perform cleanup as above. When unbind comes, there is no
> PASID entry then exit.
[*] Could you detail when the free comes before the unbind, and the opposite?
> 
> One challenge is the cleanup can be lengthy and requires thread
> context. SVA code needed to call ioasid_free in atomic context
> (mmu_notifier release, SRCU callback). This might be changing with
> Jean's patch to defer ioasid_free, this restriction may go away.
> https://www.spinics.net/lists/iommu/msg43065.html
> 
> I still feel the clean up work in notifier handler is too heavy.
> Two options I can think of:
> 1. let VFIO enforce the unbind->free order
depends on [*]
> 2. introduce some reclaim phase for IOASID instead of putting the
> IOASID back to the pool immediately.
What do you mean by a reclaim phase?

Thanks

Eric
> 
> Any suggestions?

> 
>>> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> +	struct intel_svm_dev *sdev;
>>> +	struct intel_svm *svm;
>>> +	int ret = -EINVAL;
>>> +
>>> +	if (WARN_ON(!iommu))
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&pasid_mutex);
>>> +	svm = ioasid_find(NULL, pasid, NULL);
>>> +	if (!svm) {
>>> +		ret = -EINVAL;  
>> As per the discussion on the VFIO series, shall we return an error in
>> that case? Same below?
> You mean make unbind a void function? I agree, there is really no way
> to recover. Same for invalidate, which cannot fail.
> Perhaps another series?
> 
>>> +		goto out;
>>> +	}
>>> +
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>> +		goto out;
>>> +	}
>>> +
>>> +	for_each_svm_dev(sdev, svm, dev) {
>>> +		ret = 0;
>>> +		sdev->users--;
>>> +		if (!sdev->users) {
>>> +			list_del_rcu(&sdev->list);
>>> +			intel_pasid_tear_down_entry(iommu, dev,
>>> svm->pasid);  
>> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
>> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
>>> +			/* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> +			 * may get reused soon, we don't want to
>>> +			 * confuse with its previous life.
>>> +			 * intel_svm_drain_prq(dev, pasid);
>>> +			 */
>>> +			kfree_rcu(sdev, rcu);
>>> +
>>> +			if (list_empty(&svm->devs)) {
>>> +				/*
>>> +				 * We do not free the IOASID here
>>> in that
>>> +				 * IOMMU driver did not allocate
>>> it.  
>> s/in/as?
>>> +				 * Unlike native SVM, IOASID for
>>> guest use was
>>> +				 * allocated prior to the bind
>>> call.> +				 * In any case, if the free
>>> call comes before
>>> +				 * the unbind, IOMMU driver will
>>> get notified
>>> +				 * and perform cleanup.
>>> +				 */
>>> +				ioasid_set_data(pasid, NULL);
>>> +				kfree(svm);
>>> +			}  
>> nit: you may use intel_svm_free_if_empty()
>>> +		}
>>> +		break;
>>> +	}
>>> +out:
>>> +	mutex_unlock(&pasid_mutex);
>>> +  
>> nit : spare new line
> Sounds good.
> 
>>> +	return ret;
>>> +}
>>> +
>>>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
>>> struct svm_dev_ops *ops) {
>>>  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 6da03f627ba3..a5bd53cf190c
>>> 100644 --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
>>> *dev); extern void intel_svm_check(struct intel_iommu *iommu);
>>>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>>>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>>> -
>>> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> +		struct device *dev, struct iommu_gpasid_bind_data
>>> *data); +extern int intel_svm_unbind_gpasid(struct device *dev, int
>>> pasid); struct svm_dev_ops;
>>>  
>>>  struct intel_svm_dev {
>>> @@ -723,9 +725,13 @@ struct intel_svm_dev {
>>>  struct intel_svm {
>>>  	struct mmu_notifier notifier;
>>>  	struct mm_struct *mm;
>>> +
>>>  	struct intel_iommu *iommu;
>>>  	int flags;
>>>  	int pasid;
>>> +	int gpasid; /* Guest PASID in case of vSVA bind with
>>> non-identity host
>>> +		     * to guest PASID mapping.
>>> +		     */
>>>  	struct list_head devs;
>>>  	struct list_head list;
>>>  };
>>> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
>>> index d7c403d0dd27..c19690937540 100644
>>> --- a/include/linux/intel-svm.h
>>> +++ b/include/linux/intel-svm.h
>>> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>>>   * do such IOTLB flushes automatically.
>>>   */
>>>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
>>> +/*
>>> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process binds
>>> to a device.
>>> + * In this case the mm_struct is in the guest kernel or userspace,
>>> its life
>>> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
>>> API provides
>>> + * means to bind/unbind guest CR3 with PASIDs allocated for a
>>> device.
>>> + */
>>> +#define SVM_FLAG_GUEST_MODE	(1<<2)
>>> +/*
>>> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
>>> PASID space,
>>> which requires guest and host PASID translation in both
>>> directions. We keep
>>> + * track of guest PASID in order to provide lookup service to
>>> device drivers.
>>> + * One such example is a physical function (PF) driver that
>>> supports mediated
>>> + * device (mdev) assignment. Guest programming of mdev
>>> configuration space can
>>> + * only be done with guest PASID, therefore PF driver needs to
>>> find the matching
>>> + * host PASID to program the real hardware.
>>> + */
>>> +#define SVM_FLAG_GUEST_PASID	(1<<3)
>>>  
>>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>>  
>>>   
>> Thanks
>>
>> Eric
>>
> 
> [Jacob Pan]
> 



* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-10 21:06     ` Jacob Pan
@ 2020-04-16 10:46       ` Auger Eric
  0 siblings, 0 replies; 33+ messages in thread
From: Auger Eric @ 2020-04-16 10:46 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron, Yi L

Hi Jacob,

On 4/10/20 11:06 PM, Jacob Pan wrote:
> Hi Eric,
> 
> Missed a few things in the last reply.
> 
> On Thu, 9 Apr 2020 09:41:32 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>>> +			intel_pasid_tear_down_entry(iommu, dev,
>>> svm->pasid);  
>> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
>> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
> Right, PASID teardown should always include a (DEV-)IOTLB flush; I
> initially thought it was taken care of by intel_pasid_tear_down_entry().
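OK. Then a minimal sketch of the fix, reusing the call that
intel_svm_unbind_mm() already makes (sketch only):

	intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
	/* flush the PASID-tagged IOTLB and, with ATS enabled, the
	 * device IOTLB; 0, -1, 0 = whole address space, no hint */
	intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
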
> 
>>> +			/* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> +			 * may get reused soon, we don't want to
>>> +			 * confuse with its previous life.
>>> +			 * intel_svm_drain_prq(dev, pasid);
>>> +			 */
>>> +			kfree_rcu(sdev, rcu);
>>> +
>>> +			if (list_empty(&svm->devs)) {
>>> +				/*
>>> +				 * We do not free the IOASID here
>>> in that
>>> +				 * IOMMU driver did not allocate
>>> it.  
>> s/in/as?
> I meant to say "in that" as "for the reason that"
ok sorry
> 
>>> +				 * Unlike native SVM, IOASID for
>>> guest use was
>>> +				 * allocated prior to the bind
> >>> call.
> >>> +				 * In any case, if the free
>>> call comes before
>>> +				 * the unbind, IOMMU driver will
>>> get notified
>>> +				 * and perform cleanup.
>>> +				 */
>>> +				ioasid_set_data(pasid, NULL);
>>> +				kfree(svm);
>>> +			}  
>> nit: you may use intel_svm_free_if_empty()
> True, but I meant to insert ioasid_notifier under the list_empty
> condition in the following ioasid patch.
ok

Thanks

Eric
> 
> 
> Thanks,
> 
> Jacob
> 



* Re: [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function
  2020-04-10 21:56     ` Jacob Pan
@ 2020-04-16 10:59       ` Auger Eric
  2020-04-17 16:14         ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Auger Eric @ 2020-04-16 10:59 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron

Hi Jacob,
On 4/10/20 11:56 PM, Jacob Pan wrote:
> On Thu, 9 Apr 2020 10:50:34 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/3/20 8:42 PM, Jacob Pan wrote:
>>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>>> vIOMMU, we need to provide invalidation support at IOMMU API and
>>> driver level. This patch adds Intel VT-d specific function to
>>> implement iommu passdown invalidate API for shared virtual address.
>>>
>>> The use case is for supporting caching structure invalidation
>>> of assigned SVM capable devices. Emulated IOMMU exposes queue
>>> invalidation capability and passes down all descriptors from the
>>> guest to the physical IOMMU.
>>>
>>> The assumption is that guest to host device ID mapping should be
>>> resolved prior to calling IOMMU driver. Based on the device handle,
>>> host IOMMU driver can replace certain fields before submitting to the
>>> invalidation queue.
>>>
>>> ---
>>> v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
>>>       Fixed devTLB invalidation granularity mapping. Disregard G=1
>>> case and use address selective invalidation only.
>>>
>>> v7 review fixed in v10
>>> ---
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
>>> ---
>>>  drivers/iommu/intel-iommu.c | 158
>>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
>>> insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
>>> 100644 --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -5594,6 +5594,163 @@ static void
>>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
>>>  
>>> +/*
>>> + * 2D array for converting and sanitizing IOMMU generic TLB
>>> granularity to
>>> + * VT-d granularity. Invalidation is typically included in the
>>> unmap operation
>>> + * as a result of DMA or VFIO unmap. However, for assigned devices
>>> guest
>>> + * owns the first level page tables. Invalidations of translation
>>> caches in the
>>> + * guest are trapped and passed down to the host.
>>> + *
>>> + * vIOMMU in the guest will only expose first level page tables,
>>> therefore
>>> + * we do not support IOTLB granularity for request without PASID
>>> (second level).
>>> + *
>>> + * For example, to find the VT-d granularity encoding for IOTLB
>>> + * type and page selective granularity within PASID:
>>> + * X: indexed by iommu cache type
>>> + * Y: indexed by enum iommu_inv_granularity
>>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
>>> + */
>>> +
>>> +const static int
>>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] =
>>> {
>>> +	/*
>>> +	 * PASID based IOTLB invalidation: PASID selective (per
>>> PASID),
>>> +	 * page selective (address granularity)
>>> +	 */
>>> +	{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>>> +	/* PASID based dev TLBs */
>>> +	{-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
>>> +	/* PASID cache */
>>> +	{-EINVAL, -EINVAL, -EINVAL}
>>> +};
>>> +
>>> +static inline int to_vtd_granularity(int type, int granu)
>>> +{
>>> +	return inv_type_granu_table[type][granu];
>>> +}
>>> +
>>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>>> +{
>>> +	u64 nr_pages = (granu_size * nr_granules) >>
>>> VTD_PAGE_SHIFT; +
>>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
>>> for 2MB, etc.
>>> +	 * IOMMU cache invalidate API passes granu_size in bytes,
>>> and the number of
>>> +	 * granules in contiguous memory.
>>> +	 */
>>> +	return order_base_2(nr_pages);
>>> +}
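A worked example might help readers here (illustration only, not part
of the patch):

	/* 512 contiguous 4K granules, i.e. a 2MB range:
	 *   granu_size = 4096, nr_granules = 512
	 *   nr_pages = (4096 * 512) >> VTD_PAGE_SHIFT = 512
	 *   to_vtd_size(4096, 512) = order_base_2(512) = 9
	 * and, from the lookup table above:
	 *   to_vtd_granularity(IOMMU_CACHE_INV_TYPE_IOTLB,
	 *                      IOMMU_INV_GRANU_ADDR) == QI_GRAN_PSI_PASID
	 */
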
>>> +
>>> +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
>>> +		struct device *dev, struct
>>> iommu_cache_invalidate_info *inv_info) +{
>>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>>> +	struct device_domain_info *info;
>>> +	struct intel_iommu *iommu;
>>> +	unsigned long flags;
>>> +	int cache_type;
>>> +	u8 bus, devfn;
>>> +	u16 did, sid;
>>> +	int ret = 0;
>>> +	u64 size = 0;
>>> +
>>> +	if (!inv_info || !dmar_domain ||
>>> +		inv_info->version !=
>>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>>> +		return -EINVAL;
>>> +
>>> +	if (!dev || !dev_is_pci(dev))
>>> +		return -ENODEV;  
>>
>> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
> Good point
> 
>>> +
>>> +	iommu = device_to_iommu(dev, &bus, &devfn);
>>> +	if (!iommu)
>>> +		return -ENODEV;
>>> +
>>> +	spin_lock_irqsave(&device_domain_lock, flags);
>>> +	spin_lock(&iommu->lock);
>>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
>>> devfn);
>>> +	if (!info) {
>>> +		ret = -EINVAL;
>>> +		goto out_unlock;
>>> +	}
>>> +	did = dmar_domain->iommu_did[iommu->seq_id];
>>> +	sid = PCI_DEVID(bus, devfn);
>>> +
>>> +	/* Size is only valid in non-PASID selective invalidation
>>> */
>>> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
>>> +		size =
>>> to_vtd_size(inv_info->addr_info.granule_size,
>>> +
>>> inv_info->addr_info.nb_granules); +
>>> +	for_each_set_bit(cache_type, (unsigned long
>>> *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
>>> +		int granu = 0;
>>> +		u64 pasid = 0;
>>> +
>>> +		granu = to_vtd_granularity(cache_type,
>>> inv_info->granularity);
>>> +		if (granu == -EINVAL) {
>>> +			pr_err("Invalid cache type and granu
>>> combination %d/%d\n", cache_type,
>>> +				inv_info->granularity);  
>> rate unlimited traces here and after.
still to be fixed
>> ret = -EINVAL?
>> also related to the discussion we had on the VFIO series. What is the
>> error policy?
> Invalidation should not fail; there is no way to tell the guest. This
> can be a void function, or VFIO can ignore the return value, similar
> to what is done for iommu_unmap.
OK, it was just a reminder.
> 
> 
>>> +			break;
>>> +		}
>>> +
>>> +		/* PASID is stored in different locations based on
>>> granularity */  
>> not sure the above comment really is requested
> Just to explain the situation where PASID can be in pasid_info or
> addr_info.
> 
>>> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID
>>> &&
>>> +			inv_info->pasid_info.flags &
>>> IOMMU_INV_PASID_FLAGS_PASID)
>>> +			pasid = inv_info->pasid_info.pasid;
>>> +		else if (inv_info->granularity ==
>>> IOMMU_INV_GRANU_ADDR &&
>>> +			inv_info->addr_info.flags &
>>> IOMMU_INV_ADDR_FLAGS_PASID)
>>> +			pasid = inv_info->addr_info.pasid;
>>> +
>>> +		switch (BIT(cache_type)) {
>>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
>>> +			if ((inv_info->granularity ==
>>> IOMMU_INV_GRANU_ADDR) &&
>>> +				size && (inv_info->addr_info.addr
>>> & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>>> +				pr_err("Address out of range,
>>> 0x%llx, size order %llu\n",
>>> +					inv_info->addr_info.addr,
>>> size);
>>> +				ret = -ERANGE;
>>> +				goto out_unlock;
>>> +			}
>>> +
>>> +			qi_flush_piotlb(iommu, did,
>>> +					pasid,
>>> +
>>> mm_to_dma_pfn(inv_info->addr_info.addr),  
>> This does not sound correct to me:
>> inv_info->addr_info.addr and inv_info->addr_info.flags are not valid
>> in case of PASID-selective invalidation
> If granu is PASID-selective, this address is ignored, since npages = -1.
Ah OK. Maybe add a comment, since it is a bit strange to see irrelevant
fields being passed in the PASID-selective case.
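Something along these lines, for instance (sketch):

	/* npages == -1 means PASID-selective invalidation of the
	 * whole address space; addr_info.addr and addr_info.flags
	 * are ignored by qi_flush_piotlb() in that case.
	 */
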

Thanks

Eric
> 
>>> +					(granu ==
>>> QI_GRAN_NONG_PASID) ? -1 : 1 << size,
>>> +					inv_info->addr_info.flags
>>> & IOMMU_INV_ADDR_FLAGS_LEAF); +
>>> +			/*
>>> +			 * Always flush device IOTLB if ATS is
>>> enabled. vIOMMU
>>> +			 * in the guest may assume IOTLB flush is
>>> inclusive,
>>> +			 * which is more efficient.
>>> +			 */
>>> +			if (info->ats_enabled)
>>> +				qi_flush_dev_iotlb_pasid(iommu,
>>> sid, info->pfsid,
>>> +						pasid,
>>> info->ats_qdep,
>>> +
>>> inv_info->addr_info.addr, size,  
>> same
>>> +						granu);
>>> +			break;
>>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>>> +			if (info->ats_enabled)
>>> +				qi_flush_dev_iotlb_pasid(iommu,
>>> sid, info->pfsid,
>>> +
>>> inv_info->addr_info.pasid, info->ats_qdep,  
>> nit: use pasid directly
> will do.
> 
>>> +
>>> inv_info->addr_info.addr, size,
>>> +						granu);
>>> +			else
>>> +				pr_warn("Passdown device IOTLB
>>> flush w/o ATS!\n");
>>> +			break;
>>> +		case IOMMU_CACHE_INV_TYPE_PASID:
>>> +			qi_flush_pasid_cache(iommu, did, granu,
>>> inv_info->pasid_info.pasid);
>>> +			break;
>>> +		default:
>>> +			dev_err(dev, "Unsupported IOMMU
>>> invalidation type %d\n",
>>> +				cache_type);
>>> +			ret = -EINVAL;
>>> +		}
>>> +	}
>>> +out_unlock:
>>> +	spin_unlock(&iommu->lock);
>>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>>> +
>>> +	return ret;
>>> +}
>>> +#endif
>>> +
>>>  static int intel_iommu_map(struct iommu_domain *domain,
>>>  			   unsigned long iova, phys_addr_t hpa,
>>>  			   size_t size, int iommu_prot, gfp_t gfp)
>>> @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
>>>  	.is_attach_deferred	=
>>> intel_iommu_is_attach_deferred, .pgsize_bitmap		=
>>> INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
>>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>>  #endif
>>>   
>> Thanks
>>
>> Eric
>>
> 
> [Jacob Pan]
> 



* RE: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-16 10:43       ` Auger Eric
@ 2020-04-17  2:45         ` Tian, Kevin
  2020-04-17  7:46           ` Auger Eric
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2020-04-17  2:45 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan
  Cc: Yi L, Alex Williamson, Raj, Ashok, David Woodhouse, LKML, iommu,
	Jean-Philippe Brucker, Jonathan Cameron

> From: Auger Eric
> Sent: Thursday, April 16, 2020 6:43 PM
> 
[...]
> >>> +	if (svm) {
> >>> +		/*
> >>> +		 * If we found svm for the PASID, there must be at
> >>> +		 * least one device bond, otherwise svm should be
> >>> freed.
> >>> +		 */
> >>> +		if (WARN_ON(list_empty(&svm->devs))) {
> >>> +			ret = -EINVAL;
> >>> +			goto out;
> >>> +		}
> >>> +
> >>> +		for_each_svm_dev(sdev, svm, dev) {
> >>> +			/* In case of multiple sub-devices of the
> >>> same pdev
> >>> +			 * assigned, we should allow multiple bind
> >>> calls with
> >>> +			 * the same PASID and pdev.
> >>> +			 */
> >>> +			sdev->users++;
> >> What if this is not an mdev device. Is it also allowed?
> > Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> > example of normal use case. You can bind the same PCI device (PF or
> > SRIOV VF) more than once to the same PASID. Just need to unbind also.
> 
> I don't get the point of binding a non mdev device several times with
> the same PASID. Do you intend to allow that at userspace level or
> prevent this from happening in VFIO?

I feel it's better to prevent this from happening, otherwise VFIO also
needs to track the bind count and do multiple unbinds at mm_exit.
But it's not necessary to prevent it in VFIO. We can check here
whether aux_domain is valid, and if not, return -EBUSY.
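e.g. something along these lines on top of this patch (sketch only;
the aux-domain test is a placeholder for whatever helper ends up being
exported from intel-iommu.c):

	for_each_svm_dev(sdev, svm, dev) {
		/* multiple binds only make sense for sub-devices
		 * (aux domains); refuse them for plain PCI devices */
		if (!device_uses_aux_domain(dev)) { /* hypothetical */
			ret = -EBUSY;
			goto out;
		}
		sdev->users++;
		goto out;
	}
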

> 
> Besides, the comment is a bit misleading as it gives the impression it
> is only true for mdev and there is no associated check.

Thanks
Kevin


* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-17  2:45         ` Tian, Kevin
@ 2020-04-17  7:46           ` Auger Eric
  2020-04-17 15:28             ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Auger Eric @ 2020-04-17  7:46 UTC (permalink / raw)
  To: Tian, Kevin, Jacob Pan
  Cc: Yi L, Alex Williamson, Raj, Ashok, David Woodhouse, LKML, iommu,
	Jean-Philippe Brucker, Jonathan Cameron

Hi Kevin,
On 4/17/20 4:45 AM, Tian, Kevin wrote:
>> From: Auger Eric
>> Sent: Thursday, April 16, 2020 6:43 PM
>>
> [...]
>>>>> +	if (svm) {
>>>>> +		/*
>>>>> +		 * If we found svm for the PASID, there must be at
>>>>> +		 * least one device bond, otherwise svm should be
>>>>> freed.
>>>>> +		 */
>>>>> +		if (WARN_ON(list_empty(&svm->devs))) {
>>>>> +			ret = -EINVAL;
>>>>> +			goto out;
>>>>> +		}
>>>>> +
>>>>> +		for_each_svm_dev(sdev, svm, dev) {
>>>>> +			/* In case of multiple sub-devices of the
>>>>> same pdev
>>>>> +			 * assigned, we should allow multiple bind
>>>>> calls with
>>>>> +			 * the same PASID and pdev.
>>>>> +			 */
>>>>> +			sdev->users++;
>>>> What if this is not an mdev device. Is it also allowed?
>>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
>>> example of normal use case. You can bind the same PCI device (PF or
>>> SRIOV VF) more than once to the same PASID. Just need to unbind also.
>>
>> I don't get the point of binding a non mdev device several times with
>> the same PASID. Do you intend to allow that at userspace level or
>> prevent this from happening in VFIO?
> 
> I feel it's better to prevent this from happening, otherwise VFIO also
> needs to track the bind count and do multiple unbinds at mm_exit.
> But it's not necessary to prevent it in VFIO. We can check here
> whether aux_domain is valid, and if not, return -EBUSY.
Ah OK. So if we can detect the case here it is even better

Thanks

Eric
> 
>>
>> Besides, the comment is a bit misleading as it gives the impression it
>> is only true for mdev and there is no associated check.
> 
> Thanks
> Kevin
> 



* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-17  7:46           ` Auger Eric
@ 2020-04-17 15:28             ` Jacob Pan
  2020-04-17 23:46               ` Tian, Kevin
  0 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2020-04-17 15:28 UTC (permalink / raw)
  To: Auger Eric
  Cc: Tian, Kevin, Yi L, Alex Williamson, Raj, Ashok, David Woodhouse,
	LKML, iommu, Jean-Philippe Brucker, Jonathan Cameron,
	jacob.jun.pan

On Fri, 17 Apr 2020 09:46:55 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Kevin,
> On 4/17/20 4:45 AM, Tian, Kevin wrote:
> >> From: Auger Eric
> >> Sent: Thursday, April 16, 2020 6:43 PM
> >>  
> > [...]  
> >>>>> +	if (svm) {
> >>>>> +		/*
> >>>>> +		 * If we found svm for the PASID, there must
> >>>>> be at
> >>>>> +		 * least one device bond, otherwise svm should
> >>>>> be freed.
> >>>>> +		 */
> >>>>> +		if (WARN_ON(list_empty(&svm->devs))) {
> >>>>> +			ret = -EINVAL;
> >>>>> +			goto out;
> >>>>> +		}
> >>>>> +
> >>>>> +		for_each_svm_dev(sdev, svm, dev) {
> >>>>> +			/* In case of multiple sub-devices of
> >>>>> the same pdev
> >>>>> +			 * assigned, we should allow multiple
> >>>>> bind calls with
> >>>>> +			 * the same PASID and pdev.
> >>>>> +			 */
> >>>>> +			sdev->users++;  
> >>>> What if this is not an mdev device. Is it also allowed?  
> >>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> >>> example of normal use case. You can bind the same PCI device (PF
> >>> or SRIOV VF) more than once to the same PASID. Just need to
> >>> unbind also.  
> >>
> >> I don't get the point of binding a non mdev device several times
> >> with the same PASID. Do you intend to allow that at userspace
> >> level or prevent this from happening in VFIO?  
> > 
> > I feel it's better to prevent this from happening, otherwise VFIO
> > also needs to track the bind count and do multiple unbinds at
> > mm_exit. But it's not necessary to prevent it in VFIO. We can check
> > here whether aux_domain is valid, and if not, return -EBUSY.  
> Ah OK. So if we can detect the case here it is even better
> 
I don't understand why VFIO cannot track, since it is mdev aware. If we
don't refcount the users, one mdev unbind will result in unbinding all
mdevs under the same pdev. That may not be the right thing to do.

> Thanks
> 
> Eric
> >   
> >>
> >> Besides, the comment is a bit misleading as it gives the
> >> impression it is only true for mdev and there is no associated
> >> check.  
> > 
> > Thanks
> > Kevin
> >   
> 

[Jacob Pan]


* Re: [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function
  2020-04-16 10:59       ` Auger Eric
@ 2020-04-17 16:14         ` Jacob Pan
  0 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-17 16:14 UTC (permalink / raw)
  To: Auger Eric
  Cc: Lu Baolu, iommu, LKML, Joerg Roedel, David Woodhouse,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Alex Williamson, Christoph Hellwig, Jonathan Cameron,
	jacob.jun.pan

On Thu, 16 Apr 2020 12:59:14 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> On 4/10/20 11:56 PM, Jacob Pan wrote:
> > On Thu, 9 Apr 2020 10:50:34 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >>
> >> On 4/3/20 8:42 PM, Jacob Pan wrote:  
> >>> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >>> vIOMMU, we need to provide invalidation support at IOMMU API and
> >>> driver level. This patch adds Intel VT-d specific function to
> >>> implement iommu passdown invalidate API for shared virtual
> >>> address.
> >>>
> >>> The use case is for supporting caching structure invalidation
> >>> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >>> invalidation capability and passes down all descriptors from the
> >>> guest to the physical IOMMU.
> >>>
> >>> The assumption is that guest to host device ID mapping should be
> >>> resolved prior to calling IOMMU driver. Based on the device
> >>> handle, host IOMMU driver can replace certain fields before
> >>> submitting to the invalidation queue.
> >>>
> >>> ---
> >>> v11 - Removed 2D map array, use -EINVAL in granularity lookup
> >>> array. Fixed devTLB invalidation granularity mapping. Disregard
> >>> G=1 case and use address selective invalidation only.
> >>>
> >>> v7 review fixed in v10
> >>> ---
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> >>> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  drivers/iommu/intel-iommu.c | 158
> >>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
> >>> insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/intel-iommu.c
> >>> b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
> >>> 100644 --- a/drivers/iommu/intel-iommu.c
> >>> +++ b/drivers/iommu/intel-iommu.c
> >>> @@ -5594,6 +5594,163 @@ static void
> >>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >>>  
> >>> +/*
> >>> + * 2D array for converting and sanitizing IOMMU generic TLB
> >>> granularity to
> >>> + * VT-d granularity. Invalidation is typically included in the
> >>> unmap operation
> >>> + * as a result of DMA or VFIO unmap. However, for assigned
> >>> devices guest
> >>> + * owns the first level page tables. Invalidations of translation
> >>> caches in the
> >>> + * guest are trapped and passed down to the host.
> >>> + *
> >>> + * vIOMMU in the guest will only expose first level page tables,
> >>> therefore
> >>> + * we do not support IOTLB granularity for request without PASID
> >>> (second level).
> >>> + *
> >>> + * For example, to find the VT-d granularity encoding for IOTLB
> >>> + * type and page selective granularity within PASID:
> >>> + * X: indexed by iommu cache type
> >>> + * Y: indexed by enum iommu_inv_granularity
> >>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >>> + */
> >>> +
> >>> +const static int
> >>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR]
> >>> = {
> >>> +	/*
> >>> +	 * PASID based IOTLB invalidation: PASID selective (per
> >>> PASID),
> >>> +	 * page selective (address granularity)
> >>> +	 */
> >>> +	{-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >>> +	/* PASID based dev TLBs */
> >>> +	{-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> >>> +	/* PASID cache */
> >>> +	{-EINVAL, -EINVAL, -EINVAL}
> >>> +};
> >>> +
> >>> +static inline int to_vtd_granularity(int type, int granu)
> >>> +{
> >>> +	return inv_type_granu_table[type][granu];
> >>> +}
> >>> +
> >>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> >>> +{
> >>> +	u64 nr_pages = (granu_size * nr_granules) >>
> >>> VTD_PAGE_SHIFT; +
> >>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k,
> >>> 9 for 2MB, etc.
> >>> +	 * IOMMU cache invalidate API passes granu_size in bytes,
> >>> and number of
> >>> +	 * granu size in contiguous memory.
> >>> +	 */
> >>> +	return order_base_2(nr_pages);
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +static int intel_iommu_sva_invalidate(struct iommu_domain
> >>> *domain,
> >>> +		struct device *dev, struct
> >>> iommu_cache_invalidate_info *inv_info) +{
> >>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >>> +	struct device_domain_info *info;
> >>> +	struct intel_iommu *iommu;
> >>> +	unsigned long flags;
> >>> +	int cache_type;
> >>> +	u8 bus, devfn;
> >>> +	u16 did, sid;
> >>> +	int ret = 0;
> >>> +	u64 size = 0;
> >>> +
> >>> +	if (!inv_info || !dmar_domain ||
> >>> +		inv_info->version !=
> >>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (!dev || !dev_is_pci(dev))
> >>> +		return -ENODEV;    
> >>
> >> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?  
> > Good point
> >   
> >>> +
> >>> +	iommu = device_to_iommu(dev, &bus, &devfn);
> >>> +	if (!iommu)
> >>> +		return -ENODEV;
> >>> +
> >>> +	spin_lock_irqsave(&device_domain_lock, flags);
> >>> +	spin_lock(&iommu->lock);
> >>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> >>> devfn);
> >>> +	if (!info) {
> >>> +		ret = -EINVAL;
> >>> +		goto out_unlock;
> >>> +	}
> >>> +	did = dmar_domain->iommu_did[iommu->seq_id];
> >>> +	sid = PCI_DEVID(bus, devfn);
> >>> +
> >>> +	/* Size is only valid in non-PASID selective invalidation
> >>> */
> >>> +	if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> >>> +		size =
> >>> to_vtd_size(inv_info->addr_info.granule_size,
> >>> +
> >>> inv_info->addr_info.nb_granules); +
> >>> +	for_each_set_bit(cache_type, (unsigned long
> >>> *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> >>> +		int granu = 0;
> >>> +		u64 pasid = 0;
> >>> +
> >>> +		granu = to_vtd_granularity(cache_type,
> >>> inv_info->granularity);
> >>> +		if (granu == -EINVAL) {
> >>> +			pr_err("Invalid cache type and granu
> >>> combination %d/%d\n", cache_type,
> >>> +				inv_info->granularity);    
> >> rate unlimited traces here and after.  
> still to be fixed
Yes, will do.

> >> ret = -EINVAL?
> >> also related to the discussion we had on the VFIO series. What is
> >> the error policy?  
> > Invalidation should not fail; there is no way to tell the guest.
> > This can be a void function, or VFIO can ignore the return value,
> > similar to what is done for iommu_unmap.  
> OK, it was just a reminder.
It will also be documented in an IOMMU UAPI doc, which I plan to send
out along with the feedback on the UAPI version and sizing patchset.

Could you check whether the following statement is true?

"TLB invalidations should always succeed from vIOMMU's
perspective. There is no architectural way to report back to the vIOMMU if
the UAPI data is not compatible. For this reason the following IOMMU
UAPIs cannot fail:

1. Free PASID
2. Unbind guest PASID
3. Unbind guest PASID table (SMMU)
4. Cache invalidate
5. Page response
"

> > 
> >   
> >>> +			break;
> >>> +		}
> >>> +
> >>> +		/* PASID is stored in different locations based
> >>> on granularity */    
> >> not sure the above comment really is requested  
> > Just to explain the situation where PASID can be in pasid_info or
> > addr_info.
> >   
> >>> +		if (inv_info->granularity ==
> >>> IOMMU_INV_GRANU_PASID &&
> >>> +			inv_info->pasid_info.flags &
> >>> IOMMU_INV_PASID_FLAGS_PASID)
> >>> +			pasid = inv_info->pasid_info.pasid;
> >>> +		else if (inv_info->granularity ==
> >>> IOMMU_INV_GRANU_ADDR &&
> >>> +			inv_info->addr_info.flags &
> >>> IOMMU_INV_ADDR_FLAGS_PASID)
> >>> +			pasid = inv_info->addr_info.pasid;
> >>> +
> >>> +		switch (BIT(cache_type)) {
> >>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> >>> +			if ((inv_info->granularity ==
> >>> IOMMU_INV_GRANU_ADDR) &&
> >>> +				size && (inv_info->addr_info.addr
> >>> & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> >>> +				pr_err("Address out of range,
> >>> 0x%llx, size order %llu\n",
> >>> +					inv_info->addr_info.addr,
> >>> size);
> >>> +				ret = -ERANGE;
> >>> +				goto out_unlock;
> >>> +			}
> >>> +
> >>> +			qi_flush_piotlb(iommu, did,
> >>> +					pasid,
> >>> +
> >>> mm_to_dma_pfn(inv_info->addr_info.addr),    
> >> This does not sound correct to me:
> >> inv_info->addr_info.addr and inv_info->addr_info.flags are not
> >> valid in case of PASID-selective invalidation  
> > If granu is PASID-selective, this address is ignored, since npages
> > = -1.  
> Ah OK. Maybe add a comment, since it is a bit strange to see irrelevant
> fields being passed in the PASID-selective case.
> 
OK, will do.

Thanks!
> Thanks
> 
> Eric
> >   
> >>> +					(granu ==
> >>> QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> >>> +					inv_info->addr_info.flags
> >>> & IOMMU_INV_ADDR_FLAGS_LEAF); +
> >>> +			/*
> >>> +			 * Always flush device IOTLB if ATS is
> >>> enabled. vIOMMU
> >>> +			 * in the guest may assume IOTLB flush is
> >>> inclusive,
> >>> +			 * which is more efficient.
> >>> +			 */
> >>> +			if (info->ats_enabled)
> >>> +				qi_flush_dev_iotlb_pasid(iommu,
> >>> sid, info->pfsid,
> >>> +						pasid,
> >>> info->ats_qdep,
> >>> +
> >>> inv_info->addr_info.addr, size,    
> >> same  
> >>> +						granu);
> >>> +			break;
> >>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> >>> +			if (info->ats_enabled)
> >>> +				qi_flush_dev_iotlb_pasid(iommu,
> >>> sid, info->pfsid,
> >>> +
> >>> inv_info->addr_info.pasid, info->ats_qdep,    
> >> nit: use pasid directly  
> > will do.
> >   
> >>> +
> >>> inv_info->addr_info.addr, size,
> >>> +						granu);
> >>> +			else
> >>> +				pr_warn("Passdown device IOTLB
> >>> flush w/o ATS!\n");
> >>> +			break;
> >>> +		case IOMMU_CACHE_INV_TYPE_PASID:
> >>> +			qi_flush_pasid_cache(iommu, did, granu,
> >>> inv_info->pasid_info.pasid);
> >>> +			break;
> >>> +		default:
> >>> +			dev_err(dev, "Unsupported IOMMU
> >>> invalidation type %d\n",
> >>> +				cache_type);
> >>> +			ret = -EINVAL;
> >>> +		}
> >>> +	}
> >>> +out_unlock:
> >>> +	spin_unlock(&iommu->lock);
> >>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +#endif
> >>> +
> >>>  static int intel_iommu_map(struct iommu_domain *domain,
> >>>  			   unsigned long iova, phys_addr_t hpa,
> >>>  			   size_t size, int iommu_prot, gfp_t
> >>> gfp) @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops
> >>> = { .is_attach_deferred	=
> >>> intel_iommu_is_attach_deferred, .pgsize_bitmap		=
> >>> INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>>  #endif
> >>>     
> >> Thanks
> >>
> >> Eric
> >>  
> > 
> > [Jacob Pan]
> >   
> 

[Jacob Pan]


* RE: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-17 15:28             ` Jacob Pan
@ 2020-04-17 23:46               ` Tian, Kevin
  2020-04-27 17:28                 ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2020-04-17 23:46 UTC (permalink / raw)
  To: Jacob Pan, Auger Eric
  Cc: Yi L, Alex Williamson, Raj, Ashok, David Woodhouse, LKML, iommu,
	Jean-Philippe Brucker, Jonathan Cameron

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Friday, April 17, 2020 11:29 PM
> 
> On Fri, 17 Apr 2020 09:46:55 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
> > Hi Kevin,
> > On 4/17/20 4:45 AM, Tian, Kevin wrote:
> > >> From: Auger Eric
> > >> Sent: Thursday, April 16, 2020 6:43 PM
> > >>
> > > [...]
> > >>>>> +	if (svm) {
> > >>>>> +		/*
> > >>>>> +		 * If we found svm for the PASID, there must
> > >>>>> be at
> > >>>>> +		 * least one device bond, otherwise svm should
> > >>>>> be freed.
> > >>>>> +		 */
> > >>>>> +		if (WARN_ON(list_empty(&svm->devs))) {
> > >>>>> +			ret = -EINVAL;
> > >>>>> +			goto out;
> > >>>>> +		}
> > >>>>> +
> > >>>>> +		for_each_svm_dev(sdev, svm, dev) {
> > >>>>> +			/* In case of multiple sub-devices of
> > >>>>> the same pdev
> > >>>>> +			 * assigned, we should allow multiple
> > >>>>> bind calls with
> > >>>>> +			 * the same PASID and pdev.
> > >>>>> +			 */
> > >>>>> +			sdev->users++;
> > >>>> What if this is not an mdev device. Is it also allowed?
> > >>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> > >>> example of normal use case. You can bind the same PCI device (PF
> > >>> or SRIOV VF) more than once to the same PASID. Just need to
> > >>> unbind also.
> > >>
> > >> I don't get the point of binding a non mdev device several times
> > >> with the same PASID. Do you intend to allow that at userspace
> > >> level or prevent this from happening in VFIO?
> > >
> > > I feel it's better to prevent this from happening, otherwise VFIO
> > > also needs to track the bind count and do multiple unbinds at
> > > mm_exit. But it's not necessary to prevent it in VFIO. We can check
> > > here whether aux_domain is valid, and if not, return -EBUSY.
> > Ah OK. So if we can detect the case here it is even better
> >
> I don't understand why VFIO cannot track, since it is mdev aware. If we
> don't refcount the users, one mdev unbind will result in unbinding all
> mdevs under the same pdev. That may not be the right thing to do.
> 

The open question here is not about mdev, for which the refcount is
still required. Eric's point is about non-mdev endpoints. It's
meaningless and not intuitive to allow binding a PASID multiple times
to the same device.

Thanks
Kevin


* Re: [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support
  2020-04-17 23:46               ` Tian, Kevin
@ 2020-04-27 17:28                 ` Jacob Pan
  0 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2020-04-27 17:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Auger Eric, Yi L, Alex Williamson, Raj, Ashok, David Woodhouse,
	LKML, iommu, Jean-Philippe Brucker, Jonathan Cameron,
	jacob.jun.pan

On Fri, 17 Apr 2020 23:46:13 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Friday, April 17, 2020 11:29 PM
> > 
> > On Fri, 17 Apr 2020 09:46:55 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> > > Hi Kevin,
> > > On 4/17/20 4:45 AM, Tian, Kevin wrote:  
> > > >> From: Auger Eric
> > > >> Sent: Thursday, April 16, 2020 6:43 PM
> > > >>  
> > > > [...]  
> > > >>>>> +	if (svm) {
> > > >>>>> +		/*
> > > >>>>> +		 * If we found svm for the PASID, there
> > > >>>>> must be at
> > > >>>>> +		 * least one device bond, otherwise svm
> > > >>>>> should be freed.
> > > >>>>> +		 */
> > > >>>>> +		if (WARN_ON(list_empty(&svm->devs))) {
> > > >>>>> +			ret = -EINVAL;
> > > >>>>> +			goto out;
> > > >>>>> +		}
> > > >>>>> +
> > > >>>>> +		for_each_svm_dev(sdev, svm, dev) {
> > > >>>>> +			/* In case of multiple sub-devices
> > > >>>>> of the same pdev
> > > >>>>> +			 * assigned, we should allow
> > > >>>>> multiple bind calls with
> > > >>>>> +			 * the same PASID and pdev.
> > > >>>>> +			 */
> > > >>>>> +			sdev->users++;  
> > > >>>> What if this is not an mdev device. Is it also allowed?  
> > > >>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is
> > > >>> just an example of normal use case. You can bind the same PCI
> > > >>> device (PF or SRIOV VF) more than once to the same PASID.
> > > >>> Just need to unbind also.  
> > > >>
> > > >> I don't get the point of binding a non mdev device several
> > > >> times with the same PASID. Do you intend to allow that at
> > > >> userspace level or prevent this from happening in VFIO?  
> > > >
> > > > I feel it's better to prevent this from happening, otherwise
> > > > VFIO also needs to track the bind count and do multiple unbinds
> > > > at mm_exit. But it's not necessary to prevent it in VFIO. We
> > > > can check here whether aux_domain is valid, and if not,
> > > > return -EBUSY.  
> > > Ah OK. So if we can detect the case here it is even better
> > >  
> > I don't understand why VFIO cannot track, since it is mdev aware.
> > If we don't refcount the users, one mdev unbind will result in
> > unbinding all mdevs under the same pdev. That may not be the right
> > thing to do. 
> 
> The open question here is not about mdev, for which the refcount is
> still required. Eric's point is about non-mdev endpoints. It's
> meaningless and not intuitive to allow binding a PASID multiple times
> to the same device. 
> 
That seems contradictory. The refcount here is intended/required for
sub-devices such as mdev. Since the IOMMU driver is not mdev aware, we
cannot treat devices differently.

Perhaps Yi can clarify whether this case is handled within VFIO; then I
can drop the refcount here.

> Thanks
> Kevin

[Jacob Pan]


end of thread

Thread overview: 33+ messages
2020-04-03 18:42 [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Jacob Pan
2020-04-03 18:42 ` [PATCH v11 01/10] iommu/vt-d: Move domain helper to header Jacob Pan
2020-04-03 18:42 ` [PATCH v11 02/10] iommu/uapi: Define a mask for bind data Jacob Pan
2020-04-08 13:07   ` Joerg Roedel
2020-04-08 16:11     ` Jacob Pan
2020-04-03 18:42 ` [PATCH v11 03/10] iommu/vt-d: Use a helper function to skip agaw for SL Jacob Pan
2020-04-05  6:37   ` Auger Eric
2020-04-03 18:42 ` [PATCH v11 04/10] iommu/vt-d: Add nested translation helper function Jacob Pan
2020-04-08 13:33   ` Auger Eric
2020-04-03 18:42 ` [PATCH v11 05/10] iommu/vt-d: Add bind guest PASID support Jacob Pan
2020-04-09  7:41   ` Auger Eric
2020-04-10 19:45     ` Jacob Pan
2020-04-16 10:43       ` Auger Eric
2020-04-17  2:45         ` Tian, Kevin
2020-04-17  7:46           ` Auger Eric
2020-04-17 15:28             ` Jacob Pan
2020-04-17 23:46               ` Tian, Kevin
2020-04-27 17:28                 ` Jacob Pan
2020-04-10 21:06     ` Jacob Pan
2020-04-16 10:46       ` Auger Eric
2020-04-03 18:42 ` [PATCH v11 06/10] iommu/vt-d: Support flushing more translation cache types Jacob Pan
2020-04-05  6:37   ` Auger Eric
2020-04-03 18:42 ` [PATCH v11 07/10] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
2020-04-09  8:50   ` Auger Eric
2020-04-10 21:56     ` Jacob Pan
2020-04-16 10:59       ` Auger Eric
2020-04-17 16:14         ` Jacob Pan
2020-04-03 18:42 ` [PATCH v11 08/10] iommu/vt-d: Cache virtual command capability register Jacob Pan
2020-04-09 10:14   ` Auger Eric
2020-04-03 18:42 ` [PATCH v11 09/10] iommu/vt-d: Enlightened PASID allocation Jacob Pan
2020-04-03 18:42 ` [PATCH v11 10/10] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
2020-04-09 10:31   ` Auger Eric
2020-04-05  8:28 ` [PATCH v11 00/10] Nested Shared Virtual Address (SVA) VT-d support Lu Baolu
