KVM Archive on lore.kernel.org
* [PATCH v2 0/8] Use 1st-level for DMA remapping
@ 2019-11-28  2:25 Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 1/8] iommu/vt-d: Add per domain page table ops Lu Baolu
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu

Intel VT-d in scalable mode supports two types of page tables
for DMA translation: the first level page table and the second
level page table. The first level page table uses the same
format as the CPU page table, while the second level page table
remains compatible with previous formats. The software is able
to choose either of them for DMA remapping according to the use
case.

This patchset aims to move IOVA (I/O Virtual Address) translation
to the 1st-level page table in scalable mode. This will simplify
vIOMMU (IOMMU simulated by the VM hypervisor) design by using
two-stage translation, a.k.a. nested mode translation.

As the Intel VT-d architecture offers caching mode, guest IOVA (GIOVA)
support is currently implemented with shadow page tables. The device
simulation software, like QEMU, has to figure out GIOVA->GPA mappings
and write them to a shadowed page table, which will be used by the
physical IOMMU. Each time mappings are created or destroyed in the
vIOMMU, the simulation software has to intervene so that the changes
to GIOVA->GPA can be shadowed to the host.


     .-----------.
     |  vIOMMU   |
     |-----------|                 .--------------------.
     |           |IOTLB flush trap |        QEMU        |
     .-----------. (map/unmap)     |--------------------|
     |GIOVA->GPA |---------------->|    .------------.  |
     '-----------'                 |    | GIOVA->HPA |  |
     |           |                 |    '------------'  |
     '-----------'                 |                    |
                                   |                    |
                                   '--------------------'
                                                |
            <------------------------------------
            |
            v VFIO/IOMMU API
      .-----------.
      |  pIOMMU   |
      |-----------|
      |           |
      .-----------.
      |GIOVA->HPA |
      '-----------'
      |           |
      '-----------'
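
The shadowing flow in the diagram above can be sketched in user space as
follows. This is a toy model only, not code from this series: every
toy_* name is hypothetical, addresses are reduced to page frame numbers,
and the page tables are flat arrays. The point it illustrates is that
both map and unmap in the guest must trap so that the composed
GIOVA->HPA entry can be written or removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGES   16          /* toy address space: 16 page frames */
#define TOY_INVALID UINT64_MAX  /* marks a non-present entry */

/* Hypothetical model of the state kept by the simulation software. */
struct toy_shadow {
	uint64_t gpa_to_hpa[TOY_PAGES]; /* host-owned GPA->HPA, indexed by pfn */
	uint64_t shadow[TOY_PAGES];     /* GIOVA->HPA, walked by the pIOMMU */
};

static void toy_shadow_init(struct toy_shadow *s)
{
	assert(s != NULL);
	for (size_t i = 0; i < TOY_PAGES; i++) {
		s->gpa_to_hpa[i] = TOY_INVALID;
		s->shadow[i] = TOY_INVALID;
	}
}

/* Called on a trapped guest map: compose GIOVA->GPA with GPA->HPA. */
static int toy_shadow_map(struct toy_shadow *s, uint64_t giova_pfn,
			  uint64_t gpa_pfn)
{
	if (giova_pfn >= TOY_PAGES || gpa_pfn >= TOY_PAGES)
		return -1;
	if (s->gpa_to_hpa[gpa_pfn] == TOY_INVALID)
		return -1;      /* guest page is not backed by host memory */
	s->shadow[giova_pfn] = s->gpa_to_hpa[gpa_pfn];
	return 0;
}

/* Called on a trapped guest unmap: drop the shadowed entry. */
static void toy_shadow_unmap(struct toy_shadow *s, uint64_t giova_pfn)
{
	if (giova_pfn < TOY_PAGES)
		s->shadow[giova_pfn] = TOY_INVALID;
}
```

A caller would populate gpa_to_hpa[] once for the guest's memory and then
invoke toy_shadow_map()/toy_shadow_unmap() from the trap path.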

VT-d 3.0 introduces scalable mode, which offers two levels of
translation page tables and a nested translation mode. With regard to
GIOVA support, it can be simplified by 1) moving the GIOVA support
over to the 1st-level page table to store the GIOVA->GPA mapping in the
vIOMMU, 2) binding the vIOMMU 1st-level page table to the pIOMMU,
3) using the pIOMMU second level for GPA->HPA translation, and
4) enabling nested (a.k.a. dual-stage) translation in the host.
Compared with the current shadow GIOVA support, the new approach makes
the vIOMMU design simpler and more efficient, as we only need to flush
the pIOMMU IOTLB and possibly the device-IOTLB when an IOVA mapping in
the vIOMMU is torn down.

     .-----------.
     |  vIOMMU   |
     |-----------|                 .-----------.
     |           |IOTLB flush trap |   QEMU    |
     .-----------.    (unmap)      |-----------|
     |GIOVA->GPA |---------------->|           |
     '-----------'                 '-----------'
     |           |                       |
     '-----------'                       |
           <------------------------------
           |      VFIO/IOMMU          
           |  cache invalidation and  
           | guest gpd bind interfaces
           v
     .-----------.
     |  pIOMMU   |
     |-----------|
     .-----------.
     |GIOVA->GPA |<---First level
     '-----------'
     | GPA->HPA  |<---Second level
     '-----------'
     '-----------'
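
The nested flow in the diagram above can be sketched in the same toy
style. Again this is only an illustration under stated assumptions: the
toy_* names are hypothetical, the tables are flat arrays of page frame
numbers, and the IOTLB is modelled as one cached result per GIOVA page.
It shows why a guest map needs no trap (the hardware walks the guest's
first level directly) while an unmap still requires a cache
invalidation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGES   16          /* toy address space: 16 page frames */
#define TOY_INVALID UINT64_MAX  /* marks a non-present entry */

struct toy_nested {
	uint64_t first[TOY_PAGES];  /* GIOVA->GPA, written by the guest */
	uint64_t second[TOY_PAGES]; /* GPA->HPA, written by the host */
	uint64_t iotlb[TOY_PAGES];  /* cached GIOVA->HPA results */
};

static void toy_nested_init(struct toy_nested *n)
{
	assert(n != NULL);
	for (size_t i = 0; i < TOY_PAGES; i++) {
		n->first[i] = TOY_INVALID;
		n->second[i] = TOY_INVALID;
		n->iotlb[i] = TOY_INVALID;
	}
}

/* Model of a DMA access: IOTLB hit, else walk first then second level. */
static uint64_t toy_nested_translate(struct toy_nested *n, uint64_t giova_pfn)
{
	uint64_t gpa, hpa;

	if (giova_pfn >= TOY_PAGES)
		return TOY_INVALID;
	if (n->iotlb[giova_pfn] != TOY_INVALID)
		return n->iotlb[giova_pfn];

	gpa = n->first[giova_pfn];
	if (gpa == TOY_INVALID || gpa >= TOY_PAGES)
		return TOY_INVALID;

	hpa = n->second[gpa];
	if (hpa != TOY_INVALID)
		n->iotlb[giova_pfn] = hpa;  /* cache the combined result */
	return hpa;
}

/* Model of the invalidation forwarded to the host on a guest unmap. */
static void toy_nested_flush(struct toy_nested *n, uint64_t giova_pfn)
{
	if (giova_pfn < TOY_PAGES)
		n->iotlb[giova_pfn] = TOY_INVALID;
}
```

In this model, clearing first[] without calling toy_nested_flush() leaves
a stale translation visible, which corresponds to the unmap-only trap in
the diagram.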

This patch set includes two parts. The former part implements the
per-domain page table abstraction, which makes the page table
difference transparent to various map/unmap APIs. The latter part
applies the first level page table for IOVA translation unless the
DOMAIN_ATTR_NESTING domain attribute has been set, which indicates
that nested mode is in use.
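
The two parts described above can be sketched as a small user-space
model. It is not the driver code: the toy_* names, the scalable_mode and
nesting flags, and the counters are all hypothetical, and the selection
rule is only a reading of the paragraph above (first level in scalable
mode unless nesting is requested, second level otherwise). Callers go
through the ops pointer and never need to know which page table format
backs the domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_domain;

/* Hypothetical, reduced counterpart of a per-domain ops table. */
struct toy_pgtable_ops {
	const char *name;
	int (*map_range)(struct toy_domain *d, unsigned long iova,
			 uint64_t paddr, size_t size);
};

struct toy_domain {
	const struct toy_pgtable_ops *ops;
	unsigned long first_lvl_maps;   /* calls routed to first level */
	unsigned long second_lvl_maps;  /* calls routed to second level */
};

static int toy_first_lvl_map(struct toy_domain *d, unsigned long iova,
			     uint64_t paddr, size_t size)
{
	(void)iova; (void)paddr; (void)size;
	d->first_lvl_maps++;            /* stand-in for a real table walk */
	return 0;
}

static int toy_second_lvl_map(struct toy_domain *d, unsigned long iova,
			      uint64_t paddr, size_t size)
{
	(void)iova; (void)paddr; (void)size;
	d->second_lvl_maps++;           /* stand-in for a real table walk */
	return 0;
}

static const struct toy_pgtable_ops toy_first_lvl_ops = {
	.name		= "first-level",
	.map_range	= toy_first_lvl_map,
};

static const struct toy_pgtable_ops toy_second_lvl_ops = {
	.name		= "second-level",
	.map_range	= toy_second_lvl_map,
};

/* Assumed rule: first level only in scalable mode and without nesting. */
static void toy_domain_init(struct toy_domain *d, bool scalable_mode,
			    bool nesting)
{
	assert(d != NULL);
	d->first_lvl_maps = 0;
	d->second_lvl_maps = 0;
	d->ops = (scalable_mode && !nesting) ? &toy_first_lvl_ops
					     : &toy_second_lvl_ops;
}

/* Generic map path: format-agnostic, dispatches through the ops. */
static int toy_domain_map(struct toy_domain *d, unsigned long iova,
			  uint64_t paddr, size_t size)
{
	return d->ops->map_range(d, iova, paddr, size);
}
```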

Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
Based-on-idea-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>

Change log:

 v1->v2
 - The first series was posted here
   https://lkml.org/lkml/2019/9/23/297
 - Use per domain page table ops to handle different page tables.
 - Use first level for DMA remapping by default on both bare metal
   and VM guest.
 - Refine the code according to review comments for v1.

Lu Baolu (8):
  iommu/vt-d: Add per domain page table ops
  iommu/vt-d: Move domain_flush_cache helper into header
  iommu/vt-d: Implement second level page table ops
  iommu/vt-d: Apply per domain second level page table ops
  iommu/vt-d: Add first level page table interfaces
  iommu/vt-d: Implement first level page table ops
  iommu/vt-d: Identify domains using first level page table
  iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr

 drivers/iommu/Makefile             |   2 +-
 drivers/iommu/intel-iommu.c        | 412 +++++++++++++++++++++++------
 drivers/iommu/intel-pgtable.c      | 376 ++++++++++++++++++++++++++
 include/linux/intel-iommu.h        |  64 ++++-
 include/trace/events/intel_iommu.h |  60 +++++
 5 files changed, 837 insertions(+), 77 deletions(-)
 create mode 100644 drivers/iommu/intel-pgtable.c

-- 
2.17.1



* [PATCH v2 1/8] iommu/vt-d: Add per domain page table ops
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 2/8] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu

Intel VT-d in scalable mode supports two types of
page tables for DMA translation: the first level page
table and the second level page table. The IOMMU driver
is able to choose one of them for DMA remapping according
to the use case. The first level page table uses the same
format as the CPU page table, while the second level page
table remains compatible with previous formats.

This abstracts the page tables used in Intel IOMMU driver
by defining a per domain page table ops structure which
contains callbacks for various page table operations.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/intel-iommu.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 326146a36dbf..e8bfe7466ebb 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -499,6 +499,28 @@ struct context_entry {
 	u64 hi;
 };
 
+struct dmar_domain;
+
+/*
+ * struct pgtable_ops - page table ops
+ * @map_range: map a physically contiguous memory region to iova
+ * @unmap_range: unmap a physically contiguous memory region
+ * @iova_to_phys: return the physical address mapped to @iova
+ * @flush_tlb_range: flush the tlb caches as the result of map or unmap
+ */
+struct pgtable_ops {
+	int (*map_range)(struct dmar_domain *domain,
+			 unsigned long iova, phys_addr_t paddr,
+			 size_t size, int prot);
+	struct page *(*unmap_range)(struct dmar_domain *domain,
+				    unsigned long iova, size_t size);
+	phys_addr_t (*iova_to_phys)(struct dmar_domain *domain,
+				    unsigned long iova);
+	void (*flush_tlb_range)(struct dmar_domain *domain,
+				struct intel_iommu *iommu,
+				unsigned long iova, size_t size, bool ih);
+};
+
 struct dmar_domain {
 	int	nid;			/* node id */
 
@@ -517,8 +539,10 @@ struct dmar_domain {
 	struct list_head auxd;		/* link to device's auxiliary list */
 	struct iova_domain iovad;	/* iova's that belong to this domain */
 
+	/* page table used by this domain */
 	struct dma_pte	*pgd;		/* virtual address */
 	int		gaw;		/* max guest address width */
+	const struct pgtable_ops *ops;	/* page table ops */
 
 	/* adjusted guest address width, 0 is level 2 30-bit */
 	int		agaw;
-- 
2.17.1



* [PATCH v2 2/8] iommu/vt-d: Move domain_flush_cache helper into header
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 1/8] iommu/vt-d: Add per domain page table ops Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 3/8] iommu/vt-d: Implement second level page table ops Lu Baolu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu

Move the domain_flush_cache() helper into the header so that it
can be used in other source files as well.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 7 -------
 include/linux/intel-iommu.h | 7 +++++++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 677e82a828f0..7752ff299cb5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -833,13 +833,6 @@ static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devf
 	return iommu;
 }
 
-static void domain_flush_cache(struct dmar_domain *domain,
-			       void *addr, int size)
-{
-	if (!domain->iommu_coherency)
-		clflush_cache_range(addr, size);
-}
-
 static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
 	struct context_entry *context;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index e8bfe7466ebb..9b259756057b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -681,6 +681,13 @@ static inline int first_pte_in_page(struct dma_pte *pte)
 	return !((unsigned long)pte & ~VTD_PAGE_MASK);
 }
 
+static inline void
+domain_flush_cache(struct dmar_domain *domain, void *addr, int size)
+{
+	if (!domain->iommu_coherency)
+		clflush_cache_range(addr, size);
+}
+
 extern struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev);
 extern int dmar_find_matched_atsr_unit(struct pci_dev *dev);
 
-- 
2.17.1



* [PATCH v2 3/8] iommu/vt-d: Implement second level page table ops
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 1/8] iommu/vt-d: Add per domain page table ops Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 2/8] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 4/8] iommu/vt-d: Apply per domain " Lu Baolu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu, Yi Sun

This adds the implementation of page table callbacks for
the second level page table.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 81 +++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 7752ff299cb5..96ead4e3395a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -413,6 +413,7 @@ int for_each_device_domain(int (*fn)(struct device_domain_info *info,
 }
 
 const struct iommu_ops intel_iommu_ops;
+static const struct pgtable_ops second_lvl_pgtable_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
 {
@@ -1720,6 +1721,7 @@ static struct dmar_domain *alloc_domain(int flags)
 	domain->nid = NUMA_NO_NODE;
 	domain->flags = flags;
 	domain->has_iotlb_device = false;
+	domain->ops = &second_lvl_pgtable_ops;
 	INIT_LIST_HEAD(&domain->devices);
 
 	return domain;
@@ -2334,6 +2336,85 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	return 0;
 }
 
+static int second_lvl_domain_map_range(struct dmar_domain *domain,
+				       unsigned long iova, phys_addr_t paddr,
+				       size_t size, int prot)
+{
+	return __domain_mapping(domain, iova >> VTD_PAGE_SHIFT, NULL,
+				paddr >> VTD_PAGE_SHIFT,
+				aligned_nrpages(paddr, size), prot);
+}
+
+static struct page *
+second_lvl_domain_unmap_range(struct dmar_domain *domain,
+			      unsigned long iova, size_t size)
+{
+	unsigned long start_pfn, end_pfn, nrpages;
+
+	start_pfn = mm_to_dma_pfn(IOVA_PFN(iova));
+	nrpages = aligned_nrpages(iova, size);
+	end_pfn = start_pfn + nrpages - 1;
+
+	return dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
+				   domain->pgd, 0, start_pfn, end_pfn, NULL);
+}
+
+static phys_addr_t
+second_lvl_domain_iova_to_phys(struct dmar_domain *domain,
+			       unsigned long iova)
+{
+	struct dma_pte *pte;
+	int level = 0;
+	u64 phys = 0;
+
+	pte = pfn_to_dma_pte(domain, iova >> VTD_PAGE_SHIFT, &level);
+	if (pte)
+		phys = dma_pte_addr(pte);
+
+	return phys;
+}
+
+static void
+second_lvl_domain_flush_tlb_range(struct dmar_domain *domain,
+				  struct intel_iommu *iommu,
+				  unsigned long addr, size_t size,
+				  bool ih)
+{
+	unsigned long pages = aligned_nrpages(addr, size);
+	u16 did = domain->iommu_did[iommu->seq_id];
+	unsigned int mask;
+
+	if (pages) {
+		mask = ilog2(__roundup_pow_of_two(pages));
+		addr &= (u64)-1 << (VTD_PAGE_SHIFT + mask);
+	} else {
+		mask = MAX_AGAW_PFN_WIDTH;
+		addr = 0;
+	}
+
+	/*
+	 * Fallback to domain selective flush if no PSI support or the size is
+	 * too big.
+	 * PSI requires page size to be 2 ^ x, and the base address is naturally
+	 * aligned to the size
+	 */
+	if (!pages || !cap_pgsel_inv(iommu->cap) ||
+	    mask > cap_max_amask_val(iommu->cap))
+		iommu->flush.iotlb_inv(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+	else
+		iommu->flush.iotlb_inv(iommu, did, addr | ((int)ih << 6),
+				       mask, DMA_TLB_PSI_FLUSH);
+
+	iommu_flush_dev_iotlb(domain, addr, mask);
+}
+
+static const struct pgtable_ops second_lvl_pgtable_ops = {
+	.map_range		= second_lvl_domain_map_range,
+	.unmap_range		= second_lvl_domain_unmap_range,
+	.iova_to_phys		= second_lvl_domain_iova_to_phys,
+	.flush_tlb_range	= second_lvl_domain_flush_tlb_range,
+};
+
 static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			  struct scatterlist *sg, unsigned long phys_pfn,
 			  unsigned long nr_pages, int prot)
-- 
2.17.1
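
The page-selective invalidation arithmetic used by the flush callback in
the patch above (an address mask derived from the page count, and a base
address aligned down to that mask) can be worked through in user space.
This is a sketch only: the toy_* helpers are hypothetical stand-ins for
ilog2(__roundup_pow_of_two(pages)) and the address masking, and 4KiB
pages are assumed. The last helper checks plain arithmetic: whether the
naturally aligned flush window still covers the requested range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VTD_PAGE_SHIFT 12   /* assumes 4KiB VT-d pages */

/* Smallest mask such that 2^mask pages are enough for 'pages' pages. */
static unsigned int toy_psi_mask(uint64_t pages)
{
	unsigned int mask = 0;

	assert(pages > 0);
	while (mask < 63 && ((uint64_t)1 << mask) < pages)
		mask++;
	return mask;
}

/* Align the base address down to the size implied by 'mask'. */
static uint64_t toy_psi_base(uint64_t addr, unsigned int mask)
{
	assert(mask < 64 - TOY_VTD_PAGE_SHIFT);
	return addr & (~(uint64_t)0 << (TOY_VTD_PAGE_SHIFT + mask));
}

/* Does the aligned 2^mask-page window cover [addr, addr + pages)? */
static bool toy_psi_covers(uint64_t addr, uint64_t pages)
{
	unsigned int mask = toy_psi_mask(pages);
	uint64_t base = toy_psi_base(addr, mask);
	uint64_t flush_end = base + ((uint64_t)1 << (TOY_VTD_PAGE_SHIFT + mask));
	uint64_t range_end = addr + (pages << TOY_VTD_PAGE_SHIFT);

	return flush_end >= range_end;
}
```

For example, 3 pages give mask 2 (a 4-page window), and address 0x5000
with mask 2 is aligned down to 0x4000. A 2-page range starting at 0x3000
is aligned down to 0x2000 and the resulting window ends at 0x4000, so in
this model a range that straddles an alignment boundary is not fully
covered by a window sized from the page count alone.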



* [PATCH v2 4/8] iommu/vt-d: Apply per domain second level page table ops
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (2 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 3/8] iommu/vt-d: Implement second level page table ops Lu Baolu
@ 2019-11-28  2:25 ` " Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces Lu Baolu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu

This applies per domain page table ops to various domain
mapping and unmapping interfaces.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 118 ++++++++++++++++--------------------
 1 file changed, 52 insertions(+), 66 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 96ead4e3395a..66f76f6df2c2 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -80,6 +80,7 @@
 #define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
+#define PFN_ADDR(pfn)		((pfn) << PAGE_SHIFT)
 
 /* page table handling */
 #define LEVEL_STRIDE		(9)
@@ -1153,8 +1154,8 @@ static struct page *domain_unmap(struct dmar_domain *domain,
 	BUG_ON(start_pfn > last_pfn);
 
 	/* we don't need lock here; nobody else touches the iova range */
-	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-				       domain->pgd, 0, start_pfn, last_pfn, NULL);
+	freelist = domain->ops->unmap_range(domain, PFN_ADDR(start_pfn),
+					    PFN_ADDR(last_pfn - start_pfn + 1));
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -1484,39 +1485,6 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
 
-static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
-				  struct dmar_domain *domain,
-				  unsigned long pfn, unsigned int pages,
-				  int ih, int map)
-{
-	unsigned int mask = ilog2(__roundup_pow_of_two(pages));
-	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
-	u16 did = domain->iommu_did[iommu->seq_id];
-
-	BUG_ON(pages == 0);
-
-	if (ih)
-		ih = 1 << 6;
-	/*
-	 * Fallback to domain selective flush if no PSI support or the size is
-	 * too big.
-	 * PSI requires page size to be 2 ^ x, and the base address is naturally
-	 * aligned to the size
-	 */
-	if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
-		iommu->flush.iotlb_inv(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
-	else
-		iommu->flush.iotlb_inv(iommu, did, addr | ih,
-				       mask, DMA_TLB_PSI_FLUSH);
-
-	/*
-	 * In caching mode, changes of pages from non-present to present require
-	 * flush. However, device IOTLB doesn't need to be flushed in this case.
-	 */
-	if (!cap_caching_mode(iommu->cap) || !map)
-		iommu_flush_dev_iotlb(domain, addr, mask);
-}
-
 /* Notification for newly created mappings */
 static inline void __mapping_notify_one(struct intel_iommu *iommu,
 					struct dmar_domain *domain,
@@ -1524,7 +1492,8 @@ static inline void __mapping_notify_one(struct intel_iommu *iommu,
 {
 	/* It's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
-		iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1);
+		domain->ops->flush_tlb_range(domain, iommu, PFN_ADDR(pfn),
+					     PFN_ADDR(pages), 0);
 	else
 		iommu_flush_write_buffer(iommu);
 }
@@ -1536,16 +1505,8 @@ static void iommu_flush_iova(struct iova_domain *iovad)
 
 	domain = container_of(iovad, struct dmar_domain, iovad);
 
-	for_each_domain_iommu(idx, domain) {
-		struct intel_iommu *iommu = g_iommus[idx];
-		u16 did = domain->iommu_did[iommu->seq_id];
-
-		iommu->flush.iotlb_inv(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
-
-		if (!cap_caching_mode(iommu->cap))
-			iommu_flush_dev_iotlb(get_iommu_domain(iommu, did),
-					      0, MAX_AGAW_PFN_WIDTH);
-	}
+	for_each_domain_iommu(idx, domain)
+		domain->ops->flush_tlb_range(domain, g_iommus[idx], 0, 0, 0);
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2419,13 +2380,43 @@ static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			  struct scatterlist *sg, unsigned long phys_pfn,
 			  unsigned long nr_pages, int prot)
 {
-	int iommu_id, ret;
 	struct intel_iommu *iommu;
+	int iommu_id, ret;
 
 	/* Do the real mapping first */
-	ret = __domain_mapping(domain, iov_pfn, sg, phys_pfn, nr_pages, prot);
-	if (ret)
-		return ret;
+	if (!sg) {
+		ret = domain->ops->map_range(domain, PFN_ADDR(iov_pfn),
+					     PFN_ADDR(phys_pfn),
+					     PFN_ADDR(nr_pages),
+					     prot);
+		if (ret)
+			return ret;
+	} else {
+		unsigned long pgoff, pgs;
+		unsigned long start = iov_pfn, total = nr_pages;
+
+		while (total && sg) {
+			pgoff = sg->offset & ~PAGE_MASK;
+			pgs = aligned_nrpages(sg->offset, sg->length);
+
+			ret = domain->ops->map_range(domain, PFN_ADDR(start),
+						     sg_phys(sg) - pgoff,
+						     PFN_ADDR(pgs), prot);
+			if (ret) {
+				domain->ops->unmap_range(domain,
+							 PFN_ADDR(iov_pfn),
+							 PFN_ADDR(nr_pages));
+				return ret;
+			}
+
+			sg->dma_address = ((dma_addr_t)start << VTD_PAGE_SHIFT) + pgoff;
+			sg->dma_length = sg->length;
+
+			total -= pgs;
+			start += pgs;
+			sg = sg_next(sg);
+		}
+	}
 
 	for_each_domain_iommu(iommu_id, domain) {
 		iommu = g_iommus[iommu_id];
@@ -3837,8 +3828,8 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 	freelist = domain_unmap(domain, start_pfn, last_pfn);
 	if (intel_iommu_strict || (pdev && pdev->untrusted) ||
 			!has_iova_flush_queue(&domain->iovad)) {
-		iommu_flush_iotlb_psi(iommu, domain, start_pfn,
-				      nrpages, !freelist, 0);
+		domain->ops->flush_tlb_range(domain, iommu, dev_addr,
+					     size, !freelist);
 		/* free iova */
 		free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(nrpages));
 		dma_free_pagelist(freelist);
@@ -4927,9 +4918,9 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
-				iommu_flush_iotlb_psi(iommu, si_domain,
-					iova->pfn_lo, iova_size(iova),
-					!freelist, 0);
+				si_domain->ops->flush_tlb_range(si_domain,
+					iommu, PFN_ADDR(iova->pfn_lo),
+					PFN_ADDR(iova_size(iova)), !freelist);
 			rcu_read_unlock();
 			dma_free_pagelist(freelist);
 
@@ -5732,8 +5723,9 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	npages = last_pfn - start_pfn + 1;
 
 	for_each_domain_iommu(iommu_id, dmar_domain)
-		iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain,
-				      start_pfn, npages, !freelist, 0);
+		dmar_domain->ops->flush_tlb_range(dmar_domain,
+						  g_iommus[iommu_id],
+						  iova, size, !freelist);
 
 	dma_free_pagelist(freelist);
 
@@ -5747,18 +5739,12 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 					    dma_addr_t iova)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-	struct dma_pte *pte;
-	int level = 0;
-	u64 phys = 0;
 
-	if (dmar_domain->flags & DOMAIN_FLAG_LOSE_CHILDREN)
+	if ((dmar_domain->flags & DOMAIN_FLAG_LOSE_CHILDREN) ||
+	    !dmar_domain->ops->iova_to_phys)
 		return 0;
 
-	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
-	if (pte)
-		phys = dma_pte_addr(pte);
-
-	return phys;
+	return dmar_domain->ops->iova_to_phys(dmar_domain, iova);
 }
 
 static inline bool scalable_mode_support(void)
-- 
2.17.1
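
The scatterlist loop in the patch above places each segment at the next
free IOVA page and keeps the segment's in-page offset in the resulting
DMA address. The arithmetic can be sketched in user space as below. This
is an illustration under assumptions, not driver code: struct toy_seg
and the toy_* helpers are hypothetical, each segment is described by the
physical address of its first byte, and 4KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4KiB pages */
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)

/* Hypothetical, reduced stand-in for a scatterlist entry. */
struct toy_seg {
	uint64_t phys;          /* physical address of the first byte */
	size_t length;          /* segment length in bytes */
	uint64_t dma_address;   /* filled in by toy_layout_segments() */
};

/* Pages touched by 'size' bytes starting at 'addr' (offset included). */
static unsigned long toy_aligned_nrpages(uint64_t addr, size_t size)
{
	uint64_t off = addr & (TOY_PAGE_SIZE - 1);

	return (unsigned long)((off + size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT);
}

/*
 * Lay the segments out back to back in IOVA space, starting at
 * start_pfn. Returns the number of IOVA pages consumed.
 */
static unsigned long toy_layout_segments(struct toy_seg *segs, size_t n,
					 unsigned long start_pfn)
{
	unsigned long pfn = start_pfn;

	assert(segs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t pgoff = segs[i].phys & (TOY_PAGE_SIZE - 1);

		/* IOVA page base plus the offset within the first page */
		segs[i].dma_address = ((uint64_t)pfn << TOY_PAGE_SHIFT) + pgoff;
		pfn += toy_aligned_nrpages(segs[i].phys, segs[i].length);
	}
	return pfn - start_pfn;
}
```

With a first segment of 0x1000 bytes at physical address 0x10800 and a
second segment of 0x200 bytes at 0x40000, starting at IOVA pfn 0x100,
the first segment spans two pages and gets DMA address 0x100800, and the
second gets 0x102000.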



* [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (3 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 4/8] iommu/vt-d: Apply per domain " Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-12-02 23:27   ` Jacob Pan
  2019-11-28  2:25 ` [PATCH v2 6/8] iommu/vt-d: Implement first level page table ops Lu Baolu
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu, Yi Sun

This adds functions to manipulate first level page tables
which could be used by a scalable mode capable IOMMU unit.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/Makefile             |   2 +-
 drivers/iommu/intel-iommu.c        |  33 +++
 drivers/iommu/intel-pgtable.c      | 376 +++++++++++++++++++++++++++++
 include/linux/intel-iommu.h        |  33 ++-
 include/trace/events/intel_iommu.h |  60 +++++
 5 files changed, 502 insertions(+), 2 deletions(-)
 create mode 100644 drivers/iommu/intel-pgtable.c

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 35d17094fe3b..aa04f4c3ae26 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -18,7 +18,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
 obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
 obj-$(CONFIG_DMAR_TABLE) += dmar.o
 obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
-obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
+obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
 obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
 obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
 obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 66f76f6df2c2..a314892ee72b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1670,6 +1670,37 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
 #endif
 }
 
+/* First level 5-level paging support */
+static bool first_lvl_5lp_support(void)
+{
+	struct dmar_drhd_unit *drhd;
+	struct intel_iommu *iommu;
+	static int first_level_5lp_supported = -1;
+
+	if (likely(first_level_5lp_supported != -1))
+		return first_level_5lp_supported;
+
+	first_level_5lp_supported = 1;
+#ifdef CONFIG_X86
+	/* Match IOMMU first level and CPU paging mode */
+	if (!cpu_feature_enabled(X86_FEATURE_LA57)) {
+		first_level_5lp_supported = 0;
+		return first_level_5lp_supported;
+	}
+#endif /* #ifdef CONFIG_X86 */
+
+	rcu_read_lock();
+	for_each_active_iommu(iommu, drhd) {
+		if (!cap_5lp_support(iommu->cap)) {
+			first_level_5lp_supported = 0;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return first_level_5lp_supported;
+}
+
 static struct dmar_domain *alloc_domain(int flags)
 {
 	struct dmar_domain *domain;
@@ -1683,6 +1714,8 @@ static struct dmar_domain *alloc_domain(int flags)
 	domain->flags = flags;
 	domain->has_iotlb_device = false;
 	domain->ops = &second_lvl_pgtable_ops;
+	domain->first_lvl_5lp = first_lvl_5lp_support();
+	spin_lock_init(&domain->page_table_lock);
 	INIT_LIST_HEAD(&domain->devices);
 
 	return domain;
diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
new file mode 100644
index 000000000000..4a26d08a7570
--- /dev/null
+++ b/drivers/iommu/intel-pgtable.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * intel-pgtable.c - Intel IOMMU page table manipulation library
+ *
+ * Copyright (C) 2019 Intel Corporation
+ *
+ * Author: Lu Baolu <baolu.lu@linux.intel.com>
+ */
+
+#define pr_fmt(fmt)     "DMAR: " fmt
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/io.h>
+#include <linux/export.h>
+#include <linux/intel-iommu.h>
+#include <asm/cacheflush.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <trace/events/intel_iommu.h>
+
+/*
+ * first_lvl_map: Map a range of IO virtual address to physical addresses.
+ */
+#ifdef CONFIG_X86
+#define pgtable_populate(domain, nm)					\
+do {									\
+	void *__new = alloc_pgtable_page(domain->nid);			\
+	if (!__new)							\
+		return -ENOMEM;						\
+	smp_wmb();							\
+	spin_lock(&(domain)->page_table_lock);				\
+	if (nm ## _present(*nm)) {					\
+		free_pgtable_page(__new);				\
+	} else {							\
+		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
+		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
+	}								\
+	spin_unlock(&(domain)->page_table_lock);			\
+} while (0)
+
+static int
+first_lvl_map_pte_range(struct dmar_domain *domain, pmd_t *pmd,
+			unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
+{
+	pte_t *pte, *first_pte;
+	u64 pfn;
+
+	pfn = phys_addr >> PAGE_SHIFT;
+	if (unlikely(pmd_none(*pmd)))
+		pgtable_populate(domain, pmd);
+
+	first_pte = pte = pte_offset_kernel(pmd, addr);
+
+	do {
+		if (pte_present(*pte))
+			pr_crit("ERROR: PTE for vPFN 0x%llx already set to 0x%llx\n",
+				pfn, (unsigned long long)pte_val(*pte));
+		set_pte(pte, pfn_pte(pfn, prot));
+		pfn++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
+
+	return 0;
+}
+
+static int
+first_lvl_map_pmd_range(struct dmar_domain *domain, pud_t *pud,
+			unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	pmd_t *pmd;
+
+	if (unlikely(pud_none(*pud)))
+		pgtable_populate(domain, pud);
+	pmd = pmd_offset(pud, addr);
+
+	phys_addr -= addr;
+	do {
+		next = pmd_addr_end(addr, end);
+		if (first_lvl_map_pte_range(domain, pmd, addr, next,
+					    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pmd++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int
+first_lvl_map_pud_range(struct dmar_domain *domain, p4d_t *p4d,
+			unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	pud_t *pud;
+
+	if (unlikely(p4d_none(*p4d)))
+		pgtable_populate(domain, p4d);
+
+	pud = pud_offset(p4d, addr);
+
+	phys_addr -= addr;
+	do {
+		next = pud_addr_end(addr, end);
+		if (first_lvl_map_pmd_range(domain, pud, addr, next,
+					    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pud++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int
+first_lvl_map_p4d_range(struct dmar_domain *domain, pgd_t *pgd,
+			unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	p4d_t *p4d;
+
+	if (domain->first_lvl_5lp && unlikely(pgd_none(*pgd)))
+		pgtable_populate(domain, pgd);
+
+	p4d = p4d_offset(pgd, addr);
+
+	phys_addr -= addr;
+	do {
+		next = p4d_addr_end(addr, end);
+		if (first_lvl_map_pud_range(domain, p4d, addr, next,
+					    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (p4d++, addr = next, addr != end);
+
+	return 0;
+}
+
+int first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
+			unsigned long end, phys_addr_t phys_addr, int dma_prot)
+{
+	unsigned long next;
+	pgprot_t prot;
+	pgd_t *pgd;
+
+	trace_domain_mm_map(domain, addr, end, phys_addr);
+
+	/*
+	 * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
+	 * for a pte that requires write operation.
+	 */
+	prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
+	if (WARN_ON(addr >= end))
+		return -EINVAL;
+
+	phys_addr -= addr;
+	pgd = pgd_offset_pgd(domain->pgd, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (first_lvl_map_p4d_range(domain, pgd, addr, next,
+					    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pgd++, addr = next, addr != end);
+
+	return 0;
+}
+
+/*
+ * first_lvl_unmap: Unmap an existing mapping between a range of IO virtual
+ *		    address and physical addresses.
+ */
+static struct page *
+first_lvl_unmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
+			  unsigned long addr, unsigned long end,
+			  struct page *freelist)
+{
+	unsigned long start;
+	pte_t *pte, *first_pte;
+
+	start = addr;
+	pte = pte_offset_kernel(pmd, addr);
+	first_pte = pte;
+	do {
+		set_pte(pte, __pte(0));
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
+
+	/*
+	 * Reclaim pmd page, lock is unnecessary here if it owns
+	 * the whole range.
+	 */
+	if (start != end && IS_ALIGNED(start | end, PMD_SIZE)) {
+		struct page *pte_page;
+
+		pte_page = pmd_page(*pmd);
+		pte_page->freelist = freelist;
+		freelist = pte_page;
+		pmd_clear(pmd);
+		domain_flush_cache(domain, pmd, sizeof(pmd_t));
+	}
+
+	return freelist;
+}
+
+static struct page *
+first_lvl_unmap_pmd_range(struct dmar_domain *domain, pud_t *pud,
+			  unsigned long addr, unsigned long end,
+			  struct page *freelist)
+{
+	pmd_t *pmd;
+	unsigned long start, next;
+
+	start = addr;
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		freelist = first_lvl_unmap_pte_range(domain, pmd,
+						     addr, next, freelist);
+	} while (pmd++, addr = next, addr != end);
+
+	/*
+	 * Reclaim pud page, lock is unnecessary here if it owns
+	 * the whole range.
+	 */
+	if (start != end && IS_ALIGNED(start | end, PUD_SIZE)) {
+		struct page *pmd_page;
+
+		pmd_page = pud_page(*pud);
+		pmd_page->freelist = freelist;
+		freelist = pmd_page;
+		pud_clear(pud);
+		domain_flush_cache(domain, pud, sizeof(pud_t));
+	}
+
+	return freelist;
+}
+
+static struct page *
+first_lvl_unmap_pud_range(struct dmar_domain *domain, p4d_t *p4d,
+			  unsigned long addr, unsigned long end,
+			  struct page *freelist)
+{
+	pud_t *pud;
+	unsigned long start, next;
+
+	start = addr;
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		freelist = first_lvl_unmap_pmd_range(domain, pud,
+						     addr, next, freelist);
+	} while (pud++, addr = next, addr != end);
+
+	/*
+	 * Reclaim p4d page, lock is unnecessary here if it owns
+	 * the whole range.
+	 */
+	if (start != end && IS_ALIGNED(start | end, P4D_SIZE)) {
+		struct page *pud_page;
+
+		pud_page = p4d_page(*p4d);
+		pud_page->freelist = freelist;
+		freelist = pud_page;
+		p4d_clear(p4d);
+		domain_flush_cache(domain, p4d, sizeof(p4d_t));
+	}
+
+	return freelist;
+}
+
+static struct page *
+first_lvl_unmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd,
+			  unsigned long addr, unsigned long end,
+			  struct page *freelist)
+{
+	p4d_t *p4d;
+	unsigned long start, next;
+
+	start = addr;
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		freelist = first_lvl_unmap_pud_range(domain, p4d,
+						     addr, next, freelist);
+	} while (p4d++, addr = next, addr != end);
+
+	/*
+	 * Reclaim pgd page, lock is unnecessary here if it owns
+	 * the whole range.
+	 */
+	if (domain->first_lvl_5lp && start != end &&
+	    IS_ALIGNED(start | end, PGDIR_SIZE)) {
+		struct page *p4d_page;
+
+		p4d_page = pgd_page(*pgd);
+		p4d_page->freelist = freelist;
+		freelist = p4d_page;
+		pgd_clear(pgd);
+		domain_flush_cache(domain, pgd, sizeof(pgd_t));
+	}
+
+	return freelist;
+}
+
+struct page *first_lvl_unmap_range(struct dmar_domain *domain,
+				   unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	struct page *freelist = NULL;
+
+	trace_domain_mm_unmap(domain, addr, end);
+
+	if (WARN_ON(addr >= end))
+		return NULL;
+
+	pgd = pgd_offset_pgd(domain->pgd, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		freelist = first_lvl_unmap_p4d_range(domain, pgd,
+						     addr, next, freelist);
+	} while (pgd++, addr = next, addr != end);
+
+	return freelist;
+}
+
+static pte_t *iova_to_pte(struct dmar_domain *domain, unsigned long iova)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (WARN_ON_ONCE(!IS_ALIGNED(iova, PAGE_SIZE)))
+		return NULL;
+
+	pgd = pgd_offset_pgd(domain->pgd, iova);
+	if (pgd_none_or_clear_bad(pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, iova);
+	if (p4d_none_or_clear_bad(p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, iova);
+	if (pud_none_or_clear_bad(pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, iova);
+	if (pmd_none_or_clear_bad(pmd))
+		return NULL;
+
+	return pte_offset_kernel(pmd, iova);
+}
+
+phys_addr_t
+first_lvl_iova_to_phys(struct dmar_domain *domain, unsigned long iova)
+{
+	pte_t *pte = iova_to_pte(domain, PAGE_ALIGN(iova));
+
+	if (!pte || !pte_present(*pte))
+		return 0;
+
+	return (pte_val(*pte) & PTE_PFN_MASK) | (iova & ~PAGE_MASK);
+}
+#endif /* CONFIG_X86 */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 9b259756057b..9273e3f59078 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -540,9 +540,11 @@ struct dmar_domain {
 	struct iova_domain iovad;	/* iova's that belong to this domain */
 
 	/* page table used by this domain */
-	struct dma_pte	*pgd;		/* virtual address */
+	void		*pgd;		/* virtual address */
+	spinlock_t page_table_lock;	/* Protects page tables */
 	int		gaw;		/* max guest address width */
 	const struct pgtable_ops *ops;	/* page table ops */
+	bool		first_lvl_5lp;	/* First level 5-level paging support */
 
 	/* adjusted guest address width, 0 is level 2 30-bit */
 	int		agaw;
@@ -708,6 +710,35 @@ int for_each_device_domain(int (*fn)(struct device_domain_info *info,
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
 
+#ifdef CONFIG_X86
+int first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
+			unsigned long end, phys_addr_t phys_addr, int dma_prot);
+struct page *first_lvl_unmap_range(struct dmar_domain *domain,
+				   unsigned long addr, unsigned long end);
+phys_addr_t first_lvl_iova_to_phys(struct dmar_domain *domain,
+				   unsigned long iova);
+#else
+static inline int
+first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
+		    unsigned long end, phys_addr_t phys_addr, int dma_prot)
+{
+	return -ENODEV;
+}
+
+static inline struct page *
+first_lvl_unmap_range(struct dmar_domain *domain,
+		      unsigned long addr, unsigned long end)
+{
+	return NULL;
+}
+
+static inline phys_addr_t
+first_lvl_iova_to_phys(struct dmar_domain *domain, unsigned long iova)
+{
+	return 0;
+}
+#endif /* CONFIG_X86 */
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 extern void intel_svm_check(struct intel_iommu *iommu);
 extern int intel_svm_enable_prq(struct intel_iommu *iommu);
diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h
index 54e61d456cdf..e8c95290fd13 100644
--- a/include/trace/events/intel_iommu.h
+++ b/include/trace/events/intel_iommu.h
@@ -99,6 +99,66 @@ DEFINE_EVENT(dma_unmap, bounce_unmap_single,
 	TP_ARGS(dev, dev_addr, size)
 );
 
+DECLARE_EVENT_CLASS(domain_map,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end, phys_addr_t phys_addr),
+
+	TP_ARGS(domain, addr, end, phys_addr),
+
+	TP_STRUCT__entry(
+		__field(struct dmar_domain *, domain)
+		__field(unsigned long, addr)
+		__field(unsigned long, end)
+		__field(phys_addr_t, phys_addr)
+	),
+
+	TP_fast_assign(
+		__entry->domain = domain;
+		__entry->addr = addr;
+		__entry->end = end;
+		__entry->phys_addr = phys_addr;
+	),
+
+	TP_printk("domain=%p addr=0x%lx end=0x%lx phys_addr=0x%llx",
+		  __entry->domain, __entry->addr, __entry->end,
+		  (unsigned long long)__entry->phys_addr)
+);
+
+DEFINE_EVENT(domain_map, domain_mm_map,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end, phys_addr_t phys_addr),
+
+	TP_ARGS(domain, addr, end, phys_addr)
+);
+
+DECLARE_EVENT_CLASS(domain_unmap,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end),
+
+	TP_ARGS(domain, addr, end),
+
+	TP_STRUCT__entry(
+		__field(struct dmar_domain *, domain)
+		__field(unsigned long, addr)
+		__field(unsigned long, end)
+	),
+
+	TP_fast_assign(
+		__entry->domain = domain;
+		__entry->addr = addr;
+		__entry->end = end;
+	),
+
+	TP_printk("domain=%p addr=0x%lx end=0x%lx",
+		  __entry->domain, __entry->addr, __entry->end)
+);
+
+DEFINE_EVENT(domain_unmap, domain_mm_unmap,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end),
+
+	TP_ARGS(domain, addr, end)
+);
 #endif /* _TRACE_INTEL_IOMMU_H */
 
 /* This part must be outside protection */
-- 
2.17.1
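
The map helpers in this patch all share one walking pattern: subtract the start address from the physical address once, then advance through the range one table-sized span at a time so that "phys_addr + addr" always gives the physical address of the current chunk. A minimal userspace sketch of that pattern follows. SPAN_SHIFT, span_addr_end() and walk_range() are hypothetical names chosen for illustration (a 2 MiB span stands in for PMD_SIZE); they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2 MiB span standing in for PMD_SIZE. */
#define SPAN_SHIFT 21
#define SPAN_SIZE  (UINT64_C(1) << SPAN_SHIFT)
#define SPAN_MASK  (~(SPAN_SIZE - 1))

/* Same idea as pmd_addr_end(): next span boundary after addr, capped at end. */
static uint64_t span_addr_end(uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + SPAN_SIZE) & SPAN_MASK;

	/* The "- 1" keeps the comparison correct if boundary wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Walk [addr, end) the way the map helpers do. Records the physical
 * address of each chunk in chunk_phys[] (up to max_chunks entries)
 * and returns the number of chunks visited.
 */
static int walk_range(uint64_t addr, uint64_t end, uint64_t phys,
		      uint64_t *chunk_phys, int max_chunks)
{
	uint64_t next;
	int n = 0;

	assert(addr < end);

	phys -= addr;	/* from here on, phys + addr tracks the cursor */
	do {
		next = span_addr_end(addr, end);
		if (n < max_chunks)
			chunk_phys[n] = phys + addr;
		n++;
	} while (addr = next, addr != end);

	return n;
}
```

A range that starts one page below a span boundary and ends one page above the next boundary is split into three chunks, each of which would be handed to the next lower level in the real code.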



* [PATCH v2 6/8] iommu/vt-d: Implement first level page table ops
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (4 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 7/8] iommu/vt-d: Identify domains using first level page table Lu Baolu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu, Yi Sun

This adds the implementation of page table callbacks for
the first level page table.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 56 +++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a314892ee72b..695a7a5fbe8e 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -414,6 +414,7 @@ int for_each_device_domain(int (*fn)(struct device_domain_info *info,
 }
 
 const struct iommu_ops intel_iommu_ops;
+static const struct pgtable_ops first_lvl_pgtable_ops;
 static const struct pgtable_ops second_lvl_pgtable_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
@@ -2330,6 +2331,61 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	return 0;
 }
 
+static int first_lvl_domain_map_range(struct dmar_domain *domain,
+				      unsigned long iova, phys_addr_t paddr,
+				      size_t size, int prot)
+{
+	return first_lvl_map_range(domain, PAGE_ALIGN(iova),
+				   round_up(iova + size, PAGE_SIZE),
+				   PAGE_ALIGN(paddr), prot);
+}
+
+static struct page *
+first_lvl_domain_unmap_range(struct dmar_domain *domain,
+			     unsigned long iova, size_t size)
+{
+	return first_lvl_unmap_range(domain, PAGE_ALIGN(iova),
+				     round_up(iova + size, PAGE_SIZE));
+}
+
+static phys_addr_t
+first_lvl_domain_iova_to_phys(struct dmar_domain *domain,
+			      unsigned long iova)
+{
+	return first_lvl_iova_to_phys(domain, iova);
+}
+
+static void
+first_lvl_domain_flush_tlb_range(struct dmar_domain *domain,
+				 struct intel_iommu *iommu,
+				 unsigned long iova, size_t size, bool ih)
+{
+	unsigned long pages = aligned_nrpages(iova, size);
+	u16 did = domain->iommu_did[iommu->seq_id];
+	unsigned int mask;
+
+	if (pages) {
+		mask = ilog2(__roundup_pow_of_two(pages));
+		iova &= (u64)-1 << (VTD_PAGE_SHIFT + mask);
+	} else {
+		mask = MAX_AGAW_PFN_WIDTH;
+		iova = 0;
+		pages = -1;
+	}
+
+	iommu->flush.p_iotlb_inv(iommu, did, domain->default_pasid,
+				 iova, pages, ih);
+
+	iommu_flush_dev_iotlb(domain, iova, mask);
+}
+
+static const struct pgtable_ops first_lvl_pgtable_ops = {
+	.map_range		= first_lvl_domain_map_range,
+	.unmap_range		= first_lvl_domain_unmap_range,
+	.iova_to_phys		= first_lvl_domain_iova_to_phys,
+	.flush_tlb_range	= first_lvl_domain_flush_tlb_range,
+};
+
 static int second_lvl_domain_map_range(struct dmar_domain *domain,
 				       unsigned long iova, phys_addr_t paddr,
 				       size_t size, int prot)
-- 
2.17.1
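
The flush_tlb_range callback above rounds the page count up to a power of two and aligns the IOVA down to the matching block before invalidating. The sketch below reproduces only that arithmetic in userspace. flush_order() and flush_base() are hypothetical helpers; flush_order() is assumed to give the same result as ilog2(__roundup_pow_of_two(pages)) for a non-zero page count.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12	/* 4 KiB VT-d pages, as in the driver */

/* Smallest order such that (1 << order) >= pages; pages must be non-zero. */
static unsigned int flush_order(uint64_t pages)
{
	unsigned int order = 0;

	assert(pages != 0 && pages <= (UINT64_C(1) << 63));
	while ((UINT64_C(1) << order) < pages)
		order++;
	return order;
}

/* Align iova down to the start of the 2^order-page invalidation block. */
static uint64_t flush_base(uint64_t iova, unsigned int order)
{
	assert(VTD_PAGE_SHIFT + order < 64);
	return iova & (~UINT64_C(0) << (VTD_PAGE_SHIFT + order));
}
```

For example, a 5-page request is widened to an 8-page block (order 3), so the base address loses its low 15 bits.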



* [PATCH v2 7/8] iommu/vt-d: Identify domains using first level page table
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (5 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 6/8] iommu/vt-d: Implement first level page table ops Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-11-28  2:25 ` [PATCH v2 8/8] iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr Lu Baolu
  2019-12-02 20:19 ` [PATCH v2 0/8] Use 1st-level for DMA remapping Jacob Pan
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu, Yi Sun

This checks whether a domain should use the first level page table
for map/unmap. If so, the domain is attached to the device in
first level translation mode.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 63 +++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 695a7a5fbe8e..68b2f98ecd65 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -417,6 +417,11 @@ const struct iommu_ops intel_iommu_ops;
 static const struct pgtable_ops first_lvl_pgtable_ops;
 static const struct pgtable_ops second_lvl_pgtable_ops;
 
+static inline bool domain_pgtable_first_lvl(struct dmar_domain *domain)
+{
+	return domain->ops == &first_lvl_pgtable_ops;
+}
+
 static bool translation_pre_enabled(struct intel_iommu *iommu)
 {
 	return (iommu->flags & VTD_FLAG_TRANS_PRE_ENABLED);
@@ -1702,6 +1707,44 @@ static bool first_lvl_5lp_support(void)
 	return first_level_5lp_supported;
 }
 
+/*
+ * Check and return whether first level is used by default for
+ * DMA translation.
+ */
+static bool first_level_by_default(void)
+{
+	struct dmar_drhd_unit *drhd;
+	struct intel_iommu *iommu;
+	static int first_level_support = -1;
+
+	if (likely(first_level_support != -1))
+		return first_level_support;
+
+	first_level_support = 1;
+
+	rcu_read_lock();
+	for_each_active_iommu(iommu, drhd) {
+		if (!sm_supported(iommu) || !ecap_flts(iommu->ecap)) {
+			first_level_support = 0;
+			break;
+		}
+#ifdef CONFIG_X86
+		/*
+		 * Currently we don't support paging mode mismatching.
+		 * Could be turned on later if there is a case.
+		 */
+		if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+		    !cap_5lp_support(iommu->cap)) {
+			first_level_support = 0;
+			break;
+		}
+#endif /* #ifdef CONFIG_X86 */
+	}
+	rcu_read_unlock();
+
+	return first_level_support;
+}
+
 static struct dmar_domain *alloc_domain(int flags)
 {
 	struct dmar_domain *domain;
@@ -1714,7 +1757,10 @@ static struct dmar_domain *alloc_domain(int flags)
 	domain->nid = NUMA_NO_NODE;
 	domain->flags = flags;
 	domain->has_iotlb_device = false;
-	domain->ops = &second_lvl_pgtable_ops;
+	if (first_level_by_default())
+		domain->ops = &first_lvl_pgtable_ops;
+	else
+		domain->ops = &second_lvl_pgtable_ops;
 	domain->first_lvl_5lp = first_lvl_5lp_support();
 	spin_lock_init(&domain->page_table_lock);
 	INIT_LIST_HEAD(&domain->devices);
@@ -2710,6 +2756,11 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 		if (hw_pass_through && domain_type_is_si(domain))
 			ret = intel_pasid_setup_pass_through(iommu, domain,
 					dev, PASID_RID2PASID);
+		else if (domain_pgtable_first_lvl(domain))
+			ret = intel_pasid_setup_first_level(iommu, dev,
+					domain->pgd, PASID_RID2PASID,
+					domain->iommu_did[iommu->seq_id],
+					PASID_FLAG_SUPERVISOR_MODE);
 		else
 			ret = intel_pasid_setup_second_level(iommu, domain,
 					dev, PASID_RID2PASID);
@@ -5597,8 +5648,14 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 		goto attach_failed;
 
 	/* Setup the PASID entry for mediated devices: */
-	ret = intel_pasid_setup_second_level(iommu, domain, dev,
-					     domain->default_pasid);
+	if (domain_pgtable_first_lvl(domain))
+		ret = intel_pasid_setup_first_level(iommu, dev,
+				domain->pgd, domain->default_pasid,
+				domain->iommu_did[iommu->seq_id],
+				PASID_FLAG_SUPERVISOR_MODE);
+	else
+		ret = intel_pasid_setup_second_level(iommu, domain, dev,
+						     domain->default_pasid);
 	if (ret)
 		goto table_failed;
 	spin_unlock(&iommu->lock);
-- 
2.17.1
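
first_level_by_default() probes every active IOMMU once and caches the answer in a function-local static initialised to -1. The following userspace sketch shows that caching pattern in isolation. struct unit_caps, all_units_first_level() and probe_calls are hypothetical names for illustration, and the two booleans stand in for sm_supported() and ecap_flts().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-unit capability record; not the driver's struct. */
struct unit_caps {
	bool scalable_mode;
	bool first_level;
};

static int probe_calls;	/* counts full scans, for illustration only */

/* True only if every unit supports both features; result is cached. */
static bool all_units_first_level(const struct unit_caps *units, size_t n)
{
	static int cached = -1;	/* -1: not probed yet, otherwise 0 or 1 */
	size_t i;

	if (cached != -1)
		return cached;

	assert(units != NULL || n == 0);
	probe_calls++;
	cached = 1;
	for (i = 0; i < n; i++) {
		if (!units[i].scalable_mode || !units[i].first_level) {
			cached = 0;
			break;
		}
	}
	return cached;
}
```

One consequence of this design is that the first answer sticks: in the sketch, changing a unit's capabilities after the first call does not change the result, and the scan runs only once.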



* [PATCH v2 8/8] iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (6 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 7/8] iommu/vt-d: Identify domains using first level page table Lu Baolu
@ 2019-11-28  2:25 ` Lu Baolu
  2019-12-02 20:19 ` [PATCH v2 0/8] Use 1st-level for DMA remapping Jacob Pan
  8 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-11-28  2:25 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: ashok.raj, sanjay.k.kumar, jacob.jun.pan, kevin.tian, yi.l.liu,
	yi.y.sun, Peter Xu, iommu, kvm, linux-kernel, Lu Baolu, Yi Sun

This adds the Intel VT-d specific callback for setting the
DOMAIN_ATTR_NESTING domain attribute. It is necessary
to let the VT-d driver know that the domain represents
a virtual machine which requires the IOMMU hardware to
support nested translation mode. Return success if the
IOMMU hardware supports nested mode, otherwise failure.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 56 +++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68b2f98ecd65..ee717dcb9644 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -308,6 +308,12 @@ static int hw_pass_through = 1;
  */
 #define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
 
+/*
+ * Domain represents a virtual machine which demands iommu nested
+ * translation mode support.
+ */
+#define DOMAIN_FLAG_NESTING_MODE		BIT(2)
+
 #define for_each_domain_iommu(idx, domain)			\
 	for (idx = 0; idx < g_num_of_iommus; idx++)		\
 		if (domain->iommu_refcnt[idx])
@@ -5929,6 +5935,24 @@ static inline bool iommu_pasid_support(void)
 	return ret;
 }
 
+static inline bool nested_mode_support(void)
+{
+	struct dmar_drhd_unit *drhd;
+	struct intel_iommu *iommu;
+	bool ret = true;
+
+	rcu_read_lock();
+	for_each_active_iommu(iommu, drhd) {
+		if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
+			ret = false;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
 	if (cap == IOMMU_CAP_CACHE_COHERENCY)
@@ -6305,10 +6329,42 @@ static bool intel_iommu_is_attach_deferred(struct iommu_domain *domain,
 	return dev->archdata.iommu == DEFER_DEVICE_DOMAIN_INFO;
 }
 
+static int
+intel_iommu_domain_set_attr(struct iommu_domain *domain,
+			    enum iommu_attr attr, void *data)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long flags;
+	int ret = 0;
+
+	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
+		return -EINVAL;
+
+	switch (attr) {
+	case DOMAIN_ATTR_NESTING:
+		spin_lock_irqsave(&device_domain_lock, flags);
+		if (nested_mode_support() &&
+		    list_empty(&dmar_domain->devices)) {
+			dmar_domain->flags |= DOMAIN_FLAG_NESTING_MODE;
+			dmar_domain->ops = &second_lvl_pgtable_ops;
+		} else {
+			ret = -ENODEV;
+		}
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
 	.domain_free		= intel_iommu_domain_free,
+	.domain_set_attr	= intel_iommu_domain_set_attr,
 	.attach_dev		= intel_iommu_attach_device,
 	.detach_dev		= intel_iommu_detach_device,
 	.aux_attach_dev		= intel_iommu_aux_attach_device,
-- 
2.17.1
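
The DOMAIN_ATTR_NESTING case above succeeds only when the hardware supports nested mode and no device is attached yet; on success it sets the nesting flag and switches the domain back to the second level ops. A minimal userspace model of that decision is sketched below. struct toy_domain, enum pgtable_kind and toy_set_nesting() are hypothetical stand-ins, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_NESTING (1u << 2)	/* mirrors DOMAIN_FLAG_NESTING_MODE = BIT(2) */

enum pgtable_kind { PGTABLE_FIRST_LEVEL, PGTABLE_SECOND_LEVEL };

/* Hypothetical stand-in for the fields of struct dmar_domain used here. */
struct toy_domain {
	unsigned int flags;
	int attached_devices;
	enum pgtable_kind kind;
};

/* Returns 0 on success, -ENODEV if nesting cannot be enabled. */
static int toy_set_nesting(struct toy_domain *d, bool hw_nested)
{
	assert(d != NULL);

	if (!hw_nested || d->attached_devices != 0)
		return -ENODEV;

	d->flags |= FLAG_NESTING;
	d->kind = PGTABLE_SECOND_LEVEL;	/* host side keeps GPA->HPA */
	return 0;
}
```

In this model a nesting domain ends up with a host-managed second level table, which leaves the first level free for a guest-provided table.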



* Re: [PATCH v2 0/8] Use 1st-level for DMA remapping
  2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
                   ` (7 preceding siblings ...)
  2019-11-28  2:25 ` [PATCH v2 8/8] iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr Lu Baolu
@ 2019-12-02 20:19 ` Jacob Pan
  2019-12-03  2:19   ` Lu Baolu
  8 siblings, 1 reply; 14+ messages in thread
From: Jacob Pan @ 2019-12-02 20:19 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, David Woodhouse, Alex Williamson, ashok.raj,
	sanjay.k.kumar, kevin.tian, yi.l.liu, yi.y.sun, Peter Xu, iommu,
	kvm, linux-kernel, jacob.jun.pan

On Thu, 28 Nov 2019 10:25:42 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Intel VT-d in scalable mode supports two types of page talbes
tables
> for DMA translation: the first level page table and the second
> level page table. The first level page table uses the same
> format as the CPU page table, while the second level page table
> keeps compatible with previous formats. The software is able
> to choose any one of them for DMA remapping according to the use
> case.
> 
> This patchset aims to move IOVA (I/O Virtual Address) translation
move guest IOVA only, right?
> to 1st-level page table in scalable mode. This will simplify vIOMMU
> (IOMMU simulated by VM hypervisor) design by using the two-stage
> translation, a.k.a. nested mode translation.
> 
> As Intel VT-d architecture offers caching mode, guest IOVA (GIOVA)
> support is now implemented in a shadow page manner. The device
> simulation software, like QEMU, has to figure out GIOVA->GPA mappings
> and write them to a shadowed page table, which will be used by the
> physical IOMMU. Each time when mappings are created or destroyed in
> vIOMMU, the simulation software has to intervene. Hence, the changes
> on GIOVA->GPA could be shadowed to host.
> 
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .--------------------.
>      |           |IOTLB flush trap |        QEMU        |
>      .-----------. (map/unmap)     |--------------------|
>      |GIOVA->GPA |---------------->|    .------------.  |
>      '-----------'                 |    | GIOVA->HPA |  |
>      |           |                 |    '------------'  |
>      '-----------'                 |                    |
>                                    |                    |
>                                    '--------------------'
>                                                 |
>             <------------------------------------
>             |
>             v VFIO/IOMMU API
>       .-----------.
>       |  pIOMMU   |
>       |-----------|
>       |           |
>       .-----------.
>       |GIOVA->HPA |
>       '-----------'
>       |           |
>       '-----------'
> 
> In VT-d 3.0, scalable mode is introduced, which offers two-level
> translation page tables and nested translation mode. Regards to
> GIOVA support, it can be simplified by 1) moving the GIOVA support
> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> dual-stage) translation in host. Compared with current shadow GIOVA
> support, the new approach makes the vIOMMU design simpler and more
> efficient as we only need to flush the pIOMMU IOTLB and possible
> device-IOTLB when an IOVA mapping in vIOMMU is torn down.
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .-----------.
>      |           |IOTLB flush trap |   QEMU    |
>      .-----------.    (unmap)      |-----------|
>      |GIOVA->GPA |---------------->|           |
>      '-----------'                 '-----------'
>      |           |                       |
>      '-----------'                       |
>            <------------------------------
>            |      VFIO/IOMMU          
>            |  cache invalidation and  
>            | guest gpd bind interfaces
>            v
>      .-----------.
>      |  pIOMMU   |
>      |-----------|
>      .-----------.
>      |GIOVA->GPA |<---First level
>      '-----------'
>      | GPA->HPA  |<---Second level
>      '-----------'
>      '-----------'
> 
> This patch set includes two parts. The former part implements the
> per-domain page table abstraction, which makes the page table
> difference transparent to various map/unmap APIs. The later part
s/later/latter/
> applies the first level page table for IOVA translation unless the
> DOMAIN_ATTR_NESTING domain attribution has been set, which indicates
> nested mode in use.
> 
Maybe I am reading this wrong, but shouldn't it be the opposite?
i.e. Use FL page table for IOVA if it is a nesting domain?

> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> Based-on-idea-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> 
> Change log:
> 
>  v1->v2
>  - The first series was posted here
>    https://lkml.org/lkml/2019/9/23/297
>  - Use per domain page table ops to handle different page tables.
>  - Use first level for DMA remapping by default on both bare metal
>    and vm guest.
>  - Code refine according to code review comments for v1.
> 
> Lu Baolu (8):
>   iommu/vt-d: Add per domain page table ops
>   iommu/vt-d: Move domain_flush_cache helper into header
>   iommu/vt-d: Implement second level page table ops
>   iommu/vt-d: Apply per domain second level page table ops
>   iommu/vt-d: Add first level page table interfaces
>   iommu/vt-d: Implement first level page table ops
>   iommu/vt-d: Identify domains using first level page table
>   iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr
> 
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-iommu.c        | 412 +++++++++++++++++++++++------
>  drivers/iommu/intel-pgtable.c      | 376 ++++++++++++++++++++++++++
>  include/linux/intel-iommu.h        |  64 ++++-
>  include/trace/events/intel_iommu.h |  60 +++++
>  5 files changed, 837 insertions(+), 77 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
> 

[Jacob Pan]
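
The nested mode described in the quoted cover letter composes two lookups: the first level turns a GIOVA into a GPA, and the second level turns that GPA into an HPA. The sketch below models only this composition with flat lookup tables. It is a deliberate simplification (real hardware also translates the first level table pointers through the second level), and struct toy_map, toy_lookup() and toy_nested() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_SHIFT 12
#define PG_MASK  ((UINT64_C(1) << PG_SHIFT) - 1)
#define NO_MAP   UINT64_MAX

/* One page-granular mapping: input frame number -> output frame number. */
struct toy_map {
	uint64_t in_pfn;
	uint64_t out_pfn;
};

/* Translate addr through one table, keeping the in-page offset. */
static uint64_t toy_lookup(const struct toy_map *tbl, size_t n, uint64_t addr)
{
	size_t i;

	assert(tbl != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (tbl[i].in_pfn == (addr >> PG_SHIFT))
			return (tbl[i].out_pfn << PG_SHIFT) | (addr & PG_MASK);
	}
	return NO_MAP;
}

/* Nested walk: first level GIOVA->GPA, then second level GPA->HPA. */
static uint64_t toy_nested(const struct toy_map *fl, size_t nfl,
			   const struct toy_map *sl, size_t nsl,
			   uint64_t giova)
{
	uint64_t gpa = toy_lookup(fl, nfl, giova);

	return gpa == NO_MAP ? NO_MAP : toy_lookup(sl, nsl, gpa);
}
```

With this split, tearing down a guest mapping only touches the first level table, which matches the cover letter's point that only unmap needs an IOTLB flush trap.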


* Re: [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
  2019-11-28  2:25 ` [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces Lu Baolu
@ 2019-12-02 23:27   ` Jacob Pan
  2019-12-03  2:36     ` Lu Baolu
  2019-12-11  1:56     ` Lu Baolu
  0 siblings, 2 replies; 14+ messages in thread
From: Jacob Pan @ 2019-12-02 23:27 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, David Woodhouse, Alex Williamson, ashok.raj,
	sanjay.k.kumar, kevin.tian, yi.l.liu, yi.y.sun, Peter Xu, iommu,
	kvm, linux-kernel, Yi Sun, jacob.jun.pan

On Thu, 28 Nov 2019 10:25:47 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> This adds functions to manipulate first level page tables
> which could be used by a scalable mode capable IOMMU unit.
> 
FL and SL page tables are very similar, and I presume we are not using
all the flag bits in FL paging structures for DMA mapping. Are there
enough relevant differences to warrant a new set of helper functions
for FL? Or we can merge into one.

> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-iommu.c        |  33 +++
>  drivers/iommu/intel-pgtable.c      | 376 +++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h        |  33 ++-
>  include/trace/events/intel_iommu.h |  60 +++++
>  5 files changed, 502 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
> 
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 35d17094fe3b..aa04f4c3ae26 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -18,7 +18,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
>  obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
>  obj-$(CONFIG_DMAR_TABLE) += dmar.o
>  obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
> -obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
> +obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
>  obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
>  obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
>  obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 66f76f6df2c2..a314892ee72b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1670,6 +1670,37 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>  #endif
>  }
>  
> +/* First level 5-level paging support */
> +static bool first_lvl_5lp_support(void)
> +{
> +	struct dmar_drhd_unit *drhd;
> +	struct intel_iommu *iommu;
> +	static int first_level_5lp_supported = -1;
> +
> +	if (likely(first_level_5lp_supported != -1))
> +		return first_level_5lp_supported;
> +
> +	first_level_5lp_supported = 1;
> +#ifdef CONFIG_X86
> +	/* Match IOMMU first level and CPU paging mode */
> +	if (!cpu_feature_enabled(X86_FEATURE_LA57)) {
> +		first_level_5lp_supported = 0;
> +		return first_level_5lp_supported;
> +	}
> +#endif /* #ifdef CONFIG_X86 */
> +
> +	rcu_read_lock();
> +	for_each_active_iommu(iommu, drhd) {
> +		if (!cap_5lp_support(iommu->cap)) {
> +			first_level_5lp_supported = 0;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return first_level_5lp_supported;
> +}
> +
>  static struct dmar_domain *alloc_domain(int flags)
>  {
>  	struct dmar_domain *domain;
> @@ -1683,6 +1714,8 @@ static struct dmar_domain *alloc_domain(int flags)
>  	domain->flags = flags;
>  	domain->has_iotlb_device = false;
>  	domain->ops = &second_lvl_pgtable_ops;
> +	domain->first_lvl_5lp = first_lvl_5lp_support();
> +	spin_lock_init(&domain->page_table_lock);
>  	INIT_LIST_HEAD(&domain->devices);
>  
>  	return domain;
> diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
> new file mode 100644
> index 000000000000..4a26d08a7570
> --- /dev/null
> +++ b/drivers/iommu/intel-pgtable.c
> @@ -0,0 +1,376 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/**
> + * intel-pgtable.c - Intel IOMMU page table manipulation library
> + *
> + * Copyright (C) 2019 Intel Corporation
> + *
> + * Author: Lu Baolu <baolu.lu@linux.intel.com>
> + */
> +
> +#define pr_fmt(fmt)     "DMAR: " fmt
> +#include <linux/vmalloc.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/io.h>
> +#include <linux/export.h>
> +#include <linux/intel-iommu.h>
> +#include <asm/cacheflush.h>
> +#include <asm/pgtable.h>
> +#include <asm/pgalloc.h>
> +#include <trace/events/intel_iommu.h>
> +
> +/*
> + * first_lvl_map: Map a range of IO virtual address to physical addresses.
> + */
> +#ifdef CONFIG_X86
> +#define pgtable_populate(domain, nm)					\
> +do {									\
> +	void *__new = alloc_pgtable_page(domain->nid);			\
> +	if (!__new)							\
> +		return -ENOMEM;						\
> +	smp_wmb();							\
> +	spin_lock(&(domain)->page_table_lock);				\
> +	if (nm ## _present(*nm)) {					\
> +		free_pgtable_page(__new);				\
> +	} else {							\
> +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
> +		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
> +	}								\
> +	spin_unlock(&(domain)->page_table_lock);			\
> +} while (0)
> +
> +static int
> +first_lvl_map_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> +			unsigned long addr, unsigned long end,
> +			phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	pte_t *pte, *first_pte;
> +	u64 pfn;
> +
> +	pfn = phys_addr >> PAGE_SHIFT;
> +	if (unlikely(pmd_none(*pmd)))
> +		pgtable_populate(domain, pmd);
> +
> +	first_pte = pte = pte_offset_kernel(pmd, addr);
> +
> +	do {
> +		if (pte_present(*pte))
> +			pr_crit("ERROR: PTE for vPFN 0x%llx already set to 0x%llx\n",
> +				pfn, (unsigned long long)pte_val(*pte));
> +		set_pte(pte, pfn_pte(pfn, prot));
> +		pfn++;
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +
> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> +	return 0;
> +}
> +
> +static int
> +first_lvl_map_pmd_range(struct dmar_domain *domain, pud_t *pud,
> +			unsigned long addr, unsigned long end,
> +			phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	pmd_t *pmd;
> +
> +	if (unlikely(pud_none(*pud)))
> +		pgtable_populate(domain, pud);
> +	pmd = pmd_offset(pud, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = pmd_addr_end(addr, end);
> +		if (first_lvl_map_pte_range(domain, pmd, addr, next,
> +					    phys_addr + addr, prot))
> +			return -ENOMEM;
> +	} while (pmd++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int
> +first_lvl_map_pud_range(struct dmar_domain *domain, p4d_t *p4d,
> +			unsigned long addr, unsigned long end,
> +			phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	pud_t *pud;
> +
> +	if (unlikely(p4d_none(*p4d)))
> +		pgtable_populate(domain, p4d);
> +
> +	pud = pud_offset(p4d, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = pud_addr_end(addr, end);
> +		if (first_lvl_map_pmd_range(domain, pud, addr, next,
> +					    phys_addr + addr, prot))
> +			return -ENOMEM;
> +	} while (pud++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int
> +first_lvl_map_p4d_range(struct dmar_domain *domain, pgd_t *pgd,
> +			unsigned long addr, unsigned long end,
> +			phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	p4d_t *p4d;
> +
> +	if (domain->first_lvl_5lp && unlikely(pgd_none(*pgd)))
> +		pgtable_populate(domain, pgd);
> +
> +	p4d = p4d_offset(pgd, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = p4d_addr_end(addr, end);
> +		if (first_lvl_map_pud_range(domain, p4d, addr, next,
> +					    phys_addr + addr, prot))
> +			return -ENOMEM;
> +	} while (p4d++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +int first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
> +			unsigned long end, phys_addr_t phys_addr, int dma_prot)
> +{
> +	unsigned long next;
> +	pgprot_t prot;
> +	pgd_t *pgd;
> +
> +	trace_domain_mm_map(domain, addr, end, phys_addr);
> +
> +	/*
> +	 * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
> +	 * for a pte that requires write operation.
> +	 */
> +	prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
> +	if (WARN_ON(addr >= end))
> +		return -EINVAL;
> +
> +	phys_addr -= addr;
> +	pgd = pgd_offset_pgd(domain->pgd, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (first_lvl_map_p4d_range(domain, pgd, addr, next,
> +					    phys_addr + addr, prot))
> +			return -ENOMEM;
> +	} while (pgd++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +/*
> + * first_lvl_unmap: Unmap an existing mapping between a range of IO virtual
> + *		    address and physical addresses.
> + */
> +static struct page *
> +first_lvl_unmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> +			  unsigned long addr, unsigned long end,
> +			  struct page *freelist)
> +{
> +	unsigned long start;
> +	pte_t *pte, *first_pte;
> +
> +	start = addr;
> +	pte = pte_offset_kernel(pmd, addr);
> +	first_pte = pte;
> +	do {
> +		set_pte(pte, __pte(0));
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +
> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> +	/*
> +	 * Reclaim pmd page, lock is unnecessary here if it owns
> +	 * the whole range.
> +	 */
> +	if (start != end && IS_ALIGNED(start | end, PMD_SIZE)) {
> +		struct page *pte_page;
> +
> +		pte_page = pmd_page(*pmd);
> +		pte_page->freelist = freelist;
> +		freelist = pte_page;
> +		pmd_clear(pmd);
> +		domain_flush_cache(domain, pmd, sizeof(pmd_t));
> +	}
> +
> +	return freelist;
> +}
> +
> +static struct page *
> +first_lvl_unmap_pmd_range(struct dmar_domain *domain, pud_t *pud,
> +			  unsigned long addr, unsigned long end,
> +			  struct page *freelist)
> +{
> +	pmd_t *pmd;
> +	unsigned long start, next;
> +
> +	start = addr;
> +	pmd = pmd_offset(pud, addr);
> +	do {
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none_or_clear_bad(pmd))
> +			continue;
> +		freelist = first_lvl_unmap_pte_range(domain, pmd,
> +						     addr, next, freelist);
> +	} while (pmd++, addr = next, addr != end);
> +
> +	/*
> +	 * Reclaim pud page, lock is unnecessary here if it owns
> +	 * the whole range.
> +	 */
> +	if (start != end && IS_ALIGNED(start | end, PUD_SIZE)) {
> +		struct page *pmd_page;
> +
> +		pmd_page = pud_page(*pud);
> +		pmd_page->freelist = freelist;
> +		freelist = pmd_page;
> +		pud_clear(pud);
> +		domain_flush_cache(domain, pud, sizeof(pud_t));
> +	}
> +
> +	return freelist;
> +}
> +
> +static struct page *
> +first_lvl_unmap_pud_range(struct dmar_domain *domain, p4d_t *p4d,
> +			  unsigned long addr, unsigned long end,
> +			  struct page *freelist)
> +{
> +	pud_t *pud;
> +	unsigned long start, next;
> +
> +	start = addr;
> +	pud = pud_offset(p4d, addr);
> +	do {
> +		next = pud_addr_end(addr, end);
> +		if (pud_none_or_clear_bad(pud))
> +			continue;
> +		freelist = first_lvl_unmap_pmd_range(domain, pud,
> +						     addr, next, freelist);
> +	} while (pud++, addr = next, addr != end);
> +
> +	/*
> +	 * Reclaim p4d page, lock is unnecessary here if it owns
> +	 * the whole range.
> +	 */
> +	if (start != end && IS_ALIGNED(start | end, P4D_SIZE)) {
> +		struct page *pud_page;
> +
> +		pud_page = p4d_page(*p4d);
> +		pud_page->freelist = freelist;
> +		freelist = pud_page;
> +		p4d_clear(p4d);
> +		domain_flush_cache(domain, p4d, sizeof(p4d_t));
> +	}
> +
> +	return freelist;
> +}
> +
> +static struct page *
> +first_lvl_unmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd,
> +			  unsigned long addr, unsigned long end,
> +			  struct page *freelist)
> +{
> +	p4d_t *p4d;
> +	unsigned long start, next;
> +
> +	start = addr;
> +	p4d = p4d_offset(pgd, addr);
> +	do {
> +		next = p4d_addr_end(addr, end);
> +		if (p4d_none_or_clear_bad(p4d))
> +			continue;
> +		freelist = first_lvl_unmap_pud_range(domain, p4d,
> +						     addr, next, freelist);
> +	} while (p4d++, addr = next, addr != end);
> +
> +	/*
> +	 * Reclaim pgd page, lock is unnecessary here if it owns
> +	 * the whole range.
> +	 */
> +	if (domain->first_lvl_5lp && start != end &&
> +	    IS_ALIGNED(start | end, PGDIR_SIZE)) {
> +		struct page *p4d_page;
> +
> +		p4d_page = pgd_page(*pgd);
> +		p4d_page->freelist = freelist;
> +		freelist = p4d_page;
> +		pgd_clear(pgd);
> +		domain_flush_cache(domain, pgd, sizeof(pgd_t));
> +	}
> +
> +	return freelist;
> +}
> +
> +struct page *first_lvl_unmap_range(struct dmar_domain *domain,
> +				   unsigned long addr, unsigned long end)
> +{
> +	pgd_t *pgd;
> +	unsigned long next;
> +	struct page *freelist = NULL;
> +
> +	trace_domain_mm_unmap(domain, addr, end);
> +
> +	if (WARN_ON(addr >= end))
> +		return NULL;
> +
> +	pgd = pgd_offset_pgd(domain->pgd, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none_or_clear_bad(pgd))
> +			continue;
> +		freelist = first_lvl_unmap_p4d_range(domain, pgd,
> +						     addr, next, freelist);
> +	} while (pgd++, addr = next, addr != end);
> +
> +	return freelist;
> +}
> +
> +static pte_t *iova_to_pte(struct dmar_domain *domain, unsigned long iova)
> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	if (WARN_ON_ONCE(!IS_ALIGNED(iova, PAGE_SIZE)))
> +		return NULL;
> +
> +	pgd = pgd_offset_pgd(domain->pgd, iova);
> +	if (pgd_none_or_clear_bad(pgd))
> +		return NULL;
> +
> +	p4d = p4d_offset(pgd, iova);
> +	if (p4d_none_or_clear_bad(p4d))
> +		return NULL;
> +
> +	pud = pud_offset(p4d, iova);
> +	if (pud_none_or_clear_bad(pud))
> +		return NULL;
> +
> +	pmd = pmd_offset(pud, iova);
> +	if (pmd_none_or_clear_bad(pmd))
> +		return NULL;
> +
> +	return pte_offset_kernel(pmd, iova);
> +}
> +
> +phys_addr_t
> +first_lvl_iova_to_phys(struct dmar_domain *domain, unsigned long iova)
> +{
> +	pte_t *pte = iova_to_pte(domain, PAGE_ALIGN(iova));
> +
> +	if (!pte || !pte_present(*pte))
> +		return 0;
> +
> +	return (pte_val(*pte) & PTE_PFN_MASK) | (iova & ~PAGE_MASK);
> +}
> +#endif /* CONFIG_X86 */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 9b259756057b..9273e3f59078 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -540,9 +540,11 @@ struct dmar_domain {
>  	struct iova_domain iovad;	/* iova's that belong to this domain */
>  
>  	/* page table used by this domain */
> -	struct dma_pte	*pgd;		/* virtual address */
> +	void		*pgd;		/* virtual address */
> +	spinlock_t page_table_lock;	/* Protects page tables */
>  	int		gaw;		/* max guest address width */
>  	const struct pgtable_ops *ops;	/* page table ops */
> +	bool		first_lvl_5lp;	/* First level 5-level paging support */
>  
>  	/* adjusted guest address width, 0 is level 2 30-bit */
>  	int		agaw;
> @@ -708,6 +710,35 @@ int for_each_device_domain(int (*fn)(struct device_domain_info *info,
>  void iommu_flush_write_buffer(struct intel_iommu *iommu);
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
>  
> +#ifdef CONFIG_X86
> +int first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
> +			unsigned long end, phys_addr_t phys_addr, int dma_prot);
> +struct page *first_lvl_unmap_range(struct dmar_domain *domain,
> +				   unsigned long addr, unsigned long end);
> +phys_addr_t first_lvl_iova_to_phys(struct dmar_domain *domain,
> +				   unsigned long iova);
> +#else
> +static inline int
> +first_lvl_map_range(struct dmar_domain *domain, unsigned long addr,
> +		    unsigned long end, phys_addr_t phys_addr, int dma_prot)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline struct page *
> +first_lvl_unmap_range(struct dmar_domain *domain,
> +		      unsigned long addr, unsigned long end)
> +{
> +	return NULL;
> +}
> +
> +static inline phys_addr_t
> +first_lvl_iova_to_phys(struct dmar_domain *domain, unsigned long iova)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_X86 */
> +
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  extern void intel_svm_check(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h
> index 54e61d456cdf..e8c95290fd13 100644
> --- a/include/trace/events/intel_iommu.h
> +++ b/include/trace/events/intel_iommu.h
> @@ -99,6 +99,66 @@ DEFINE_EVENT(dma_unmap, bounce_unmap_single,
>  	TP_ARGS(dev, dev_addr, size)
>  );
>  
> +DECLARE_EVENT_CLASS(domain_map,
> +	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
> +		 unsigned long end, phys_addr_t phys_addr),
> +
> +	TP_ARGS(domain, addr, end, phys_addr),
> +
> +	TP_STRUCT__entry(
> +		__field(struct dmar_domain *, domain)
> +		__field(unsigned long, addr)
> +		__field(unsigned long, end)
> +		__field(phys_addr_t, phys_addr)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->domain = domain;
> +		__entry->addr = addr;
> +		__entry->end = end;
> +		__entry->phys_addr = phys_addr;
> +	),
> +
> +	TP_printk("domain=%p addr=0x%lx end=0x%lx phys_addr=0x%llx",
> +		  __entry->domain, __entry->addr, __entry->end,
> +		  (unsigned long long)__entry->phys_addr)
> +);
> +
> +DEFINE_EVENT(domain_map, domain_mm_map,
> +	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
> +		 unsigned long end, phys_addr_t phys_addr),
> +
> +	TP_ARGS(domain, addr, end, phys_addr)
> +);
> +
> +DECLARE_EVENT_CLASS(domain_unmap,
> +	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
> +		 unsigned long end),
> +
> +	TP_ARGS(domain, addr, end),
> +
> +	TP_STRUCT__entry(
> +		__field(struct dmar_domain *, domain)
> +		__field(unsigned long, addr)
> +		__field(unsigned long, end)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->domain = domain;
> +		__entry->addr = addr;
> +		__entry->end = end;
> +	),
> +
> +	TP_printk("domain=%p addr=0x%lx end=0x%lx",
> +		  __entry->domain, __entry->addr, __entry->end)
> +);
> +
> +DEFINE_EVENT(domain_unmap, domain_mm_unmap,
> +	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
> +		 unsigned long end),
> +
> +	TP_ARGS(domain, addr, end)
> +);
>  #endif /* _TRACE_INTEL_IOMMU_H */
>  
>  /* This part must be outside protection */

[Jacob Pan]


* Re: [PATCH v2 0/8] Use 1st-level for DMA remapping
  2019-12-02 20:19 ` [PATCH v2 0/8] Use 1st-level for DMA remapping Jacob Pan
@ 2019-12-03  2:19   ` Lu Baolu
  0 siblings, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-12-03  2:19 UTC (permalink / raw)
  To: Jacob Pan
  Cc: baolu.lu, Joerg Roedel, David Woodhouse, Alex Williamson,
	ashok.raj, sanjay.k.kumar, kevin.tian, yi.l.liu, yi.y.sun,
	Peter Xu, iommu, kvm, linux-kernel

Hi Jacob,

Thanks for reviewing it.

On 12/3/19 4:19 AM, Jacob Pan wrote:
> On Thu, 28 Nov 2019 10:25:42 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> Intel VT-d in scalable mode supports two types of page talbes
> tables

Got it, thanks!

>> for DMA translation: the first level page table and the second
>> level page table. The first level page table uses the same
>> format as the CPU page table, while the second level page table
>> keeps compatible with previous formats. The software is able
>> to choose any one of them for DMA remapping according to the use
>> case.
>>
>> This patchset aims to move IOVA (I/O Virtual Address) translation
> move guest IOVA only, right?

No. In v1, only guest IOVA was moved. This was changed in v2 following
the v1 review comments: v2 uses the first level for both host and guest
unless nested mode is in use.
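
The selection rule above can be sketched in C as follows. This is only a
minimal illustration, not code from the series: the enum, the function
name iova_pgtable_level() and its two boolean parameters are hypothetical
stand-ins for the driver's real state.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the driver's actual types. */
enum pgtable_level {
	PGTABLE_SECOND_LEVEL,
	PGTABLE_FIRST_LEVEL,
};

/*
 * Pick the page table that holds the IOVA mappings managed by the host
 * driver.
 *
 * scalable_mode: the IOMMU runs in scalable mode.
 * nesting:       DOMAIN_ATTR_NESTING has been set on the domain.
 */
static enum pgtable_level iova_pgtable_level(bool scalable_mode, bool nesting)
{
	/* The first-level format is only available in scalable mode. */
	if (!scalable_mode)
		return PGTABLE_SECOND_LEVEL;

	/* Nested mode: the host keeps its own mappings in the second level. */
	if (nesting)
		return PGTABLE_SECOND_LEVEL;

	/* v2 default for both bare metal and guest: first level. */
	return PGTABLE_FIRST_LEVEL;
}
```

With these assumptions, a scalable-mode domain without the nesting
attribute gets PGTABLE_FIRST_LEVEL, and every other combination falls
back to PGTABLE_SECOND_LEVEL.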

>> to 1st-level page table in scalable mode. This will simplify vIOMMU
>> (IOMMU simulated by VM hypervisor) design by using the two-stage
>> translation, a.k.a. nested mode translation.
>>
>> As Intel VT-d architecture offers caching mode, guest IOVA (GIOVA)
>> support is now implemented in a shadow page manner. The device
>> simulation software, like QEMU, has to figure out GIOVA->GPA mappings
>> and write them to a shadowed page table, which will be used by the
>> physical IOMMU. Each time when mappings are created or destroyed in
>> vIOMMU, the simulation software has to intervene. Hence, the changes
>> on GIOVA->GPA could be shadowed to host.
>>
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .--------------------.
>>       |           |IOTLB flush trap |        QEMU        |
>>       .-----------. (map/unmap)     |--------------------|
>>       |GIOVA->GPA |---------------->|    .------------.  |
>>       '-----------'                 |    | GIOVA->HPA |  |
>>       |           |                 |    '------------'  |
>>       '-----------'                 |                    |
>>                                     |                    |
>>                                     '--------------------'
>>                                                  |
>>              <------------------------------------
>>              |
>>              v VFIO/IOMMU API
>>        .-----------.
>>        |  pIOMMU   |
>>        |-----------|
>>        |           |
>>        .-----------.
>>        |GIOVA->HPA |
>>        '-----------'
>>        |           |
>>        '-----------'
>>
>> In VT-d 3.0, scalable mode is introduced, which offers two-level
>> translation page tables and nested translation mode. Regards to
>> GIOVA support, it can be simplified by 1) moving the GIOVA support
>> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
>> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
>> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
>> dual-stage) translation in host. Compared with current shadow GIOVA
>> support, the new approach makes the vIOMMU design simpler and more
>> efficient as we only need to flush the pIOMMU IOTLB and possible
>> device-IOTLB when an IOVA mapping in vIOMMU is torn down.
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .-----------.
>>       |           |IOTLB flush trap |   QEMU    |
>>       .-----------.    (unmap)      |-----------|
>>       |GIOVA->GPA |---------------->|           |
>>       '-----------'                 '-----------'
>>       |           |                       |
>>       '-----------'                       |
>>             <------------------------------
>>             |      VFIO/IOMMU
>>             |  cache invalidation and
>>             | guest gpd bind interfaces
>>             v
>>       .-----------.
>>       |  pIOMMU   |
>>       |-----------|
>>       .-----------.
>>       |GIOVA->GPA |<---First level
>>       '-----------'
>>       | GPA->HPA  |<---Second level
>>       '-----------'
>>       '-----------'
>>
>> This patch set includes two parts. The former part implements the
>> per-domain page table abstraction, which makes the page table
>> difference transparent to various map/unmap APIs. The later part
> s/later/latter/
>> applies the first level page table for IOVA translation unless the
>> DOMAIN_ATTR_NESTING domain attribution has been set, which indicates
>> nested mode in use.
>>
> Maybe I am reading this wrong, but shouldn't it be the opposite?
> i.e. Use FL page table for IOVA if it is a nesting domain?

My description seems to be a bit confusing. If DOMAIN_ATTR_NESTING is
set for a domain, the second level will be used to map gPA (guest
physical address) to hPA (host physical address), and the mappings
between gVA (guest virtual address) and gPA will be maintained by the
guest, with the guest page table address bound to the host's first
level. Otherwise, the first level will be used for mapping between gPA
and hPA, or between IOVA and DMA address.
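
As a rough illustration of the nested case described above, the toy model
below chains two lookups: a first-level table (gVA->gPA, owned by the
guest) followed by a second-level table (gPA->hPA, owned by the host).
Everything here is hypothetical: the flat toy_map arrays and the toy_*
names stand in for real multi-level page tables, and the model ignores
that the hardware also translates the first-level table pointers
themselves through the second level.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_MASK	((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NO_MAPPING	UINT64_MAX

/* One page-granular mapping: input page frame -> output page frame. */
struct toy_map {
	uint64_t in_pfn;
	uint64_t out_pfn;
};

/* Translate addr through one table, preserving the in-page offset. */
static uint64_t toy_lookup(const struct toy_map *tbl, size_t n, uint64_t addr)
{
	uint64_t pfn = addr >> TOY_PAGE_SHIFT;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].in_pfn == pfn)
			return (tbl[i].out_pfn << TOY_PAGE_SHIFT) |
			       (addr & TOY_PAGE_MASK);
	}

	return TOY_NO_MAPPING;
}

/* Nested translation: first level (gVA->gPA), then second level (gPA->hPA). */
static uint64_t toy_nested_translate(const struct toy_map *fl, size_t fl_n,
				     const struct toy_map *sl, size_t sl_n,
				     uint64_t gva)
{
	uint64_t gpa = toy_lookup(fl, fl_n, gva);

	if (gpa == TOY_NO_MAPPING)
		return TOY_NO_MAPPING;

	return toy_lookup(sl, sl_n, gpa);
}
```

A translation succeeds only if both stages have a mapping; a miss in
either table makes the whole lookup fail, which is why the host needs no
shadow copy of the guest's first-level table.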

Best regards,
baolu

> 
>> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
>> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
>> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
>> Based-on-idea-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
>> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
>>
>> Change log:
>>
>>   v1->v2
>>   - The first series was posted here
>>     https://lkml.org/lkml/2019/9/23/297
>>   - Use per domain page table ops to handle different page tables.
>>   - Use first level for DMA remapping by default on both bare metal
>>     and vm guest.
>>   - Code refine according to code review comments for v1.
>>
>> Lu Baolu (8):
>>    iommu/vt-d: Add per domain page table ops
>>    iommu/vt-d: Move domain_flush_cache helper into header
>>    iommu/vt-d: Implement second level page table ops
>>    iommu/vt-d: Apply per domain second level page table ops
>>    iommu/vt-d: Add first level page table interfaces
>>    iommu/vt-d: Implement first level page table ops
>>    iommu/vt-d: Identify domains using first level page table
>>    iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr
>>
>>   drivers/iommu/Makefile             |   2 +-
>>   drivers/iommu/intel-iommu.c        | 412 +++++++++++++++++++++++------
>>   drivers/iommu/intel-pgtable.c      | 376 ++++++++++++++++++++++++++
>>   include/linux/intel-iommu.h        |  64 ++++-
>>   include/trace/events/intel_iommu.h |  60 +++++
>>   5 files changed, 837 insertions(+), 77 deletions(-)
>>   create mode 100644 drivers/iommu/intel-pgtable.c
>>
> 
> [Jacob Pan]
> 


* Re: [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
  2019-12-02 23:27   ` Jacob Pan
@ 2019-12-03  2:36     ` Lu Baolu
  2019-12-11  1:56     ` Lu Baolu
  1 sibling, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-12-03  2:36 UTC (permalink / raw)
  To: Jacob Pan
  Cc: baolu.lu, Joerg Roedel, David Woodhouse, Alex Williamson,
	ashok.raj, sanjay.k.kumar, kevin.tian, yi.l.liu, yi.y.sun,
	Peter Xu, iommu, kvm, linux-kernel, Yi Sun

Hi,

On 12/3/19 7:27 AM, Jacob Pan wrote:
> On Thu, 28 Nov 2019 10:25:47 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> This adds functions to manipulate first level page tables
>> which could be used by a scalable mode capable IOMMU unit.
>>
> FL and SL page tables are very similar, and I presume we are not using
> all the flag bits in FL paging structures for DMA mapping. Are there
> enough relevant differences to warrant a new set of helper functions
> for FL? Or we can merge into one.
> 

I have thought about this before, and I am still open to this
suggestion.

We did a quick comparison of the two page table formats. The only
concern is the read/write/present encoding: the present bit in the first
level implies read permission, while the second level page table has an
explicit READ bit. (This is recalled from memory; correct me if I'm
wrong. :-))
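
To make the encoding difference concrete, here is a small C sketch of
how second-level style permissions could be folded into first-level
bits. It is an assumption-laden illustration, not code from the series:
the SL_*/FL_* macro names and sl_prot_to_fl() are hypothetical, and the
bit positions (READ = bit 0, WRITE = bit 1 for the second level; Present
= bit 0, R/W = bit 1 for the first level) should be checked against the
VT-d specification before relying on them.

```c
#include <stdint.h>

/* Second-level permission bits, mirroring DMA_PTE_READ/DMA_PTE_WRITE. */
#define SL_PTE_READ	(UINT64_C(1) << 0)
#define SL_PTE_WRITE	(UINT64_C(1) << 1)

/* First-level bits in the CPU page table format (hypothetical names). */
#define FL_PTE_PRESENT	(UINT64_C(1) << 0)
#define FL_PTE_RW	(UINT64_C(1) << 1)

/*
 * Convert second-level style permissions into first-level bits.
 * The first level has no write-only encoding: any access needs the
 * present bit, and present already implies read permission.
 */
static uint64_t sl_prot_to_fl(uint64_t sl_prot)
{
	uint64_t fl = 0;

	if (sl_prot & (SL_PTE_READ | SL_PTE_WRITE))
		fl |= FL_PTE_PRESENT;
	if (sl_prot & SL_PTE_WRITE)
		fl |= FL_PTE_RW;

	return fl;
}
```

Note that a write-only request becomes present plus R/W, so it also
grants read access. This matches the comment in first_lvl_map_range()
about there being no PAGE_KERNEL_WO.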

Anyway, let's listen to more opinions.

Best regards,
baolu


* Re: [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
  2019-12-02 23:27   ` Jacob Pan
  2019-12-03  2:36     ` Lu Baolu
@ 2019-12-11  1:56     ` Lu Baolu
  1 sibling, 0 replies; 14+ messages in thread
From: Lu Baolu @ 2019-12-11  1:56 UTC (permalink / raw)
  To: Jacob Pan
  Cc: baolu.lu, Joerg Roedel, David Woodhouse, Alex Williamson,
	ashok.raj, sanjay.k.kumar, kevin.tian, yi.l.liu, yi.y.sun,
	Peter Xu, iommu, kvm, linux-kernel, Yi Sun

Hi Jacob,

On 12/3/19 7:27 AM, Jacob Pan wrote:
> On Thu, 28 Nov 2019 10:25:47 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> This adds functions to manipulate first level page tables
>> which could be used by a scalable mode capable IOMMU unit.
>>
> FL and SL page tables are very similar, and I presume we are not using
> all the flag bits in FL paging structures for DMA mapping. Are there
> enough relevant differences to warrant a new set of helper functions
> for FL? Or we can merge into one.

I have evaluated your suggestion over the past few days. It turns out
that it makes the code simpler and easier to maintain. Thank you for the
comment; I will send out a new version for review soon.

Best regards,
baolu



Thread overview: 14+ messages
2019-11-28  2:25 [PATCH v2 0/8] Use 1st-level for DMA remapping Lu Baolu
2019-11-28  2:25 ` [PATCH v2 1/8] iommu/vt-d: Add per domain page table ops Lu Baolu
2019-11-28  2:25 ` [PATCH v2 2/8] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
2019-11-28  2:25 ` [PATCH v2 3/8] iommu/vt-d: Implement second level page table ops Lu Baolu
2019-11-28  2:25 ` [PATCH v2 4/8] iommu/vt-d: Apply per domain " Lu Baolu
2019-11-28  2:25 ` [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces Lu Baolu
2019-12-02 23:27   ` Jacob Pan
2019-12-03  2:36     ` Lu Baolu
2019-12-11  1:56     ` Lu Baolu
2019-11-28  2:25 ` [PATCH v2 6/8] iommu/vt-d: Implement first level page table ops Lu Baolu
2019-11-28  2:25 ` [PATCH v2 7/8] iommu/vt-d: Identify domains using first level page table Lu Baolu
2019-11-28  2:25 ` [PATCH v2 8/8] iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr Lu Baolu
2019-12-02 20:19 ` [PATCH v2 0/8] Use 1st-level for DMA remapping Jacob Pan
2019-12-03  2:19   ` Lu Baolu
