IOMMU Archive on lore.kernel.org
* [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
@ 2019-09-23 12:24 Lu Baolu
  2019-09-23 12:24 ` [RFC PATCH 1/4] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-23 12:24 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: kevin.tian, ashok.raj, kvm, sanjay.k.kumar, iommu, linux-kernel,
	yi.y.sun

This patchset aims to move IOVA (I/O Virtual Address) translation
to the first-level page table under scalable mode. The major purpose
of this effort is to make guest IOVA support more efficient.

Because the Intel VT-d architecture offers caching mode, guest IOVA
(GIOVA) support is currently implemented in a shadow-page manner.
Device simulation software, such as QEMU, has to figure out the
GIOVA->GPA mapping and write it to a shadow page table, which is then
used by the pIOMMU. The simulation software intervenes each time
mappings are created or destroyed in the vIOMMU: changes to the
GIOVA->GPA mapping are shadowed to the host, and the pIOMMU is
updated via the VFIO/IOMMU interfaces.


     .-----------.
     |  vIOMMU   |
     |-----------|                 .--------------------.
     |           |IOTLB flush trap |        QEMU        |
     .-----------. (map/unmap)     |--------------------|
     | GVA->GPA  |---------------->|      .----------.  |
     '-----------'                 |      | GPA->HPA |  |
     |           |                 |      '----------'  |
     '-----------'                 |                    |
                                   |                    |
                                   '--------------------'
                                                |
            <------------------------------------
            |
            v VFIO/IOMMU API
      .-----------.
      |  pIOMMU   |
      |-----------|
      |           |
      .-----------.
      | GVA->HPA  |
      '-----------'
      |           |
      '-----------'

VT-d 3.0 introduces scalable mode, which offers two levels of
translation page tables and a nested translation mode. With regard
to GIOVA support, it can be simplified by 1) moving GIOVA support
over to the first-level page table to store the GIOVA->GPA mapping
in the vIOMMU, 2) binding the vIOMMU first-level page table to the
pIOMMU, 3) using the pIOMMU second level for GPA->HPA translation,
and 4) enabling nested (a.k.a. dual-stage) translation in the host.
Compared with the current shadow GIOVA support, the new approach is
more secure, and the software is simplified because we only need to
flush the pIOMMU IOTLB and possibly the device-IOTLB when an IOVA
mapping in the vIOMMU is torn down.

     .-----------.
     |  vIOMMU   |
     |-----------|                 .-----------.
     |           |IOTLB flush trap |   QEMU    |
     .-----------.    (unmap)      |-----------|
     | GVA->GPA  |---------------->|           |
     '-----------'                 '-----------'
     |           |                       |
     '-----------'                       |
           <------------------------------
           |      VFIO/IOMMU          
           |  cache invalidation and  
            | guest pgd bind interfaces
           v
     .-----------.
     |  pIOMMU   |
     |-----------|
     .-----------.
     | GVA->GPA  |<---First level
     '-----------'
     | GPA->HPA  |<---Second level
     '-----------'
     '-----------'

This patch series aims only to achieve the first goal, i.e. using
first-level translation for IOVA mappings in the vIOMMU. I am sending
it out for your comments. Any comments, suggestions, and concerns are
welcome.

Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>

Lu Baolu (4):
  iommu/vt-d: Move domain_flush_cache helper into header
  iommu/vt-d: Add first level page table interfaces
  iommu/vt-d: Map/unmap domain with mmmap/mmunmap
  iommu/vt-d: Identify domains using first level page table

 drivers/iommu/Makefile             |   2 +-
 drivers/iommu/intel-iommu.c        | 142 ++++++++++--
 drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
 include/linux/intel-iommu.h        |  31 ++-
 include/trace/events/intel_iommu.h |  60 +++++
 5 files changed, 553 insertions(+), 24 deletions(-)
 create mode 100644 drivers/iommu/intel-pgtable.c

-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 1/4] iommu/vt-d: Move domain_flush_cache helper into header
  2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
@ 2019-09-23 12:24 ` Lu Baolu
  2019-09-23 12:24 ` [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces Lu Baolu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-23 12:24 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, yi.y.sun

Move the domain_flush_cache() helper into the header file so that
it can be used by other source files as well.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 7 -------
 include/linux/intel-iommu.h | 7 +++++++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 5aa68a094efd..9cfe8098d993 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -828,13 +828,6 @@ static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devf
 	return iommu;
 }
 
-static void domain_flush_cache(struct dmar_domain *domain,
-			       void *addr, int size)
-{
-	if (!domain->iommu_coherency)
-		clflush_cache_range(addr, size);
-}
-
 static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
 	struct context_entry *context;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ed11ef594378..3ee694d4f361 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -629,6 +629,13 @@ static inline int first_pte_in_page(struct dma_pte *pte)
 	return !((unsigned long)pte & ~VTD_PAGE_MASK);
 }
 
+static inline void
+domain_flush_cache(struct dmar_domain *domain, void *addr, int size)
+{
+	if (!domain->iommu_coherency)
+		clflush_cache_range(addr, size);
+}
+
 extern struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev);
 extern int dmar_find_matched_atsr_unit(struct pci_dev *dev);
 
-- 
2.17.1


* [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
  2019-09-23 12:24 ` [RFC PATCH 1/4] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
@ 2019-09-23 12:24 ` Lu Baolu
  2019-09-23 20:31   ` Raj, Ashok
  2019-09-25  5:21   ` Peter Xu
  2019-09-23 12:24 ` [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap Lu Baolu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-23 12:24 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, yi.y.sun

This adds functions to manipulate first-level page tables,
which could be used by a scalable-mode-capable IOMMU unit.

intel_mmmap_range(domain, addr, end, phys_addr, prot)
 - Map an iova range of [addr, end) to the physical memory
   starting at @phys_addr with the @prot permissions.

intel_mmunmap_range(domain, addr, end)
 - Tear down the mapping of an iova range [addr, end). A list
   of pages is returned, which should be freed after iotlb
   flushing.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/Makefile             |   2 +-
 drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
 include/linux/intel-iommu.h        |  24 +-
 include/trace/events/intel_iommu.h |  60 +++++
 4 files changed, 426 insertions(+), 2 deletions(-)
 create mode 100644 drivers/iommu/intel-pgtable.c

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 4f405f926e73..dc550e14cc58 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
 obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
 obj-$(CONFIG_DMAR_TABLE) += dmar.o
 obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
-obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
+obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
 obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
 obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
 obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
new file mode 100644
index 000000000000..8e95978cd381
--- /dev/null
+++ b/drivers/iommu/intel-pgtable.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * intel-pgtable.c - Intel IOMMU page table manipulation library
+ *
+ * Copyright (C) 2019 Intel Corporation
+ *
+ * Author: Lu Baolu <baolu.lu@linux.intel.com>
+ */
+
+#define pr_fmt(fmt)     "DMAR: " fmt
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/io.h>
+#include <linux/export.h>
+#include <linux/intel-iommu.h>
+#include <asm/cacheflush.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <trace/events/intel_iommu.h>
+
+#ifdef CONFIG_X86
+/*
+ * mmmap: Map a range of IO virtual address to physical addresses.
+ */
+#define pgtable_populate(domain, nm)					\
+do {									\
+	void *__new = alloc_pgtable_page(domain->nid);			\
+	if (!__new)							\
+		return -ENOMEM;						\
+	smp_wmb();							\
+	spin_lock(&(domain)->page_table_lock);				\
+	if (nm ## _present(*nm)) {					\
+		free_pgtable_page(__new);				\
+	} else {							\
+		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
+		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
+	}								\
+	spin_unlock(&(domain)->page_table_lock);			\
+} while (0)
+
+static int
+mmmap_pte_range(struct dmar_domain *domain, pmd_t *pmd, unsigned long addr,
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	pte_t *pte, *first_pte;
+	u64 pfn;
+
+	pfn = phys_addr >> PAGE_SHIFT;
+	if (unlikely(pmd_none(*pmd)))
+		pgtable_populate(domain, pmd);
+
+	first_pte = pte = pte_offset_kernel(pmd, addr);
+
+	do {
+		set_pte(pte, pfn_pte(pfn, prot));
+		pfn++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
+
+	return 0;
+}
+
+static int
+mmmap_pmd_range(struct dmar_domain *domain, pud_t *pud, unsigned long addr,
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	pmd_t *pmd;
+
+	if (unlikely(pud_none(*pud)))
+		pgtable_populate(domain, pud);
+	pmd = pmd_offset(pud, addr);
+
+	phys_addr -= addr;
+	do {
+		next = pmd_addr_end(addr, end);
+		if (mmmap_pte_range(domain, pmd, addr, next,
+				    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pmd++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int
+mmmap_pud_range(struct dmar_domain *domain, p4d_t *p4d, unsigned long addr,
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	pud_t *pud;
+
+	if (unlikely(p4d_none(*p4d)))
+		pgtable_populate(domain, p4d);
+
+	pud = pud_offset(p4d, addr);
+
+	phys_addr -= addr;
+	do {
+		next = pud_addr_end(addr, end);
+		if (mmmap_pmd_range(domain, pud, addr, next,
+				    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pud++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int
+mmmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd, unsigned long addr,
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned long next;
+	p4d_t *p4d;
+
+	if (cpu_feature_enabled(X86_FEATURE_LA57) && unlikely(pgd_none(*pgd)))
+		pgtable_populate(domain, pgd);
+
+	p4d = p4d_offset(pgd, addr);
+
+	phys_addr -= addr;
+	do {
+		next = p4d_addr_end(addr, end);
+		if (mmmap_pud_range(domain, p4d, addr, next,
+				    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (p4d++, addr = next, addr != end);
+
+	return 0;
+}
+
+int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
+		      unsigned long end, phys_addr_t phys_addr, int dma_prot)
+{
+	unsigned long next;
+	pgprot_t prot;
+	pgd_t *pgd;
+
+	trace_domain_mm_map(domain, addr, end, phys_addr);
+
+	/*
+	 * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
+	 * for a pte that requires write operation.
+	 */
+	prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
+	BUG_ON(addr >= end);
+
+	phys_addr -= addr;
+	pgd = pgd_offset_pgd(domain->pgd, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (mmmap_p4d_range(domain, pgd, addr, next,
+				    phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (pgd++, addr = next, addr != end);
+
+	return 0;
+}
+
+/*
+ * mmunmap: Unmap an existing mapping between a range of IO virtual
+ *          addresses and physical addresses.
+ */
+static struct page *
+mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
+		  unsigned long addr, unsigned long end,
+		  struct page *freelist, bool reclaim)
+{
+	int i;
+	unsigned long start;
+	pte_t *pte, *first_pte;
+
+	start = addr;
+	pte = pte_offset_kernel(pmd, addr);
+	first_pte = pte;
+	do {
+		set_pte(pte, __pte(0));
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
+
+	/* Add page to free list if all entries are empty. */
+	if (reclaim) {
+		struct page *pte_page;
+
+		pte = (pte_t *)pmd_page_vaddr(*pmd);
+		for (i = 0; i < PTRS_PER_PTE; i++)
+			if (!pte || !pte_none(pte[i]))
+				goto pte_out;
+
+		pte_page = pmd_page(*pmd);
+		pte_page->freelist = freelist;
+		freelist = pte_page;
+		pmd_clear(pmd);
+		domain_flush_cache(domain, pmd, sizeof(pmd_t));
+	}
+
+pte_out:
+	return freelist;
+}
+
+static struct page *
+mmunmap_pmd_range(struct dmar_domain *domain, pud_t *pud,
+		  unsigned long addr, unsigned long end,
+		  struct page *freelist, bool reclaim)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long start, next;
+
+	start = addr;
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		freelist = mmunmap_pte_range(domain, pmd, addr, next,
+					     freelist, reclaim);
+	} while (pmd++, addr = next, addr != end);
+
+	/* Add page to free list if all entries are empty. */
+	if (reclaim) {
+		struct page *pmd_page;
+
+		pmd = (pmd_t *)pud_page_vaddr(*pud);
+		for (i = 0; i < PTRS_PER_PMD; i++)
+			if (!pmd || !pmd_none(pmd[i]))
+				goto pmd_out;
+
+		pmd_page = pud_page(*pud);
+		pmd_page->freelist = freelist;
+		freelist = pmd_page;
+		pud_clear(pud);
+		domain_flush_cache(domain, pud, sizeof(pud_t));
+	}
+
+pmd_out:
+	return freelist;
+}
+
+static struct page *
+mmunmap_pud_range(struct dmar_domain *domain, p4d_t *p4d,
+		  unsigned long addr, unsigned long end,
+		  struct page *freelist, bool reclaim)
+{
+	int i;
+	pud_t *pud;
+	unsigned long start, next;
+
+	start = addr;
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		freelist = mmunmap_pmd_range(domain, pud, addr, next,
+					     freelist, reclaim);
+	} while (pud++, addr = next, addr != end);
+
+	/* Add page to free list if all entries are empty. */
+	if (reclaim) {
+		struct page *pud_page;
+
+		pud = (pud_t *)p4d_page_vaddr(*p4d);
+		for (i = 0; i < PTRS_PER_PUD; i++)
+			if (!pud || !pud_none(pud[i]))
+				goto pud_out;
+
+		pud_page = p4d_page(*p4d);
+		pud_page->freelist = freelist;
+		freelist = pud_page;
+		p4d_clear(p4d);
+		domain_flush_cache(domain, p4d, sizeof(p4d_t));
+	}
+
+pud_out:
+	return freelist;
+}
+
+static struct page *
+mmunmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd,
+		  unsigned long addr, unsigned long end,
+		  struct page *freelist, bool reclaim)
+{
+	p4d_t *p4d;
+	unsigned long start, next;
+
+	start = addr;
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		freelist = mmunmap_pud_range(domain, p4d, addr, next,
+					     freelist, reclaim);
+	} while (p4d++, addr = next, addr != end);
+
+	/* Add page to free list if all entries are empty. */
+	if (cpu_feature_enabled(X86_FEATURE_LA57) && reclaim) {
+		struct page *p4d_page;
+		int i;
+
+		p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+		for (i = 0; i < PTRS_PER_P4D; i++)
+			if (!p4d || !p4d_none(p4d[i]))
+				goto p4d_out;
+
+		p4d_page = pgd_page(*pgd);
+		p4d_page->freelist = freelist;
+		freelist = p4d_page;
+		pgd_clear(pgd);
+		domain_flush_cache(domain, pgd, sizeof(pgd_t));
+	}
+
+p4d_out:
+	return freelist;
+}
+
+struct page *
+intel_mmunmap_range(struct dmar_domain *domain,
+		    unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	struct page *freelist = NULL;
+
+	trace_domain_mm_unmap(domain, addr, end);
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset_pgd(domain->pgd, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		freelist = mmunmap_p4d_range(domain, pgd, addr, next,
+					     freelist, !addr);
+	} while (pgd++, addr = next, addr != end);
+
+	return freelist;
+}
+#endif /* CONFIG_X86 */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3ee694d4f361..044a91fa5431 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -489,7 +489,8 @@ struct dmar_domain {
 	struct list_head auxd;		/* link to device's auxiliary list */
 	struct iova_domain iovad;	/* iova's that belong to this domain */
 
-	struct dma_pte	*pgd;		/* virtual address */
+	void		*pgd;		/* virtual address */
+	spinlock_t page_table_lock;	/* Protects page tables */
 	int		gaw;		/* max guest address width */
 
 	/* adjusted guest address width, 0 is level 2 30-bit */
@@ -662,6 +663,27 @@ int for_each_device_domain(int (*fn)(struct device_domain_info *info,
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
 
+#ifdef CONFIG_X86
+int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
+		      unsigned long end, phys_addr_t phys_addr, int dma_prot);
+struct page *intel_mmunmap_range(struct dmar_domain *domain,
+				 unsigned long addr, unsigned long end);
+#else
+static inline int
+intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
+		  unsigned long end, phys_addr_t phys_addr, int dma_prot)
+{
+	return -ENODEV;
+}
+
+static inline struct page *
+intel_mmunmap_range(struct dmar_domain *domain,
+		    unsigned long addr, unsigned long end)
+{
+	return NULL;
+}
+#endif
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 int intel_svm_init(struct intel_iommu *iommu);
 extern int intel_svm_enable_prq(struct intel_iommu *iommu);
diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h
index 54e61d456cdf..e8c95290fd13 100644
--- a/include/trace/events/intel_iommu.h
+++ b/include/trace/events/intel_iommu.h
@@ -99,6 +99,66 @@ DEFINE_EVENT(dma_unmap, bounce_unmap_single,
 	TP_ARGS(dev, dev_addr, size)
 );
 
+DECLARE_EVENT_CLASS(domain_map,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end, phys_addr_t phys_addr),
+
+	TP_ARGS(domain, addr, end, phys_addr),
+
+	TP_STRUCT__entry(
+		__field(struct dmar_domain *, domain)
+		__field(unsigned long, addr)
+		__field(unsigned long, end)
+		__field(phys_addr_t, phys_addr)
+	),
+
+	TP_fast_assign(
+		__entry->domain = domain;
+		__entry->addr = addr;
+		__entry->end = end;
+		__entry->phys_addr = phys_addr;
+	),
+
+	TP_printk("domain=%p addr=0x%lx end=0x%lx phys_addr=0x%llx",
+		  __entry->domain, __entry->addr, __entry->end,
+		  (unsigned long long)__entry->phys_addr)
+);
+
+DEFINE_EVENT(domain_map, domain_mm_map,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end, phys_addr_t phys_addr),
+
+	TP_ARGS(domain, addr, end, phys_addr)
+);
+
+DECLARE_EVENT_CLASS(domain_unmap,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end),
+
+	TP_ARGS(domain, addr, end),
+
+	TP_STRUCT__entry(
+		__field(struct dmar_domain *, domain)
+		__field(unsigned long, addr)
+		__field(unsigned long, end)
+	),
+
+	TP_fast_assign(
+		__entry->domain = domain;
+		__entry->addr = addr;
+		__entry->end = end;
+	),
+
+	TP_printk("domain=%p addr=0x%lx end=0x%lx",
+		  __entry->domain, __entry->addr, __entry->end)
+);
+
+DEFINE_EVENT(domain_unmap, domain_mm_unmap,
+	TP_PROTO(struct dmar_domain *domain, unsigned long addr,
+		 unsigned long end),
+
+	TP_ARGS(domain, addr, end)
+);
 #endif /* _TRACE_INTEL_IOMMU_H */
 
 /* This part must be outside protection */
-- 
2.17.1


* [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap
  2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
  2019-09-23 12:24 ` [RFC PATCH 1/4] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
  2019-09-23 12:24 ` [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces Lu Baolu
@ 2019-09-23 12:24 ` Lu Baolu
  2019-09-25  5:00   ` Tian, Kevin
  2019-09-23 12:24 ` [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table Lu Baolu
  2019-09-23 19:27 ` [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Jacob Pan
  4 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-23 12:24 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, yi.y.sun

If a dmar domain has the DOMAIN_FLAG_FIRST_LEVEL_TRANS bit set
in its flags, the IOMMU will use the first-level page table for
translation. Hence, we need to map or unmap addresses in the
first-level page table.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 94 ++++++++++++++++++++++++++++++++-----
 1 file changed, 82 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 9cfe8098d993..103480016010 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -168,6 +168,11 @@ static inline unsigned long virt_to_dma_pfn(void *p)
 	return page_to_dma_pfn(virt_to_page(p));
 }
 
+static inline unsigned long dma_pfn_to_addr(unsigned long pfn)
+{
+	return pfn << VTD_PAGE_SHIFT;
+}
+
 /* global iommu list, set NULL for ignored DMAR units */
 static struct intel_iommu **g_iommus;
 
@@ -307,6 +312,9 @@ static int hw_pass_through = 1;
  */
 #define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
 
+/* Domain uses first level translation for DMA remapping. */
+#define DOMAIN_FLAG_FIRST_LEVEL_TRANS		BIT(2)
+
 #define for_each_domain_iommu(idx, domain)			\
 	for (idx = 0; idx < g_num_of_iommus; idx++)		\
 		if (domain->iommu_refcnt[idx])
@@ -552,6 +560,11 @@ static inline int domain_type_is_si(struct dmar_domain *domain)
 	return domain->flags & DOMAIN_FLAG_STATIC_IDENTITY;
 }
 
+static inline int domain_type_is_flt(struct dmar_domain *domain)
+{
+	return domain->flags & DOMAIN_FLAG_FIRST_LEVEL_TRANS;
+}
+
 static inline int domain_pfn_supported(struct dmar_domain *domain,
 				       unsigned long pfn)
 {
@@ -1147,8 +1160,15 @@ static struct page *domain_unmap(struct dmar_domain *domain,
 	BUG_ON(start_pfn > last_pfn);
 
 	/* we don't need lock here; nobody else touches the iova range */
-	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-				       domain->pgd, 0, start_pfn, last_pfn, NULL);
+	if (domain_type_is_flt(domain))
+		freelist = intel_mmunmap_range(domain,
+					       dma_pfn_to_addr(start_pfn),
+					       dma_pfn_to_addr(last_pfn + 1));
+	else
+		freelist = dma_pte_clear_level(domain,
+					       agaw_to_level(domain->agaw),
+					       domain->pgd, 0, start_pfn,
+					       last_pfn, NULL);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2213,9 +2233,10 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
 	return level;
 }
 
-static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
-			    struct scatterlist *sg, unsigned long phys_pfn,
-			    unsigned long nr_pages, int prot)
+static int
+__domain_mapping_dma(struct dmar_domain *domain, unsigned long iov_pfn,
+		     struct scatterlist *sg, unsigned long phys_pfn,
+		     unsigned long nr_pages, int prot)
 {
 	struct dma_pte *first_pte = NULL, *pte = NULL;
 	phys_addr_t uninitialized_var(pteval);
@@ -2223,13 +2244,6 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	unsigned int largepage_lvl = 0;
 	unsigned long lvl_pages = 0;
 
-	BUG_ON(!domain_pfn_supported(domain, iov_pfn + nr_pages - 1));
-
-	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
-		return -EINVAL;
-
-	prot &= DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP;
-
 	if (!sg) {
 		sg_res = nr_pages;
 		pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | prot;
@@ -2328,6 +2342,62 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	return 0;
 }
 
+static int
+__domain_mapping_mm(struct dmar_domain *domain, unsigned long iov_pfn,
+		    struct scatterlist *sg, unsigned long phys_pfn,
+		    unsigned long nr_pages, int prot)
+{
+	int ret = 0;
+
+	if (!sg)
+		return intel_mmmap_range(domain, dma_pfn_to_addr(iov_pfn),
+					 dma_pfn_to_addr(iov_pfn + nr_pages),
+					 dma_pfn_to_addr(phys_pfn), prot);
+
+	while (nr_pages > 0) {
+		unsigned long sg_pages, phys;
+		unsigned long pgoff = sg->offset & ~PAGE_MASK;
+
+		sg_pages = aligned_nrpages(sg->offset, sg->length);
+		phys = sg_phys(sg) - pgoff;
+
+		ret = intel_mmmap_range(domain, dma_pfn_to_addr(iov_pfn),
+					dma_pfn_to_addr(iov_pfn + sg_pages),
+					phys, prot);
+		if (ret)
+			break;
+
+		sg->dma_address = ((dma_addr_t)dma_pfn_to_addr(iov_pfn)) + pgoff;
+		sg->dma_length = sg->length;
+
+		nr_pages -= sg_pages;
+		iov_pfn += sg_pages;
+		sg = sg_next(sg);
+	}
+
+	return ret;
+}
+
+static int
+__domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
+		 struct scatterlist *sg, unsigned long phys_pfn,
+		 unsigned long nr_pages, int prot)
+{
+	BUG_ON(!domain_pfn_supported(domain, iov_pfn + nr_pages - 1));
+
+	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
+		return -EINVAL;
+
+	prot &= DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP;
+
+	if (domain_type_is_flt(domain))
+		return __domain_mapping_mm(domain, iov_pfn, sg,
+					   phys_pfn, nr_pages, prot);
+	else
+		return __domain_mapping_dma(domain, iov_pfn, sg,
+					    phys_pfn, nr_pages, prot);
+}
+
 static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			  struct scatterlist *sg, unsigned long phys_pfn,
 			  unsigned long nr_pages, int prot)
-- 
2.17.1


* [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table
  2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
                   ` (2 preceding siblings ...)
  2019-09-23 12:24 ` [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap Lu Baolu
@ 2019-09-23 12:24 ` Lu Baolu
  2019-09-25  6:50   ` Peter Xu
  2019-09-23 19:27 ` [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Jacob Pan
  4 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-23 12:24 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, yi.y.sun

This checks whether a domain should use the first-level page
table for map/unmap, and if so, attaches the domain to the
device in first-level translation mode.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 41 ++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 103480016010..d539e6a6c3dd 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1722,6 +1722,26 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
 #endif
 }
 
+/*
+ * Check and return whether first level is used by default for
+ * DMA translation.
+ */
+static bool first_level_by_default(void)
+{
+	struct dmar_drhd_unit *drhd;
+	struct intel_iommu *iommu;
+	bool ret = true;
+
+	rcu_read_lock();
+	for_each_active_iommu(iommu, drhd)
+		ret = ret && sm_supported(iommu) &&
+		      ecap_flts(iommu->ecap) &&
+		      cap_caching_mode(iommu->cap);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static struct dmar_domain *alloc_domain(int flags)
 {
 	struct dmar_domain *domain;
@@ -1736,6 +1756,9 @@ static struct dmar_domain *alloc_domain(int flags)
 	domain->has_iotlb_device = false;
 	INIT_LIST_HEAD(&domain->devices);
 
+	if (first_level_by_default())
+		domain->flags |= DOMAIN_FLAG_FIRST_LEVEL_TRANS;
+
 	return domain;
 }
 
@@ -2625,6 +2648,11 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 		if (hw_pass_through && domain_type_is_si(domain))
 			ret = intel_pasid_setup_pass_through(iommu, domain,
 					dev, PASID_RID2PASID);
+		else if (domain_type_is_flt(domain))
+			ret = intel_pasid_setup_first_level(iommu, dev,
+					domain->pgd, PASID_RID2PASID,
+					domain->iommu_did[iommu->seq_id],
+					PASID_FLAG_SUPERVISOR_MODE);
 		else
 			ret = intel_pasid_setup_second_level(iommu, domain,
 					dev, PASID_RID2PASID);
@@ -5349,8 +5377,14 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 		goto attach_failed;
 
 	/* Setup the PASID entry for mediated devices: */
-	ret = intel_pasid_setup_second_level(iommu, domain, dev,
-					     domain->default_pasid);
+	if (domain_type_is_flt(domain))
+		ret = intel_pasid_setup_first_level(iommu, dev,
+				domain->pgd, domain->default_pasid,
+				domain->iommu_did[iommu->seq_id],
+				PASID_FLAG_SUPERVISOR_MODE);
+	else
+		ret = intel_pasid_setup_second_level(iommu, domain, dev,
+						     domain->default_pasid);
 	if (ret)
 		goto table_failed;
 	spin_unlock(&iommu->lock);
@@ -5583,7 +5617,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 	int level = 0;
 	u64 phys = 0;
 
-	if (dmar_domain->flags & DOMAIN_FLAG_LOSE_CHILDREN)
+	if ((dmar_domain->flags & DOMAIN_FLAG_LOSE_CHILDREN) ||
+	    (dmar_domain->flags & DOMAIN_FLAG_FIRST_LEVEL_TRANS))
 		return 0;
 
 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
                   ` (3 preceding siblings ...)
  2019-09-23 12:24 ` [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table Lu Baolu
@ 2019-09-23 19:27 ` Jacob Pan
  2019-09-23 20:25   ` Raj, Ashok
  2019-09-24  4:27   ` Lu Baolu
  4 siblings, 2 replies; 38+ messages in thread
From: Jacob Pan @ 2019-09-23 19:27 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, ashok.raj, kvm, sanjay.k.kumar, yi.y.sun, iommu,
	linux-kernel, Alex Williamson, David Woodhouse

Hi Baolu,

On Mon, 23 Sep 2019 20:24:50 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> This patchset aims to move IOVA (I/O Virtual Address) translation
> to 1st-level page table under scalable mode. The major purpose of
> this effort is to make guest IOVA support more efficient.
> 
> As Intel VT-d architecture offers caching-mode, guest IOVA (GIOVA)
> support is now implemented in a shadow page manner. The device
> simulation software, like QEMU, has to figure out GIOVA->GPA mapping
> and writes to a shadowed page table, which will be used by pIOMMU.
> Each time when mappings are created or destroyed in vIOMMU, the
> simulation software will intervene. The change on GIOVA->GPA will be
> shadowed to host, and the pIOMMU will be updated via VFIO/IOMMU
> interfaces.
> 
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .--------------------.
>      |           |IOTLB flush trap |        QEMU        |
>      .-----------. (map/unmap)     |--------------------|
>      | GVA->GPA  |---------------->|      .----------.  |
>      '-----------'                 |      | GPA->HPA |  |
>      |           |                 |      '----------'  |
>      '-----------'                 |                    |
>                                    |                    |
>                                    '--------------------'
>                                                 |
>             <------------------------------------
>             |
>             v VFIO/IOMMU API
>       .-----------.
>       |  pIOMMU   |
>       |-----------|
>       |           |
>       .-----------.
>       | GVA->HPA  |
>       '-----------'
>       |           |
>       '-----------'
> 
> In VT-d 3.0, scalable mode is introduced, which offers two level
> translation page tables and nested translation mode. Regards to
> GIOVA support, it can be simplified by 1) moving the GIOVA support
> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> dual stage) translation in host. Compared with current shadow GIOVA
> support, the new approach is more secure and software is simplified
> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> when an IOVA mapping in vIOMMU is torn down.
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .-----------.
>      |           |IOTLB flush trap |   QEMU    |
>      .-----------.    (unmap)      |-----------|
>      | GVA->GPA  |---------------->|           |
>      '-----------'                 '-----------'
>      |           |                       |
>      '-----------'                       |
>            <------------------------------
>            |      VFIO/IOMMU          
>            |  cache invalidation and  
>            | guest gpd bind interfaces
>            v
For vSVA, the guest PGD bind interface will mark the PASID as a guest
PASID and will inject page requests into the guest. In the FL gIOVA
case, I guess we are assuming there is no page fault for GIOVA. I will
need to add a flag to the gpgd bind such that any PRS will be
auto-responded with an invalid response.

Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
and DMA API for native usage will continue to be SL only?
>      .-----------.
>      |  pIOMMU   |
>      |-----------|
>      .-----------.
>      | GVA->GPA  |<---First level
>      '-----------'
>      | GPA->HPA  |<---Scond level
>      '-----------'
>      '-----------'
> 
> This patch series only aims to achieve the first goal, a.k.a using
> first level translation for IOVA mappings in vIOMMU. I am sending
> it out for your comments. Any comments, suggestions and concerns are
> welcomed.
> 


> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> 
> Lu Baolu (4):
>   iommu/vt-d: Move domain_flush_cache helper into header
>   iommu/vt-d: Add first level page table interfaces
>   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
>   iommu/vt-d: Identify domains using first level page table
> 
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
>  drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h        |  31 ++-
>  include/trace/events/intel_iommu.h |  60 +++++
>  5 files changed, 553 insertions(+), 24 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
> 

[Jacob Pan]

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-23 19:27 ` [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Jacob Pan
@ 2019-09-23 20:25   ` Raj, Ashok
  2019-09-24  4:40     ` Lu Baolu
  2019-09-24  7:00     ` Tian, Kevin
  2019-09-24  4:27   ` Lu Baolu
  1 sibling, 2 replies; 38+ messages in thread
From: Raj, Ashok @ 2019-09-23 20:25 UTC (permalink / raw)
  To: Jacob Pan
  Cc: kevin.tian, Ashok Raj, kvm, sanjay.k.kumar, yi.y.sun, iommu,
	linux-kernel, Alex Williamson, David Woodhouse

Hi Jacob

On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
> > 
> > In VT-d 3.0, scalable mode is introduced, which offers two level
> > translation page tables and nested translation mode. Regards to
> > GIOVA support, it can be simplified by 1) moving the GIOVA support
> > over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> > 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> > second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> > dual stage) translation in host. Compared with current shadow GIOVA
> > support, the new approach is more secure and software is simplified
> > as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> > when an IOVA mapping in vIOMMU is torn down.
> > 
> >      .-----------.
> >      |  vIOMMU   |
> >      |-----------|                 .-----------.
> >      |           |IOTLB flush trap |   QEMU    |
> >      .-----------.    (unmap)      |-----------|
> >      | GVA->GPA  |---------------->|           |
> >      '-----------'                 '-----------'
> >      |           |                       |
> >      '-----------'                       |
> >            <------------------------------
> >            |      VFIO/IOMMU          
> >            |  cache invalidation and  
> >            | guest gpd bind interfaces
> >            v
> For vSVA, the guest PGD bind interface will mark the PASID as guest
> PASID and will inject page request into the guest. In FL gIOVA case, I
> guess we are assuming there is no page fault for GIOVA. I will need to
> add a flag in the gpgd bind such that any PRS will be auto responded
> with invalid.

Is there a real need to enforce this? I'm not sure if there is any
limitation in the spec, and if so, can the guest check that instead?

Also I believe the idea is to overcommit PASID#0 for such uses. I
thought we had a capability to expose this to the vIOMMU as well. Not
sure if this is already documented; if not, it should be in the next
rev.


> 
> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> and DMA API for native usage will continue to be SL only?
> >      .-----------.
> >      |  pIOMMU   |
> >      |-----------|
> >      .-----------.
> >      | GVA->GPA  |<---First level
> >      '-----------'
> >      | GPA->HPA  |<---Scond level

s/Scond/Second

> >      '-----------'
> >      '-----------'
> > 
> > This patch series only aims to achieve the first goal, a.k.a using
> > first level translation for IOVA mappings in vIOMMU. I am sending
> > it out for your comments. Any comments, suggestions and concerns are
> > welcomed.
> > 
> 
> 
> > Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> > Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> > Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> > Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> > 
> > Lu Baolu (4):
> >   iommu/vt-d: Move domain_flush_cache helper into header
> >   iommu/vt-d: Add first level page table interfaces
> >   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
> >   iommu/vt-d: Identify domains using first level page table
> > 
> >  drivers/iommu/Makefile             |   2 +-
> >  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
> >  drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
> >  include/linux/intel-iommu.h        |  31 ++-
> >  include/trace/events/intel_iommu.h |  60 +++++
> >  5 files changed, 553 insertions(+), 24 deletions(-)
> >  create mode 100644 drivers/iommu/intel-pgtable.c
> > 
> 
> [Jacob Pan]

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-23 12:24 ` [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces Lu Baolu
@ 2019-09-23 20:31   ` Raj, Ashok
  2019-09-24  1:38     ` Lu Baolu
  2019-09-25  5:21   ` Peter Xu
  1 sibling, 1 reply; 38+ messages in thread
From: Raj, Ashok @ 2019-09-23 20:31 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, Ashok Raj, kvm, sanjay.k.kumar, yi.y.sun,
	iommu, linux-kernel, Alex Williamson, David Woodhouse

On Mon, Sep 23, 2019 at 08:24:52PM +0800, Lu Baolu wrote:
> This adds functions to manipulate first level page tables
> which could be used by a scalale mode capable IOMMU unit.

s/scalale/scalable

> 
> intel_mmmap_range(domain, addr, end, phys_addr, prot)

Maybe think of a different name..? mmmap seems a bit weird :-)

>  - Map an iova range of [addr, end) to the physical memory
>    started at @phys_addr with the @prot permissions.
> 
> intel_mmunmap_range(domain, addr, end)
>  - Tear down the map of an iova range [addr, end). A page
>    list will be returned which will be freed after iotlb
>    flushing.
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-23 20:31   ` Raj, Ashok
@ 2019-09-24  1:38     ` Lu Baolu
  2019-09-25  4:30       ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-24  1:38 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: kevin.tian, Yi Sun, kvm, sanjay.k.kumar, yi.y.sun, iommu,
	linux-kernel, Alex Williamson, David Woodhouse

Hi Ashok,

On 9/24/19 4:31 AM, Raj, Ashok wrote:
> On Mon, Sep 23, 2019 at 08:24:52PM +0800, Lu Baolu wrote:
>> This adds functions to manipulate first level page tables
>> which could be used by a scalale mode capable IOMMU unit.
> 
> s/scalale/scalable

Yes.

> 
>>
>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
> 
> Maybe think of a different name..? mmmap seems a bit weird :-)

Yes. I don't like it either. I've thought about it and haven't
figured out a satisfying one. Do you have any suggestions?

Best regards,
Baolu

> 
>>   - Map an iova range of [addr, end) to the physical memory
>>     started at @phys_addr with the @prot permissions.
>>
>> intel_mmunmap_range(domain, addr, end)
>>   - Tear down the map of an iova range [addr, end). A page
>>     list will be returned which will be freed after iotlb
>>     flushing.
>>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Liu Yi L <yi.l.liu@intel.com>
>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> 

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-23 19:27 ` [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Jacob Pan
  2019-09-23 20:25   ` Raj, Ashok
@ 2019-09-24  4:27   ` Lu Baolu
  1 sibling, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-24  4:27 UTC (permalink / raw)
  To: Jacob Pan
  Cc: kevin.tian, ashok.raj, kvm, sanjay.k.kumar, yi.y.sun, iommu,
	linux-kernel, Alex Williamson, David Woodhouse

Hi Jacob,

On 9/24/19 3:27 AM, Jacob Pan wrote:
> Hi Baolu,
> 
> On Mon, 23 Sep 2019 20:24:50 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> This patchset aims to move IOVA (I/O Virtual Address) translation
>> to 1st-level page table under scalable mode. The major purpose of
>> this effort is to make guest IOVA support more efficient.
>>
>> As Intel VT-d architecture offers caching-mode, guest IOVA (GIOVA)
>> support is now implemented in a shadow page manner. The device
>> simulation software, like QEMU, has to figure out GIOVA->GPA mapping
>> and writes to a shadowed page table, which will be used by pIOMMU.
>> Each time when mappings are created or destroyed in vIOMMU, the
>> simulation software will intervene. The change on GIOVA->GPA will be
>> shadowed to host, and the pIOMMU will be updated via VFIO/IOMMU
>> interfaces.
>>
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .--------------------.
>>       |           |IOTLB flush trap |        QEMU        |
>>       .-----------. (map/unmap)     |--------------------|
>>       | GVA->GPA  |---------------->|      .----------.  |
>>       '-----------'                 |      | GPA->HPA |  |
>>       |           |                 |      '----------'  |
>>       '-----------'                 |                    |
>>                                     |                    |
>>                                     '--------------------'
>>                                                  |
>>              <------------------------------------
>>              |
>>              v VFIO/IOMMU API
>>        .-----------.
>>        |  pIOMMU   |
>>        |-----------|
>>        |           |
>>        .-----------.
>>        | GVA->HPA  |
>>        '-----------'
>>        |           |
>>        '-----------'
>>
>> In VT-d 3.0, scalable mode is introduced, which offers two level
>> translation page tables and nested translation mode. Regards to
>> GIOVA support, it can be simplified by 1) moving the GIOVA support
>> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
>> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
>> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
>> dual stage) translation in host. Compared with current shadow GIOVA
>> support, the new approach is more secure and software is simplified
>> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
>> when an IOVA mapping in vIOMMU is torn down.
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .-----------.
>>       |           |IOTLB flush trap |   QEMU    |
>>       .-----------.    (unmap)      |-----------|
>>       | GVA->GPA  |---------------->|           |
>>       '-----------'                 '-----------'
>>       |           |                       |
>>       '-----------'                       |
>>             <------------------------------
>>             |      VFIO/IOMMU
>>             |  cache invalidation and
>>             | guest gpd bind interfaces
>>             v
> For vSVA, the guest PGD bind interface will mark the PASID as guest
> PASID and will inject page request into the guest. In FL gIOVA case, I
> guess we are assuming there is no page fault for GIOVA. I will need to
> add a flag in the gpgd bind such that any PRS will be auto responded
> with invalid.

There should be no page fault. The pages should have been pinned.

> 
> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> and DMA API for native usage will continue to be SL only?

Yes. There isn't such a use case as far as I can see.

Best regards,
Baolu

>>       .-----------.
>>       |  pIOMMU   |
>>       |-----------|
>>       .-----------.
>>       | GVA->GPA  |<---First level
>>       '-----------'
>>       | GPA->HPA  |<---Scond level
>>       '-----------'
>>       '-----------'
>>
>> This patch series only aims to achieve the first goal, a.k.a using
>> first level translation for IOVA mappings in vIOMMU. I am sending
>> it out for your comments. Any comments, suggestions and concerns are
>> welcomed.
>>
> 
> 
>> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
>> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
>> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
>> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
>>
>> Lu Baolu (4):
>>    iommu/vt-d: Move domain_flush_cache helper into header
>>    iommu/vt-d: Add first level page table interfaces
>>    iommu/vt-d: Map/unmap domain with mmmap/mmunmap
>>    iommu/vt-d: Identify domains using first level page table
>>
>>   drivers/iommu/Makefile             |   2 +-
>>   drivers/iommu/intel-iommu.c        | 142 ++++++++++--
>>   drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
>>   include/linux/intel-iommu.h        |  31 ++-
>>   include/trace/events/intel_iommu.h |  60 +++++
>>   5 files changed, 553 insertions(+), 24 deletions(-)
>>   create mode 100644 drivers/iommu/intel-pgtable.c
>>
> 
> [Jacob Pan]
> 

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-23 20:25   ` Raj, Ashok
@ 2019-09-24  4:40     ` Lu Baolu
  2019-09-24  7:00     ` Tian, Kevin
  1 sibling, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-24  4:40 UTC (permalink / raw)
  To: Raj, Ashok, Jacob Pan
  Cc: kevin.tian, kvm, sanjay.k.kumar, yi.y.sun, iommu, linux-kernel,
	Alex Williamson, David Woodhouse

Hi,

On 9/24/19 4:25 AM, Raj, Ashok wrote:
> Hi Jacob
> 
> On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
>>>
>>> In VT-d 3.0, scalable mode is introduced, which offers two level
>>> translation page tables and nested translation mode. Regards to
>>> GIOVA support, it can be simplified by 1) moving the GIOVA support
>>> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
>>> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
>>> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
>>> dual stage) translation in host. Compared with current shadow GIOVA
>>> support, the new approach is more secure and software is simplified
>>> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
>>> when an IOVA mapping in vIOMMU is torn down.
>>>
>>>       .-----------.
>>>       |  vIOMMU   |
>>>       |-----------|                 .-----------.
>>>       |           |IOTLB flush trap |   QEMU    |
>>>       .-----------.    (unmap)      |-----------|
>>>       | GVA->GPA  |---------------->|           |
>>>       '-----------'                 '-----------'
>>>       |           |                       |
>>>       '-----------'                       |
>>>             <------------------------------
>>>             |      VFIO/IOMMU
>>>             |  cache invalidation and
>>>             | guest gpd bind interfaces
>>>             v
>> For vSVA, the guest PGD bind interface will mark the PASID as guest
>> PASID and will inject page request into the guest. In FL gIOVA case, I
>> guess we are assuming there is no page fault for GIOVA. I will need to
>> add a flag in the gpgd bind such that any PRS will be auto responded
>> with invalid.
> 
> Is there real need to enforce this? I'm not sure if there is any
> limitation in the spec, and if so, can the guest check that instead?

For the FL gIOVA case, gPASID is always 0. If a physical device is
passed through, hPASID is also 0; if it is an mdev device (representing
an ADI) instead, hPASID would be the PASID corresponding to the ADI.
The simulation software (i.e. QEMU) maintains a map between gPASID and
hPASID.

I second Ashok's idea. We don't need to distinguish these two cases in
the API, and a page request interrupt can be handled in the guest as an
unrecoverable one.

> 
> Also i believe the idea is to overcommit PASID#0 such uses. Thought
> we had a capability to expose this to the vIOMMU as well. Not sure if this
> is already documented, if not should be up in the next rev.
> 
> 
>>
>> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
>> and DMA API for native usage will continue to be SL only?
>>>       .-----------.
>>>       |  pIOMMU   |
>>>       |-----------|
>>>       .-----------.
>>>       | GVA->GPA  |<---First level
>>>       '-----------'
>>>       | GPA->HPA  |<---Scond level
> 
> s/Scond/Second

Yes. Thanks!

Best regards,
Baolu

* RE: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-23 20:25   ` Raj, Ashok
  2019-09-24  4:40     ` Lu Baolu
@ 2019-09-24  7:00     ` Tian, Kevin
  2019-09-25  2:48       ` Lu Baolu
  1 sibling, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2019-09-24  7:00 UTC (permalink / raw)
  To: Raj, Ashok, Jacob Pan
  Cc: kvm, Kumar, Sanjay K, Sun, Yi Y, iommu, linux-kernel,
	Alex Williamson, David Woodhouse

> From: Raj, Ashok
> Sent: Tuesday, September 24, 2019 4:26 AM
> 
> Hi Jacob
> 
> On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
> > >
> > > In VT-d 3.0, scalable mode is introduced, which offers two level
> > > translation page tables and nested translation mode. Regards to
> > > GIOVA support, it can be simplified by 1) moving the GIOVA support
> > > over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> > > 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> > > second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> > > dual stage) translation in host. Compared with current shadow GIOVA
> > > support, the new approach is more secure and software is simplified
> > > as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> > > when an IOVA mapping in vIOMMU is torn down.
> > >
> > >      .-----------.
> > >      |  vIOMMU   |
> > >      |-----------|                 .-----------.
> > >      |           |IOTLB flush trap |   QEMU    |
> > >      .-----------.    (unmap)      |-----------|
> > >      | GVA->GPA  |---------------->|           |
> > >      '-----------'                 '-----------'

GVA should be replaced by GIOVA in all the figures.

> > >      |           |                       |
> > >      '-----------'                       |
> > >            <------------------------------
> > >            |      VFIO/IOMMU
> > >            |  cache invalidation and
> > >            | guest gpd bind interfaces
> > >            v
> > For vSVA, the guest PGD bind interface will mark the PASID as guest
> > PASID and will inject page request into the guest. In FL gIOVA case, I
> > guess we are assuming there is no page fault for GIOVA. I will need to
> > add a flag in the gpgd bind such that any PRS will be auto responded
> > with invalid.
> 
> Is there real need to enforce this? I'm not sure if there is any
> limitation in the spec, and if so, can the guest check that instead?

Whether to allow page faults is not usage specific (GIOVA, GVA, etc.).
It's really about the device capability and the IOMMU capability. VT-d
allows page faults on both levels, so we don't need to enforce it.

Btw, in the future we may need an interface to tell VFIO whether a
device is 100% DMA-faultable, so that pinning can be avoided. But for
now I'm not sure how such knowledge can be retrieved without
device-specific knowledge. The PCI PRI capability only indicates that
the device supports page faults, not that the device enables page
faults on its every DMA access. Maybe we need a new bit in the PRI
capability for that purpose.

> 
> Also i believe the idea is to overcommit PASID#0 such uses. Thought
> we had a capability to expose this to the vIOMMU as well. Not sure if this
> is already documented, if not should be up in the next rev.
> 
> 
> >
> > Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> > and DMA API for native usage will continue to be SL only?
> > >      .-----------.
> > >      |  pIOMMU   |
> > >      |-----------|
> > >      .-----------.
> > >      | GVA->GPA  |<---First level
> > >      '-----------'
> > >      | GPA->HPA  |<---Scond level
> 
> s/Scond/Second
> 
> > >      '-----------'
> > >      '-----------'
> > >
> > > This patch series only aims to achieve the first goal, a.k.a using

First goal? Then what are the other goals? I didn't spot such
information.

Also, earlier you mentioned that the new approach (nested) is more
secure than shadowing. Why?

> > > first level translation for IOVA mappings in vIOMMU. I am sending
> > > it out for your comments. Any comments, suggestions and concerns are
> > > welcomed.
> > >
> >
> >
> > > Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> > > Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> > > Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> > > Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> > >
> > > Lu Baolu (4):
> > >   iommu/vt-d: Move domain_flush_cache helper into header
> > >   iommu/vt-d: Add first level page table interfaces
> > >   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
> > >   iommu/vt-d: Identify domains using first level page table
> > >
> > >  drivers/iommu/Makefile             |   2 +-
> > >  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
> > >  drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
> > >  include/linux/intel-iommu.h        |  31 ++-
> > >  include/trace/events/intel_iommu.h |  60 +++++
> > >  5 files changed, 553 insertions(+), 24 deletions(-)
> > >  create mode 100644 drivers/iommu/intel-pgtable.c
> > >
> >
> > [Jacob Pan]

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-24  7:00     ` Tian, Kevin
@ 2019-09-25  2:48       ` Lu Baolu
  2019-09-25  6:56         ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-25  2:48 UTC (permalink / raw)
  To: Tian, Kevin, Raj, Ashok, Jacob Pan
  Cc: kvm, Kumar, Sanjay K, Sun, Yi Y, iommu, linux-kernel,
	Alex Williamson, David Woodhouse

Hi Kevin,

On 9/24/19 3:00 PM, Tian, Kevin wrote:
>>>>       '-----------'
>>>>       '-----------'
>>>>
>>>> This patch series only aims to achieve the first goal, a.k.a using
> first goal? then what are other goals? I didn't spot such information.
> 

The overall goal is to use IOMMU nested mode to avoid the shadow page
table and the VM exit taken when mapping a gIOVA. This includes the
four steps below (maybe not accurate, but you get the point):

1) GIOVA mappings over 1st-level page table;
2) binding the vIOMMU 1st-level page table to the pIOMMU;
3) using the pIOMMU second level for GPA->HPA translation;
4) enabling nested (a.k.a. dual-stage) translation in the host.

This patch set aims to achieve 1).

> Also earlier you mentioned the new approach (nested) is more secure
> than shadowing. why?
> 

My bad! After reconsideration, I realized that it's not "more secure".

Thanks for pointing this out.

Best regards,
Baolu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-24  1:38     ` Lu Baolu
@ 2019-09-25  4:30       ` Peter Xu
  2019-09-25  4:38         ` Tian, Kevin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  4:30 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Alex Williamson, kevin.tian, Yi Sun, Raj, Ashok, kvm,
	sanjay.k.kumar, iommu, linux-kernel, yi.y.sun, David Woodhouse

On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > > intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > 
> > Maybe think of a different name..? mmmap seems a bit weird :-)
> 
> Yes. I don't like it either. I've thought about it and haven't
> figured out a satisfied one. Do you have any suggestions?

How about at least split the word using "_"?  Like "mm_map", then
apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
misread as mmap() which is totally irrelevant to this...

Regards,

-- 
Peter Xu

* RE: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  4:30       ` Peter Xu
@ 2019-09-25  4:38         ` Tian, Kevin
  2019-09-25  5:24           ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  4:38 UTC (permalink / raw)
  To: Peter Xu, Lu Baolu
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 12:31 PM
> 
> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > > > intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > >
> > > Maybe think of a different name..? mmmap seems a bit weird :-)
> >
> > Yes. I don't like it either. I've thought about it and haven't
> > figured out a satisfied one. Do you have any suggestions?
> 
> How about at least split the word using "_"?  Like "mm_map", then
> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> misread as mmap() which is totally irrelevant to this...
> 

what is the point of keeping 'mm' here? replace it with 'iommu'?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap
  2019-09-23 12:24 ` [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap Lu Baolu
@ 2019-09-25  5:00   ` Tian, Kevin
  2019-09-25  7:06     ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  5:00 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu, linux-kernel, Sun, Yi Y

> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Monday, September 23, 2019 8:25 PM
> 
> If a dmar domain has DOMAIN_FLAG_FIRST_LEVEL_TRANS bit set
> in its flags, IOMMU will use the first level page table for
> translation. Hence, we need to map or unmap addresses in the
> first level page table.
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 94 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 82 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 9cfe8098d993..103480016010 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -168,6 +168,11 @@ static inline unsigned long virt_to_dma_pfn(void *p)
>  	return page_to_dma_pfn(virt_to_page(p));
>  }
> 
> +static inline unsigned long dma_pfn_to_addr(unsigned long pfn)
> +{
> +	return pfn << VTD_PAGE_SHIFT;
> +}
> +
>  /* global iommu list, set NULL for ignored DMAR units */
>  static struct intel_iommu **g_iommus;
> 
> @@ -307,6 +312,9 @@ static int hw_pass_through = 1;
>   */
>  #define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
> 
> +/* Domain uses first level translation for DMA remapping. */
> +#define DOMAIN_FLAG_FIRST_LEVEL_TRANS		BIT(2)
> +
>  #define for_each_domain_iommu(idx, domain)			\
>  	for (idx = 0; idx < g_num_of_iommus; idx++)		\
>  		if (domain->iommu_refcnt[idx])
> @@ -552,6 +560,11 @@ static inline int domain_type_is_si(struct dmar_domain *domain)
>  	return domain->flags & DOMAIN_FLAG_STATIC_IDENTITY;
>  }
> 
> +static inline int domain_type_is_flt(struct dmar_domain *domain)
> +{
> +	return domain->flags & DOMAIN_FLAG_FIRST_LEVEL_TRANS;
> +}
> +
>  static inline int domain_pfn_supported(struct dmar_domain *domain,
>  				       unsigned long pfn)
>  {
> @@ -1147,8 +1160,15 @@ static struct page *domain_unmap(struct dmar_domain *domain,
>  	BUG_ON(start_pfn > last_pfn);
> 
>  	/* we don't need lock here; nobody else touches the iova range */
> -	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
> -				       domain->pgd, 0, start_pfn, last_pfn, NULL);
> +	if (domain_type_is_flt(domain))
> +		freelist = intel_mmunmap_range(domain,
> +					       dma_pfn_to_addr(start_pfn),
> +					       dma_pfn_to_addr(last_pfn + 1));
> +	else
> +		freelist = dma_pte_clear_level(domain,
> +					       agaw_to_level(domain->agaw),
> +					       domain->pgd, 0, start_pfn,
> +					       last_pfn, NULL);

what about providing a unified interface at the caller side, then having 
the level differentiated within the interface?

> 
>  	/* free pgd */
>  	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
> @@ -2213,9 +2233,10 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
>  	return level;
>  }
> 
> -static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> -			    struct scatterlist *sg, unsigned long phys_pfn,
> -			    unsigned long nr_pages, int prot)
> +static int
> +__domain_mapping_dma(struct dmar_domain *domain, unsigned long iov_pfn,
> +		     struct scatterlist *sg, unsigned long phys_pfn,
> +		     unsigned long nr_pages, int prot)
>  {
>  	struct dma_pte *first_pte = NULL, *pte = NULL;
>  	phys_addr_t uninitialized_var(pteval);
> @@ -2223,13 +2244,6 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>  	unsigned int largepage_lvl = 0;
>  	unsigned long lvl_pages = 0;
> 
> -	BUG_ON(!domain_pfn_supported(domain, iov_pfn + nr_pages - 1));
> -
> -	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
> -		return -EINVAL;
> -
> -	prot &= DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP;
> -
>  	if (!sg) {
>  		sg_res = nr_pages;
>  		pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | prot;
> @@ -2328,6 +2342,62 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>  	return 0;
>  }
> 
> +static int
> +__domain_mapping_mm(struct dmar_domain *domain, unsigned long iov_pfn,
> +		    struct scatterlist *sg, unsigned long phys_pfn,
> +		    unsigned long nr_pages, int prot)
> +{
> +	int ret = 0;
> +
> +	if (!sg)
> +		return intel_mmmap_range(domain, dma_pfn_to_addr(iov_pfn),
> +					 dma_pfn_to_addr(iov_pfn + nr_pages),
> +					 dma_pfn_to_addr(phys_pfn), prot);
> +
> +	while (nr_pages > 0) {
> +		unsigned long sg_pages, phys;
> +		unsigned long pgoff = sg->offset & ~PAGE_MASK;
> +
> +		sg_pages = aligned_nrpages(sg->offset, sg->length);
> +		phys = sg_phys(sg) - pgoff;
> +
> +		ret = intel_mmmap_range(domain, dma_pfn_to_addr(iov_pfn),
> +					dma_pfn_to_addr(iov_pfn + sg_pages),
> +					phys, prot);
> +		if (ret)
> +			break;
> +
> +		sg->dma_address = ((dma_addr_t)dma_pfn_to_addr(iov_pfn)) + pgoff;
> +		sg->dma_length = sg->length;
> +
> +		nr_pages -= sg_pages;
> +		iov_pfn += sg_pages;
> +		sg = sg_next(sg);
> +	}
> +
> +	return ret;
> +}
> +
> +static int
> +__domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> +		 struct scatterlist *sg, unsigned long phys_pfn,
> +		 unsigned long nr_pages, int prot)
> +{
> +	BUG_ON(!domain_pfn_supported(domain, iov_pfn + nr_pages - 1));
> +
> +	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
> +		return -EINVAL;
> +
> +	prot &= DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP;
> +
> +	if (domain_type_is_flt(domain))
> +		return __domain_mapping_mm(domain, iov_pfn, sg,
> +					   phys_pfn, nr_pages, prot);
> +	else
> +		return __domain_mapping_dma(domain, iov_pfn, sg,
> +					    phys_pfn, nr_pages, prot);
> +}
> +
>  static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>  			  struct scatterlist *sg, unsigned long phys_pfn,
>  			  unsigned long nr_pages, int prot)
> --
> 2.17.1
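For readers skimming the diff: the dma_pfn_to_addr() helper added above is a plain shift, and the domain_unmap() hunk turns the inclusive last_pfn into an exclusive end address via last_pfn + 1. A standalone sketch, assuming the usual 4K base page (VTD_PAGE_SHIFT of 12); the inverse helper is illustrative only, not part of the patch:

```c
#define VTD_PAGE_SHIFT 12	/* assumption: 4K VT-d base pages */

/* Same shape as the helper added in the patch above. */
static unsigned long dma_pfn_to_addr(unsigned long pfn)
{
	return pfn << VTD_PAGE_SHIFT;
}

/* Inverse helper for illustration (not from the patch). */
static unsigned long dma_addr_to_pfn(unsigned long addr)
{
	return addr >> VTD_PAGE_SHIFT;
}
```

So a call like intel_mmunmap_range(domain, dma_pfn_to_addr(start_pfn), dma_pfn_to_addr(last_pfn + 1)) covers the half-open byte range for pfns start_pfn..last_pfn inclusive.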


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-23 12:24 ` [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces Lu Baolu
  2019-09-23 20:31   ` Raj, Ashok
@ 2019-09-25  5:21   ` Peter Xu
  2019-09-26  2:35     ` Lu Baolu
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  5:21 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

On Mon, Sep 23, 2019 at 08:24:52PM +0800, Lu Baolu wrote:
> This adds functions to manipulate first level page tables
> which could be used by a scalable mode capable IOMMU unit.
> 
> intel_mmmap_range(domain, addr, end, phys_addr, prot)
>  - Map an iova range of [addr, end) to the physical memory
>    started at @phys_addr with the @prot permissions.
> 
> intel_mmunmap_range(domain, addr, end)
>  - Tear down the map of an iova range [addr, end). A page
>    list will be returned which will be freed after iotlb
>    flushing.
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h        |  24 +-
>  include/trace/events/intel_iommu.h |  60 +++++
>  4 files changed, 426 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
> 
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 4f405f926e73..dc550e14cc58 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -17,7 +17,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
>  obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
>  obj-$(CONFIG_DMAR_TABLE) += dmar.o
>  obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
> -obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
> +obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
>  obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
>  obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
>  obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
> diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
> new file mode 100644
> index 000000000000..8e95978cd381
> --- /dev/null
> +++ b/drivers/iommu/intel-pgtable.c
> @@ -0,0 +1,342 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/**
> + * intel-pgtable.c - Intel IOMMU page table manipulation library

Could this be a bit misleading?  Normally I'll use "IOMMU page table"
to refer to the 2nd level page table only, and I've always
understood it as "the new IOMMU will understand the MMU page table as
the 1st level".  At least mention "IOMMU 1st level page table"?

> + *
> + * Copyright (C) 2019 Intel Corporation
> + *
> + * Author: Lu Baolu <baolu.lu@linux.intel.com>
> + */
> +
> +#define pr_fmt(fmt)     "DMAR: " fmt
> +#include <linux/vmalloc.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/io.h>
> +#include <linux/export.h>
> +#include <linux/intel-iommu.h>
> +#include <asm/cacheflush.h>
> +#include <asm/pgtable.h>
> +#include <asm/pgalloc.h>
> +#include <trace/events/intel_iommu.h>
> +
> +#ifdef CONFIG_X86
> +/*
> + * mmmap: Map a range of IO virtual address to physical addresses.

"... to physical addresses using MMU page table"?

Might be clearer?

> + */
> +#define pgtable_populate(domain, nm)					\
> +do {									\
> +	void *__new = alloc_pgtable_page(domain->nid);			\
> +	if (!__new)							\
> +		return -ENOMEM;						\
> +	smp_wmb();							\

Could I ask what's this wmb used for?

> +	spin_lock(&(domain)->page_table_lock);				\

Is this intended to lock here instead of taking the lock during the
whole page table walk?  Is it safe?

Taking the example where nm==PTE: when we reach here how do we
guarantee that the PMD page that has this PTE is still valid?

> +	if (nm ## _present(*nm)) {					\
> +		free_pgtable_page(__new);				\
> +	} else {							\
> +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\

It seems to me that PV could trap calls to set_pte().  Then these
could also be trapped by e.g. Xen?  Are these traps needed?  Is there a
side effect?  I'm totally not familiar with this, but just asking aloud...

> +		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
> +	}								\
> +	spin_unlock(&(domain)->page_table_lock);			\
> +} while(0);
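The allocate-outside-the-lock idiom in the quoted macro — allocate the new table page first, then under the lock either publish it or free it if another thread raced in — can be sketched in plain user-space C. All names below are illustrative, not from the patch, and a pthread mutex stands in for the domain's page_table_lock:

```c
#include <stdlib.h>
#include <pthread.h>

/* Sketch of pgtable_populate(): the page for the next table level is
 * allocated outside the lock; under the lock we either publish it into
 * the slot or, if another thread won the race, free our copy. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

static int populate_slot(void **slot)
{
	void *new_page = calloc(1, 4096);

	if (!new_page)
		return -1;		/* -ENOMEM in the kernel version */

	pthread_mutex_lock(&table_lock);
	if (*slot)
		free(new_page);		/* lost the race: keep existing table */
	else
		*slot = new_page;	/* won the race: publish new table */
	pthread_mutex_unlock(&table_lock);

	return 0;
}
```

The questions above still apply to the real macro, of course — this sketch only shows the race on one slot, not the validity of the table that contains the slot.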
> +
> +static int
> +mmmap_pte_range(struct dmar_domain *domain, pmd_t *pmd, unsigned long addr,
> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	pte_t *pte, *first_pte;
> +	u64 pfn;
> +
> +	pfn = phys_addr >> PAGE_SHIFT;
> +	if (unlikely(pmd_none(*pmd)))
> +		pgtable_populate(domain, pmd);
> +
> +	first_pte = pte = pte_offset_kernel(pmd, addr);
> +
> +	do {
> +		set_pte(pte, pfn_pte(pfn, prot));
> +		pfn++;
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +
> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> +	return 0;
> +}
> +
> +static int
> +mmmap_pmd_range(struct dmar_domain *domain, pud_t *pud, unsigned long addr,
> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	pmd_t *pmd;
> +
> +	if (unlikely(pud_none(*pud)))
> +		pgtable_populate(domain, pud);
> +	pmd = pmd_offset(pud, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = pmd_addr_end(addr, end);
> +		if (mmmap_pte_range(domain, pmd, addr, next,
> +				    phys_addr + addr, prot))
> +			return -ENOMEM;

How about return the errcode of mmmap_pte_range() directly?

> +	} while (pmd++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int
> +mmmap_pud_range(struct dmar_domain *domain, p4d_t *p4d, unsigned long addr,
> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	pud_t *pud;
> +
> +	if (unlikely(p4d_none(*p4d)))
> +		pgtable_populate(domain, p4d);
> +
> +	pud = pud_offset(p4d, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = pud_addr_end(addr, end);
> +		if (mmmap_pmd_range(domain, pud, addr, next,
> +				    phys_addr + addr, prot))
> +			return -ENOMEM;

Same.

> +	} while (pud++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int
> +mmmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd, unsigned long addr,
> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> +	unsigned long next;
> +	p4d_t *p4d;
> +
> +	if (cpu_feature_enabled(X86_FEATURE_LA57) && unlikely(pgd_none(*pgd)))
> +		pgtable_populate(domain, pgd);
> +
> +	p4d = p4d_offset(pgd, addr);
> +
> +	phys_addr -= addr;
> +	do {
> +		next = p4d_addr_end(addr, end);
> +		if (mmmap_pud_range(domain, p4d, addr, next,
> +				    phys_addr + addr, prot))
> +			return -ENOMEM;

Same.

> +	} while (p4d++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
> +		      unsigned long end, phys_addr_t phys_addr, int dma_prot)
> +{
> +	unsigned long next;
> +	pgprot_t prot;
> +	pgd_t *pgd;
> +
> +	trace_domain_mm_map(domain, addr, end, phys_addr);
> +
> +	/*
> +	 * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
> +	 * for a pte that requires write operation.
> +	 */
> +	prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
> +	BUG_ON(addr >= end);
> +
> +	phys_addr -= addr;
> +	pgd = pgd_offset_pgd(domain->pgd, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (mmmap_p4d_range(domain, pgd, addr, next,
> +				    phys_addr + addr, prot))
> +			return -ENOMEM;

Same.

> +	} while (pgd++, addr = next, addr != end);
> +
> +	return 0;
> +}
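The walk above leans on the pgd/p4d/pud/pmd_addr_end() helpers to clamp each step of the do/while loop. The core computation can be sketched as below — simplified, since the real kernel macros also guard against address wrap-around at the top of the address space:

```c
/* Simplified sketch of the pXd_addr_end() clamping used in the walk:
 * advance to the next table-entry boundary, but never past 'end'.
 * 'span' is the VA range one entry covers; the 2M value below is
 * illustrative (one PMD entry with 4K pages on x86-64). */
#define PMD_SPAN (1UL << 21)

static unsigned long level_addr_end(unsigned long addr, unsigned long end,
				    unsigned long span)
{
	unsigned long boundary = (addr + span) & ~(span - 1);

	return boundary < end ? boundary : end;
}
```

Each loop iteration then handles exactly one entry's worth of the range, which is why the sub-level helper can be called with [addr, next) and a per-level pointer increment.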
> +
> +/*
> + * mmunmap: Unmap an existing mapping between a range of IO virtual address
> + *          and physical addresses.
> + */
> +static struct page *
> +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> +		  unsigned long addr, unsigned long end,
> +		  struct page *freelist, bool reclaim)
> +{
> +	int i;
> +	unsigned long start;
> +	pte_t *pte, *first_pte;
> +
> +	start = addr;
> +	pte = pte_offset_kernel(pmd, addr);
> +	first_pte = pte;
> +	do {
> +		set_pte(pte, __pte(0));
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +
> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> +	/* Add page to free list if all entries are empty. */
> +	if (reclaim) {

Shouldn't we be able to tell whether to reclaim from the (addr, end)
specified, as long as they cover the whole range of this PMD?

Also I noticed that this value right now is passed in from the very
top of the unmap() call.  I didn't really catch the point of that...

I'll have similar question to below a few places but I'll skip to
comment after I understand this one.

> +		struct page *pte_page;
> +
> +		pte = (pte_t *)pmd_page_vaddr(*pmd);
> +		for (i = 0; i < PTRS_PER_PTE; i++)
> +			if (!pte || !pte_none(pte[i]))
> +				goto pte_out;
> +
> +		pte_page = pmd_page(*pmd);
> +		pte_page->freelist = freelist;
> +		freelist = pte_page;
> +		pmd_clear(pmd);
> +		domain_flush_cache(domain, pmd, sizeof(pmd_t));
> +	}
> +
> +pte_out:
> +	return freelist;
> +}
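The reclaim branch being questioned scans the entire PTE page for emptiness before pushing it onto the freelist. In isolation the check looks like this sketch (PTRS_PER_PTE is 512 for 4K pages with 8-byte entries; the function name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PTE 512	/* 4K page / 8-byte entries */

/* Sketch of the "all entries empty" test in mmunmap_pte_range():
 * a table page may only be freed (queued on the freelist) once
 * every entry in it is clear. */
static bool pte_table_empty(const unsigned long *ptes)
{
	size_t i;

	for (i = 0; i < PTRS_PER_PTE; i++)
		if (ptes[i])
			return false;

	return true;
}
```

Which is why the question above matters: if (addr, end) already cover the whole PMD range, the scan's outcome is knowable without walking all 512 entries again.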

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  4:38         ` Tian, Kevin
@ 2019-09-25  5:24           ` Peter Xu
  2019-09-25  6:52             ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  5:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 12:31 PM
> > 
> > On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > > > > intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > > >
> > > > Maybe think of a different name..? mmmap seems a bit weird :-)
> > >
> > > Yes. I don't like it either. I've thought about it and haven't
> > > figured out a satisfactory one. Do you have any suggestions?
> > 
> > How about at least split the word using "_"?  Like "mm_map", then
> > apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> > misread as mmap() which is totally irrelevant to this...
> > 
> 
> what is the point of keeping 'mm' here? replace it with 'iommu'?

I'm not sure what Baolu thought, but to me "mm" itself makes sense
to distinguish this from real IOMMU page tables (because IIUC these will
be MMU page tables).  We can come up with better names, but IMHO
"iommu" can be a bit misleading to let people refer to the 2nd level
page table.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table
  2019-09-23 12:24 ` [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table Lu Baolu
@ 2019-09-25  6:50   ` Peter Xu
  2019-09-25  7:35     ` Tian, Kevin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  6:50 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

On Mon, Sep 23, 2019 at 08:24:54PM +0800, Lu Baolu wrote:
> +/*
> + * Check and return whether first level is used by default for
> + * DMA translation.
> + */
> +static bool first_level_by_default(void)
> +{
> +	struct dmar_drhd_unit *drhd;
> +	struct intel_iommu *iommu;
> +
> +	rcu_read_lock();
> +	for_each_active_iommu(iommu, drhd)
> +		if (!sm_supported(iommu) ||
> +		    !ecap_flts(iommu->ecap) ||
> +		    !cap_caching_mode(iommu->cap))
> +			return false;
> +	rcu_read_unlock();
> +
> +	return true;
> +}

"If no caching mode, then we will not use 1st level."

Hmm, does the vIOMMU need to support caching-mode with the
solution you proposed here?  Caching mode is only necessary for
shadowing AFAICT, and after all you're going to use full-nested,
then... then I would think it's not needed.  And if so, with this
patch 1st level will be disabled. Sounds like a paradox...

I'm thinking what would be the big picture for this to work now: For
the vIOMMU, instead of exposing the caching-mode, I'm thinking maybe
we should expose it with ecap.FLTS=1 while we can keep ecap.SLTS=0
then it's natural that we can only use 1st level translation in the
guest for all the domains (and I assume such an ecap value should
never happen on real hardware, am I right?).

Regards,

-- 
Peter Xu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  5:24           ` Peter Xu
@ 2019-09-25  6:52             ` Lu Baolu
  2019-09-25  7:32               ` Tian, Kevin
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-25  6:52 UTC (permalink / raw)
  To: Peter Xu, Tian, Kevin
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

Hi Peter and Kevin,

On 9/25/19 1:24 PM, Peter Xu wrote:
> On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, September 25, 2019 12:31 PM
>>>
>>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
>>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
>>>>>
>>>>> Maybe think of a different name..? mmmap seems a bit weird :-)
>>>>
>>>> Yes. I don't like it either. I've thought about it and haven't
>>>> figured out a satisfactory one. Do you have any suggestions?
>>>
>>> How about at least split the word using "_"?  Like "mm_map", then
>>> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
>>> misread as mmap() which is totally irrelevant to this...
>>>
>>
>> what is the point of keeping 'mm' here? replace it with 'iommu'?
> 
> I'm not sure what Baolu thought, but to me "mm" itself makes sense
> to distinguish this from real IOMMU page tables (because IIUC these will
> be MMU page tables).  We can come up with better names, but IMHO
> "iommu" can be a bit misleading to let people refer to the 2nd level
> page table.

"mm" represents a CPU (first level) page table;

vs.

"io" represents an IOMMU (second level) page table.

Best regards,
Baolu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  2:48       ` Lu Baolu
@ 2019-09-25  6:56         ` Peter Xu
  2019-09-25  7:21           ` Tian, Kevin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  6:56 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K,
	iommu, linux-kernel, Sun, Yi Y, David Woodhouse

On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> Hi Kevin,
> 
> On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > >       '-----------'
> > > > >       '-----------'
> > > > > 
> > > > > This patch series only aims to achieve the first goal, a.k.a using
> > first goal? then what are other goals? I didn't spot such information.
> > 
> 
> The overall goal is to use IOMMU nested mode to avoid the shadow page table
> and VMEXIT when mapping a gIOVA. This includes the 4 steps below (maybe not
> accurate, but you could get the point.)
> 
> 1) GIOVA mappings over 1st-level page table;
> 2) binding vIOMMU 1st level page table to the pIOMMU;
> 3) using pIOMMU second level for GPA->HPA translation;
> 4) enable nested (a.k.a. dual stage) translation in host.
> 
> This patch set aims to achieve 1).

Would it make sense to use 1st level even for bare-metal to replace
the 2nd level?

What I'm thinking is the DPDK apps - they already have the MMU page
table there for the huge pages. If they can use the 1st level as the
default device page table, they don't even need to map anything, because
they can simply bind the process root page table pointer to the 1st
level page root pointer of the device contexts they use.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap
  2019-09-25  5:00   ` Tian, Kevin
@ 2019-09-25  7:06     ` Lu Baolu
  0 siblings, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-25  7:06 UTC (permalink / raw)
  To: Tian, Kevin, Joerg Roedel, David Woodhouse, Alex Williamson
  Cc: Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu, linux-kernel, Sun, Yi Y

Hi,

On 9/25/19 1:00 PM, Tian, Kevin wrote:
>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>> Sent: Monday, September 23, 2019 8:25 PM
>>
>> If a dmar domain has DOMAIN_FLAG_FIRST_LEVEL_TRANS bit set
>> in its flags, IOMMU will use the first level page table for
>> translation. Hence, we need to map or unmap addresses in the
>> first level page table.
>>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Liu Yi L <yi.l.liu@intel.com>
>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> ---
>>   drivers/iommu/intel-iommu.c | 94 ++++++++++++++++++++++++++++++++-----
>>   1 file changed, 82 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 9cfe8098d993..103480016010 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -168,6 +168,11 @@ static inline unsigned long virt_to_dma_pfn(void *p)
>>   	return page_to_dma_pfn(virt_to_page(p));
>>   }
>>
>> +static inline unsigned long dma_pfn_to_addr(unsigned long pfn)
>> +{
>> +	return pfn << VTD_PAGE_SHIFT;
>> +}
>> +
>>   /* global iommu list, set NULL for ignored DMAR units */
>>   static struct intel_iommu **g_iommus;
>>
>> @@ -307,6 +312,9 @@ static int hw_pass_through = 1;
>>    */
>>   #define DOMAIN_FLAG_LOSE_CHILDREN		BIT(1)
>>
>> +/* Domain uses first level translation for DMA remapping. */
>> +#define DOMAIN_FLAG_FIRST_LEVEL_TRANS		BIT(2)
>> +
>>   #define for_each_domain_iommu(idx, domain)			\
>>   	for (idx = 0; idx < g_num_of_iommus; idx++)		\
>>   		if (domain->iommu_refcnt[idx])
>> @@ -552,6 +560,11 @@ static inline int domain_type_is_si(struct dmar_domain *domain)
>>   	return domain->flags & DOMAIN_FLAG_STATIC_IDENTITY;
>>   }
>>
>> +static inline int domain_type_is_flt(struct dmar_domain *domain)
>> +{
>> +	return domain->flags & DOMAIN_FLAG_FIRST_LEVEL_TRANS;
>> +}
>> +
>>   static inline int domain_pfn_supported(struct dmar_domain *domain,
>>   				       unsigned long pfn)
>>   {
>> @@ -1147,8 +1160,15 @@ static struct page *domain_unmap(struct dmar_domain *domain,
>>   	BUG_ON(start_pfn > last_pfn);
>>
>>   	/* we don't need lock here; nobody else touches the iova range */
>> -	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
>> -				       domain->pgd, 0, start_pfn, last_pfn, NULL);
>> +	if (domain_type_is_flt(domain))
>> +		freelist = intel_mmunmap_range(domain,
>> +					       dma_pfn_to_addr(start_pfn),
>> +					       dma_pfn_to_addr(last_pfn + 1));
>> +	else
>> +		freelist = dma_pte_clear_level(domain,
>> +					       agaw_to_level(domain->agaw),
>> +					       domain->pgd, 0, start_pfn,
>> +					       last_pfn, NULL);
> 
> what about providing a unified interface at the caller side, then having
> the level differentiated within the interface?

Good point! I once thought about adding some ops to struct dmar_domain,
something like:

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ed11ef594378..1dd184f76bfb 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -489,7 +489,14 @@ struct dmar_domain {
        struct list_head auxd;          /* link to device's auxiliary list */
        struct iova_domain iovad;       /* iova's that belong to this domain */

+       /* per domain page table and manipulation ops */
         struct dma_pte  *pgd;           /* virtual address */
+       int (*map)(struct dmar_domain *domain,
+                  unsigned long addr, unsigned long end,
+                  phys_addr_t phys_addr, int dma_prot);
+       struct page *(*unmap)(struct dmar_domain *domain,
+                             unsigned long addr, unsigned long end);
+
         int             gaw;            /* max guest address width */

         /* adjusted guest address width, 0 is level 2 30-bit */

So that this code could simply become:

	freelist = domain->unmap(...);

Best regards,
Baolu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  6:56         ` Peter Xu
@ 2019-09-25  7:21           ` Tian, Kevin
  2019-09-25  7:45             ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  7:21 UTC (permalink / raw)
  To: Peter Xu, Lu Baolu
  Cc: Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 2:57 PM
> 
> On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > Hi Kevin,
> >
> > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > >       '-----------'
> > > > > >       '-----------'
> > > > > >
> > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > first goal? then what are other goals? I didn't spot such information.
> > >
> >
> > The overall goal is to use IOMMU nested mode to avoid the shadow page table
> > and VMEXIT when mapping a gIOVA. This includes the 4 steps below (maybe not
> > accurate, but you could get the point.)
> >
> > 1) GIOVA mappings over 1st-level page table;
> > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > 3) using pIOMMU second level for GPA->HPA translation;
> > 4) enable nested (a.k.a. dual stage) translation in host.
> >
> > This patch set aims to achieve 1).
> 
> Would it make sense to use 1st level even for bare-metal to replace
> the 2nd level?
> 
> > What I'm thinking is the DPDK apps - they already have the MMU page
> > table there for the huge pages. If they can use the 1st level as the
> > default device page table, they don't even need to map anything, because
> > they can simply bind the process root page table pointer to the 1st
> > level page root pointer of the device contexts they use.
> 

Then you need to bear with possible page faults from using the CPU page
table, while most devices don't support that today.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  6:52             ` Lu Baolu
@ 2019-09-25  7:32               ` Tian, Kevin
  2019-09-25  8:35                 ` Peter Xu
  2019-09-26  1:42                 ` Lu Baolu
  0 siblings, 2 replies; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  7:32 UTC (permalink / raw)
  To: Lu Baolu, Peter Xu
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Wednesday, September 25, 2019 2:52 PM
> 
> Hi Peter and Kevin,
> 
> On 9/25/19 1:24 PM, Peter Xu wrote:
> > On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
> >>> From: Peter Xu [mailto:peterx@redhat.com]
> >>> Sent: Wednesday, September 25, 2019 12:31 PM
> >>>
> >>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> >>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
> >>>>>
> >>>>> Maybe think of a different name..? mmmap seems a bit weird :-)
> >>>>
> >>>> Yes. I don't like it either. I've thought about it and haven't
> >>>> figured out a satisfactory one. Do you have any suggestions?
> >>>
> >>> How about at least split the word using "_"?  Like "mm_map", then
> >>> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> >>> misread as mmap() which is totally irrelevant to this...
> >>>
> >>
> >> what is the point of keeping 'mm' here? replace it with 'iommu'?
> >
> > I'm not sure of what Baolu thought, but to me "mm" makes sense itself
> > to identify this from real IOMMU page tables (because IIUC these will
> > be MMU page tables).  We can come up with better names, but IMHO
> > "iommu" can be a bit misleading to let people refer to the 2nd level
> > page table.
> 
> "mm" represents a CPU (first level) page table;
> 
> vs.
> 
> "io" represents an IOMMU (second level) page table.
> 

The IOMMU first level is not equivalent to the CPU page table, though you
can use the latter as the first level (e.g. in SVA). Especially here you
are using the first level for IOVA->GPA, which is not a CPU page table.

btw both levels are for "io", i.e. DMA purposes, from the VT-d p.o.v. They
are just hierarchical structures implemented by VT-d, with slightly
different formats. The specification doesn't limit what you use them for.
In a hypothetical case, an IOMMU may implement exactly the same CPU page
table format and support page faults for both levels. Then you can even
link the CPU page table to the 2nd level for sure.

Maybe we just name it from the VT-d context, e.g. intel_map_first_level_range,
intel_map_second_level_range, and then register them as dmar domain
callbacks as you replied in another mail.

Thanks
Kevin

* RE: [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table
  2019-09-25  6:50   ` Peter Xu
@ 2019-09-25  7:35     ` Tian, Kevin
  0 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  7:35 UTC (permalink / raw)
  To: Peter Xu, Lu Baolu
  Cc: Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu, linux-kernel,
	Alex Williamson, David Woodhouse, Sun, Yi Y

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 2:50 PM
> 
> On Mon, Sep 23, 2019 at 08:24:54PM +0800, Lu Baolu wrote:
> > +/*
> > + * Check and return whether first level is used by default for
> > + * DMA translation.
> > + */
> +static bool first_level_by_default(void)
> +{
> +	struct dmar_drhd_unit *drhd;
> +	struct intel_iommu *iommu;
> +	bool ret = true;
> +
> +	rcu_read_lock();
> +	for_each_active_iommu(iommu, drhd)
> +		if (!sm_supported(iommu) ||
> +		    !ecap_flts(iommu->ecap) ||
> +		    !cap_caching_mode(iommu->cap)) {
> +			ret = false;
> +			break;
> +		}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> 
> "If no caching mode, then we will not use 1st level."
> 
> Hmm, does the vIOMMU need to support caching-mode with the
> solution you proposed here?  Caching mode is only necessary for
> shadowing AFAICT, and after all you're going to use full-nested,
> then... then I would think it's not needed.  And if so, with this
> patch 1st level will be disabled. Sounds like a paradox...
> 
> I'm thinking what would be the big picture for this to work now: For
> the vIOMMU, instead of exposing the caching-mode, I'm thinking maybe
> we should expose it with ecap.FLTS=1 while we can keep ecap.SLTS=0
> then it's natural that we can only use 1st level translation in the
> guest for all the domains (and I assume such an ecap value should
> never happen on real hardware, am I right?).
> 

yes, that's also the picture in my mind. :-)

Thanks
Kevin

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  7:21           ` Tian, Kevin
@ 2019-09-25  7:45             ` Peter Xu
  2019-09-25  8:02               ` Tian, Kevin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  7:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 2:57 PM
> > 
> > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > Hi Kevin,
> > >
> > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > >       '-----------'
> > > > > > >       '-----------'
> > > > > > >
> > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > first goal? then what are other goals? I didn't spot such information.
> > > >
> > >
> > > The overall goal is to use IOMMU nested mode to avoid the shadow
> > > page table and VMEXIT when mapping a gIOVA. This includes the below
> > > 4 steps (maybe not accurate, but you could get the point.)
> > >
> > > 1) GIOVA mappings over 1st-level page table;
> > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > 3) using pIOMMU second level for GPA->HPA translation;
> > > 4) enable nested (a.k.a. dual stage) translation in host.
> > >
> > > This patch set aims to achieve 1).
> > 
> > Would it make sense to use 1st level even for bare-metal to replace
> > the 2nd level?
> > 
> > What I'm thinking is the DPDK apps - they have MMU page table already
> > there for the huge pages, then if they can use 1st level as the
> > default device page table then it even does not need to map, because
> > it can simply bind the process root page table pointer to the 1st
> > level page root pointer of the device contexts that it uses.
> > 
> 
> > Then you need to bear with possible page faults from using the CPU page
> > table, while most devices don't support that today.

Right, I was just thinking aloud.  After all, we don't have IOMMU
hardware that supports the 1st level yet (or am I wrong?)...  It's just
that when the 1st level is ready it should sound doable, because IIUC
PRI should always come with 1st level support, whether on the IOMMU
side or the device side?

I'm actually not sure whether my understanding here is
correct... I thought the pasid binding previously was only for some
vendor kernel drivers, not a general thing for userspace.  I feel
like that should be doable in the future once we've got some new
syscall interface ready to deliver the 1st level page table (e.g., via
vfio?); then applications like DPDK seem to be able to use that too,
even directly on bare metal.

Regards,

-- 
Peter Xu

* RE: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  7:45             ` Peter Xu
@ 2019-09-25  8:02               ` Tian, Kevin
  2019-09-25  8:52                 ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2019-09-25  8:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun,  Yi Y, David Woodhouse

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 3:45 PM
> 
> On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, September 25, 2019 2:57 PM
> > >
> > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > Hi Kevin,
> > > >
> > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > >       '-----------'
> > > > > > > >       '-----------'
> > > > > > > >
> > > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > > first goal? then what are other goals? I didn't spot such information.
> > > > >
> > > >
> > > > The overall goal is to use IOMMU nested mode to avoid the shadow
> > > > page table and VMEXIT when mapping a gIOVA. This includes the
> > > > below 4 steps (maybe not accurate, but you could get the point.)
> > > >
> > > > 1) GIOVA mappings over 1st-level page table;
> > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > >
> > > > This patch set aims to achieve 1).
> > >
> > > Would it make sense to use 1st level even for bare-metal to replace
> > > the 2nd level?
> > >
> > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > there for the huge pages, then if they can use 1st level as the
> > > default device page table then it even does not need to map, because
> > > it can simply bind the process root page table pointer to the 1st
> > > level page root pointer of the device contexts that it uses.
> > >
> >
> > Then you need to bear with possible page faults from using the CPU page
> > table, while most devices don't support that today.
> 
> Right, I was just thinking aloud.  After all neither do we have IOMMU
> hardware to support 1st level (or am I wrong?)...  It's just that when

You are right. Current VT-d supports only 2nd level.

> the 1st level is ready it should sound doable because IIUC PRI should
> be always with the 1st level support no matter on IOMMU side or the
> device side?

No. PRI is not tied to the 1st or 2nd level. Actually, from the device
p.o.v., it's just a protocol to trigger page faults; the device doesn't
care whether the page fault is on the 1st or 2nd level on the IOMMU side.
The only relevant part is that a PRI request can have a PASID tagged or
cleared. When it's tagged with a PASID, the IOMMU will locate the
translation table under the given PASID (either 1st or 2nd level is fine,
according to the PASID entry setting). When no PASID is included, the
IOMMU locates the translation from the default entry (e.g. PASID#0, or
any PASID contained in RID2PASID in VT-d).

Your knowledge happened to be correct in the deprecated ECS mode. At
that time, there was only one 2nd level per context entry, which didn't
support page faults, and only one 1st level per PASID entry, which did
support page faults. Then PRI could be indirectly connected to the 1st
level, but this has changed with the new scalable mode.

Another note is that the PRI capability only indicates that a device is
capable of handling page faults, not that a device can tolerate a page
fault on any of its DMA accesses. If the latter is false, using the CPU
page table for DPDK usage is still risky (and specific to device behavior).
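The table-selection behavior described above can be sketched in plain C;
the structures and helper names below are purely hypothetical models, not
the VT-d driver's actual types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of PASID-based table selection: PRI only delivers
 * the fault; which translation table serves the request depends solely
 * on what the (possibly default) PASID entry is programmed with. */
enum pgtt {
	PGTT_FIRST_LEVEL,	/* PASID entry points to a 1st-level table */
	PGTT_SECOND_LEVEL,	/* PASID entry points to a 2nd-level table */
};

struct pasid_entry {
	enum pgtt mode;		/* translation type set by software */
	unsigned long table;	/* root of the translation table */
};

#define NO_PASID (-1)		/* PRI request without a PASID TLP prefix */

static struct pasid_entry *
select_translation(struct pasid_entry *tbl, int npasid,
		   int default_pasid, int req_pasid)
{
	/* Untagged requests fall back to the default (RID2PASID) entry. */
	int pasid = (req_pasid == NO_PASID) ? default_pasid : req_pasid;

	if (pasid < 0 || pasid >= npasid)
		return NULL;	/* invalid PASID: report a fault instead */
	return &tbl[pasid];
}
```

Either translation type can sit behind any PASID entry; the requester
never sees which one was chosen.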

> 
> I'm actually not sure about whether my understanding here is
> correct... I thought the pasid binding previously was only for some
> vendor kernel drivers but not a general thing to userspace.  I feel
> like that should be doable in the future once we've got some new
> syscall interface ready to deliver 1st level page table (e.g., via
> vfio?) then applications like DPDK seems to be able to use that too
> even directly via bare metal.
> 

using the 1st level for userspace is different from supporting DMA page
faults in userspace. The former is purely about which structure keeps
the mapping. I think we may do the same thing for both bare
metal and guest (using the 2nd level only for GPA when nested is enabled
on the IOMMU). But reusing the CPU page table for userspace is more
tricky. :-)

Thanks
Kevin

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  7:32               ` Tian, Kevin
@ 2019-09-25  8:35                 ` Peter Xu
  2019-09-26  1:42                 ` Lu Baolu
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Xu @ 2019-09-25  8:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

On Wed, Sep 25, 2019 at 07:32:48AM +0000, Tian, Kevin wrote:
> > From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> > Sent: Wednesday, September 25, 2019 2:52 PM
> > 
> > Hi Peter and Kevin,
> > 
> > On 9/25/19 1:24 PM, Peter Xu wrote:
> > > On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
> > >>> From: Peter Xu [mailto:peterx@redhat.com]
> > >>> Sent: Wednesday, September 25, 2019 12:31 PM
> > >>>
> > >>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > >>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > >>>>>
> > >>>>> Maybe think of a different name..? mmmap seems a bit weird :-)
> > >>>>
> > >>>> Yes. I don't like it either. I've thought about it and haven't
> > >>>> figured out a satisfying one. Do you have any suggestions?
> > >>>
> > >>> How about at least split the word using "_"?  Like "mm_map", then
> > >>> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> > >>> misread as mmap() which is totally irrelevant to this...
> > >>>
> > >>
> > >> what is the point of keeping 'mm' here? replace it with 'iommu'?
> > >
> > > I'm not sure of what Baolu thought, but to me "mm" makes sense itself
> > > to identify this from real IOMMU page tables (because IIUC these will
> > > be MMU page tables).  We can come up with better names, but IMHO
> > > "iommu" can be a bit misleading to let people refer to the 2nd level
> > > page table.
> > 
> > "mm" represents a CPU (first level) page table;
> > 
> > vs.
> > 
> > "io" represents an IOMMU (second level) page table.
> > 
> 
> IOMMU first level is not equivalent to CPU page table, though you can
> use the latter as the first level (e.g. in SVA). Especially here you are
> making IOVA->GPA as the first level, which is not CPU page table.
> 
> btw both levels are for "io" i.e. DMA purposes from VT-d p.o.v. They
> are just hierarchical structures implemented by VT-d, with slightly
> different format.

Regarding the "slightly different format", do you mean the
extended-accessed bit?

Even if there are differences, they do look very similar.  If you look at
the same chapter 9.7 table, the elements are exactly called PTE, PDE,
PDPE, and so on - they're named exactly the same as in MMU page tables.
With that, IMHO it still sounds reasonable if we want to relate this
"1st level IOMMU page table" to the existing MMU page table using
the "mm" prefix...

Regards,

-- 
Peter Xu

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  8:02               ` Tian, Kevin
@ 2019-09-25  8:52                 ` Peter Xu
  2019-09-26  1:37                   ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-25  8:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

On Wed, Sep 25, 2019 at 08:02:23AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 3:45 PM
> > 
> > On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, September 25, 2019 2:57 PM
> > > >
> > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > > Hi Kevin,
> > > > >
> > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > > >       '-----------'
> > > > > > > > >       '-----------'
> > > > > > > > >
> > > > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > > > first goal? then what are other goals? I didn't spot such information.
> > > > > >
> > > > >
> > > > > The overall goal is to use IOMMU nested mode to avoid the shadow
> > > > > page table and VMEXIT when mapping a gIOVA. This includes the
> > > > > below 4 steps (maybe not accurate, but you could get the point.)
> > > > >
> > > > > 1) GIOVA mappings over 1st-level page table;
> > > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > > >
> > > > > This patch set aims to achieve 1).
> > > >
> > > > Would it make sense to use 1st level even for bare-metal to replace
> > > > the 2nd level?
> > > >
> > > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > > there for the huge pages, then if they can use 1st level as the
> > > > default device page table then it even does not need to map, because
> > > > it can simply bind the process root page table pointer to the 1st
> > > > level page root pointer of the device contexts that it uses.
> > > >
> > >
> > > Then you need to bear with possible page faults from using the CPU page
> > > table, while most devices don't support that today.
> > 
> > Right, I was just thinking aloud.  After all neither do we have IOMMU
> > hardware to support 1st level (or am I wrong?)...  It's just that when
> 
> You are right. Current VT-d supports only 2nd level.
> 
> > the 1st level is ready it should sound doable because IIUC PRI should
> > be always with the 1st level support no matter on IOMMU side or the
> > device side?
> 
> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
> just a protocol to trigger page fault, but the device doesn't care whether
> the page fault is on 1st or 2nd level in the IOMMU side. The only
> relevant part is that a PRI request can have PASID tagged or cleared.
> When it's tagged with PASID, the IOMMU will locate the translation
> table under the given PASID (either 1st or 2nd level is fine, according
> to PASID entry setting). When no PASID is included, the IOMMU locates
> the translation from default entry (e.g. PASID#0 or any PASID contained
> in RID2PASID in VT-d).
> 
> Your knowledge happened to be correct in deprecated ECS mode. At
> that time, there is only one 2nd level per context entry which doesn't
> support page fault, and there is only one 1st level per PASID entry which
> supports page fault. Then PRI could be indirectly connected to 1st level,
> but this just changed with new scalable mode.
> 
> Another note is that the PRI capability only indicates that a device is
> capable of handling page faults, but not that a device can tolerate
> page fault on any of its DMA accesses. If the latter is false, using the CPU
> page table for DPDK usage is still risky (and specific to device behavior).
> 
> > 
> > I'm actually not sure about whether my understanding here is
> > correct... I thought the pasid binding previously was only for some
> > vendor kernel drivers but not a general thing to userspace.  I feel
> > like that should be doable in the future once we've got some new
> > syscall interface ready to deliver 1st level page table (e.g., via
> > vfio?) then applications like DPDK seems to be able to use that too
> > even directly via bare metal.
> > 
> 
> using 1st level for userspace is different from supporting DMA page
> fault in userspace. The former is purely about which structure to
> keep the mapping. I think we may do the same thing for both bare
> metal and guest (using 2nd level only for GPA when nested is enabled
> on the IOMMU). But reusing CPU page table for userspace is more
> tricky. :-)

Yes, I must have mixed up the 1st level page table and PRI a bit, and
after all my initial question is irrelevant to this series anyway, so
it's already a bit off topic (sorry for that).

And, thanks for explaining these. :)

-- 
Peter Xu

* Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
  2019-09-25  8:52                 ` Peter Xu
@ 2019-09-26  1:37                   ` Lu Baolu
  0 siblings, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-26  1:37 UTC (permalink / raw)
  To: Peter Xu, Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

Hi Peter,

On 9/25/19 4:52 PM, Peter Xu wrote:
> On Wed, Sep 25, 2019 at 08:02:23AM +0000, Tian, Kevin wrote:
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, September 25, 2019 3:45 PM
>>>
>>> On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>> Sent: Wednesday, September 25, 2019 2:57 PM
>>>>>
>>>>> On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
>>>>>> Hi Kevin,
>>>>>>
>>>>>> On 9/24/19 3:00 PM, Tian, Kevin wrote:
>>>>>>>>>>        '-----------'
>>>>>>>>>>        '-----------'
>>>>>>>>>>
>>>>>>>>>> This patch series only aims to achieve the first goal, a.k.a using
>>>>>>> first goal? then what are other goals? I didn't spot such information.
>>>>>>>
>>>>>>
>>>>>> The overall goal is to use IOMMU nested mode to avoid the shadow
>>>>>> page table and VMEXIT when mapping a gIOVA. This includes the
>>>>>> below 4 steps (maybe not accurate, but you could get the point.)
>>>>>>
>>>>>> 1) GIOVA mappings over 1st-level page table;
>>>>>> 2) binding vIOMMU 1st level page table to the pIOMMU;
>>>>>> 3) using pIOMMU second level for GPA->HPA translation;
>>>>>> 4) enable nested (a.k.a. dual stage) translation in host.
>>>>>>
>>>>>> This patch set aims to achieve 1).
>>>>>
>>>>> Would it make sense to use 1st level even for bare-metal to replace
>>>>> the 2nd level?
>>>>>
>>>>> What I'm thinking is the DPDK apps - they have MMU page table already
>>>>> there for the huge pages, then if they can use 1st level as the
>>>>> default device page table then it even does not need to map, because
>>>>> it can simply bind the process root page table pointer to the 1st
>>>>> level page root pointer of the device contexts that it uses.
>>>>>
>>>>
>>>> Then you need to bear with possible page faults from using the CPU page
>>>> table, while most devices don't support that today.
>>>
>>> Right, I was just thinking aloud.  After all neither do we have IOMMU
>>> hardware to support 1st level (or am I wrong?)...  It's just that when
>>
>> You are right. Current VT-d supports only 2nd level.
>>
>>> the 1st level is ready it should sound doable because IIUC PRI should
>>> be always with the 1st level support no matter on IOMMU side or the
>>> device side?
>>
>> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
>> just a protocol to trigger page fault, but the device doesn't care whether
>> the page fault is on 1st or 2nd level in the IOMMU side. The only
>> relevant part is that a PRI request can have PASID tagged or cleared.
>> When it's tagged with PASID, the IOMMU will locate the translation
>> table under the given PASID (either 1st or 2nd level is fine, according
>> to PASID entry setting). When no PASID is included, the IOMMU locates
>> the translation from default entry (e.g. PASID#0 or any PASID contained
>> in RID2PASID in VT-d).
>>
>> Your knowledge happened to be correct in deprecated ECS mode. At
>> that time, there is only one 2nd level per context entry which doesn't
>> support page fault, and there is only one 1st level per PASID entry which
>> supports page fault. Then PRI could be indirectly connected to 1st level,
>> but this just changed with new scalable mode.
>>
>> Another note is that the PRI capability only indicates that a device is
>> capable of handling page faults, but not that a device can tolerate
>> page fault on any of its DMA accesses. If the latter is false, using the CPU
>> page table for DPDK usage is still risky (and specific to device behavior).
>>
>>>
>>> I'm actually not sure about whether my understanding here is
>>> correct... I thought the pasid binding previously was only for some
>>> vendor kernel drivers but not a general thing to userspace.  I feel
>>> like that should be doable in the future once we've got some new
>>> syscall interface ready to deliver 1st level page table (e.g., via
>>> vfio?) then applications like DPDK seems to be able to use that too
>>> even directly via bare metal.
>>>
>>
>> using 1st level for userspace is different from supporting DMA page
>> fault in userspace. The former is purely about which structure to
>> keep the mapping. I think we may do the same thing for both bare
>> metal and guest (using 2nd level only for GPA when nested is enabled
>> on the IOMMU). But reusing CPU page table for userspace is more
>> tricky. :-)
> 
> Yes I should have mixed up the 1st level page table and PRI a bit, and
> after all my initial question should be irrelevant to this series as
> well so it's already a bit out of topic (sorry for that).

Never mind. Good discussion. :-)

Actually I have a plan to use the 1st level on bare metal as well. Just
looking forward to more motivations and use cases.

> 
> And, thanks for explaining these. :)
> 

Thanks for Kevin's explanation. :-)

Best regards,
Baolu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  7:32               ` Tian, Kevin
  2019-09-25  8:35                 ` Peter Xu
@ 2019-09-26  1:42                 ` Lu Baolu
  1 sibling, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-09-26  1:42 UTC (permalink / raw)
  To: Tian, Kevin, Peter Xu
  Cc: Alex Williamson, Yi Sun, Raj, Ashok, kvm, Kumar, Sanjay K, iommu,
	linux-kernel, Sun, Yi Y, David Woodhouse

Hi Kevin,

On 9/25/19 3:32 PM, Tian, Kevin wrote:
>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>> Sent: Wednesday, September 25, 2019 2:52 PM
>>
>> Hi Peter and Kevin,
>>
>> On 9/25/19 1:24 PM, Peter Xu wrote:
>>> On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>> Sent: Wednesday, September 25, 2019 12:31 PM
>>>>>
>>>>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
>>>>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
>>>>>>>
>>>>>>> Maybe think of a different name..? mmmap seems a bit weird :-)
>>>>>>
>>>>>> Yes. I don't like it either. I've thought about it and haven't
>>>>>> figured out a satisfied one. Do you have any suggestions?
>>>>>
>>>>> How about at least split the word using "_"?  Like "mm_map", then
>>>>> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
>>>>> misread as mmap() which is totally irrelevant to this...
>>>>>
>>>>
>>>> what is the point of keeping 'mm' here? replace it with 'iommu'?
>>>
>>> I'm not sure of what Baolu thought, but to me "mm" makes sense itself
>>> to identify this from real IOMMU page tables (because IIUC these will
>>> be MMU page tables).  We can come up with better names, but IMHO
>>> "iommu" can be a bit misleading to let people refer to the 2nd level
>>> page table.
>>
>> "mm" represents a CPU (first level) page table;
>>
>> vs.
>>
>> "io" represents an IOMMU (second level) page table.
>>
> 
> The IOMMU first level is not equivalent to the CPU page table, though you
> can use the latter as the first level (e.g. in SVA). Especially here you
> are using the first level for IOVA->GPA, which is not a CPU page table.
> 
> btw both levels are for "io", i.e. DMA purposes, from the VT-d p.o.v. They
> are just hierarchical structures implemented by VT-d, with slightly
> different formats. The specification doesn't limit what you use them for.
> In a hypothetical case, an IOMMU may implement exactly the same CPU page
> table format and support page faults for both levels. Then you can even
> link the CPU page table to the 2nd level for sure.

Fair enough. A good conceptual gap fix.

> 
> Maybe we just name it from the VT-d context, e.g. intel_map_first_level_range,
> intel_map_second_level_range, and then register them as dmar domain
> callbacks as you replied in another mail.

Yes. Make sense.

> 
> Thanks
> Kevin
> 

Best regards,
Baolu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-25  5:21   ` Peter Xu
@ 2019-09-26  2:35     ` Lu Baolu
  2019-09-26  3:49       ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-26  2:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

Hi Peter,

Thanks for reviewing my code.

On 9/25/19 1:21 PM, Peter Xu wrote:
> On Mon, Sep 23, 2019 at 08:24:52PM +0800, Lu Baolu wrote:
>> This adds functions to manipulate first level page tables
>> which could be used by a scalable mode capable IOMMU unit.
>>
>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
>>   - Map an iova range of [addr, end) to the physical memory
>>     started at @phys_addr with the @prot permissions.
>>
>> intel_mmunmap_range(domain, addr, end)
>>   - Tear down the map of an iova range [addr, end). A page
>>     list will be returned which will be freed after iotlb
>>     flushing.
>>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Liu Yi L <yi.l.liu@intel.com>
>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> ---
>>   drivers/iommu/Makefile             |   2 +-
>>   drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
>>   include/linux/intel-iommu.h        |  24 +-
>>   include/trace/events/intel_iommu.h |  60 +++++
>>   4 files changed, 426 insertions(+), 2 deletions(-)
>>   create mode 100644 drivers/iommu/intel-pgtable.c
>>
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 4f405f926e73..dc550e14cc58 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -17,7 +17,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
>>   obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
>>   obj-$(CONFIG_DMAR_TABLE) += dmar.o
>>   obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
>> -obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
>> +obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
>>   obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
>>   obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
>>   obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
>> diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
>> new file mode 100644
>> index 000000000000..8e95978cd381
>> --- /dev/null
>> +++ b/drivers/iommu/intel-pgtable.c
>> @@ -0,0 +1,342 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/**
>> + * intel-pgtable.c - Intel IOMMU page table manipulation library
> 
> Could this be a bit misleading?  Normally I'll use "IOMMU page table"
> to refer to the 2nd level page table only, and I'm always
> understanding it as "the new IOMMU will understand MMU page table as
> the 1st level".  At least mention "IOMMU 1st level page table"?
> 

This file is a placeholder for all code that manipulates IOMMU page
tables (both first level and second level). Instead of putting
everything in intel_iommu.c, let's make the code more structured so that
it's easier to read and maintain. This is the motivation of this
file.

>> + *
>> + * Copyright (C) 2019 Intel Corporation
>> + *
>> + * Author: Lu Baolu <baolu.lu@linux.intel.com>
>> + */
>> +
>> +#define pr_fmt(fmt)     "DMAR: " fmt
>> +#include <linux/vmalloc.h>
>> +#include <linux/mm.h>
>> +#include <linux/sched.h>
>> +#include <linux/io.h>
>> +#include <linux/export.h>
>> +#include <linux/intel-iommu.h>
>> +#include <asm/cacheflush.h>
>> +#include <asm/pgtable.h>
>> +#include <asm/pgalloc.h>
>> +#include <trace/events/intel_iommu.h>
>> +
>> +#ifdef CONFIG_X86
>> +/*
>> + * mmmap: Map a range of IO virtual address to physical addresses.
> 
> "... to physical addresses using MMU page table"?
> 
> Might be clearer?

Yes.

> 
>> + */
>> +#define pgtable_populate(domain, nm)					\
>> +do {									\
>> +	void *__new = alloc_pgtable_page(domain->nid);			\
>> +	if (!__new)							\
>> +		return -ENOMEM;						\
>> +	smp_wmb();							\
> 
> Could I ask what's this wmb used for?

Sure. This is answered by a comment in __pte_alloc() in mm/memory.c. Let
me post it here.

         /*
          * Ensure all pte setup (eg. pte page lock and page clearing) are
          * visible before the pte is made visible to other CPUs by being
          * put into page tables.
          *
          * The other side of the story is the pointer chasing in the page
          * table walking code (when walking the page table without locking;
          * ie. most of the time). Fortunately, these data accesses consist
          * of a chain of data-dependent loads, meaning most CPUs (alpha
          * being the notable exception) will already guarantee loads are
          * seen in-order. See the alpha page table accessors for the
          * smp_read_barrier_depends() barriers in page table walking code.
          */
         smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

> 
>> +	spin_lock(&(domain)->page_table_lock);				\
> 
> Is this intended to lock here instead of taking the lock during the
> whole page table walk?  Is it safe?
> 
> Taking the example where nm==PTE: when we reach here how do we
> guarantee that the PMD page that has this PTE is still valid?

We will always keep the non-leaf pages in the table, hence we only need
a spin lock to serialize concurrent attempts to populate a pde entry.
As for ptes, we can assume only a single thread accesses them at a
time, because different mappings use different IOVAs.

> 
>> +	if (nm ## _present(*nm)) {					\
>> +		free_pgtable_page(__new);				\
>> +	} else {							\
>> +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
> 
> It seems to me that PV could trap calls to set_pte().  Then these
> could also be trapped by e.g. Xen?  Are these traps needed?  Is there
> side effect?  I'm totally not familiar with this, but just ask aloud...

Good catch. But I don't think a vIOMMU could get a chance to run in a PV
environment. Am I missing something?

> 
>> +		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
>> +	}								\
>> +	spin_unlock(&(domain)->page_table_lock);			\
>> +} while(0);
>> +
>> +static int
>> +mmmap_pte_range(struct dmar_domain *domain, pmd_t *pmd, unsigned long addr,
>> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
>> +{
>> +	pte_t *pte, *first_pte;
>> +	u64 pfn;
>> +
>> +	pfn = phys_addr >> PAGE_SHIFT;
>> +	if (unlikely(pmd_none(*pmd)))
>> +		pgtable_populate(domain, pmd);
>> +
>> +	first_pte = pte = pte_offset_kernel(pmd, addr);
>> +
>> +	do {
>> +		set_pte(pte, pfn_pte(pfn, prot));
>> +		pfn++;
>> +	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +
>> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +mmmap_pmd_range(struct dmar_domain *domain, pud_t *pud, unsigned long addr,
>> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
>> +{
>> +	unsigned long next;
>> +	pmd_t *pmd;
>> +
>> +	if (unlikely(pud_none(*pud)))
>> +		pgtable_populate(domain, pud);
>> +	pmd = pmd_offset(pud, addr);
>> +
>> +	phys_addr -= addr;
>> +	do {
>> +		next = pmd_addr_end(addr, end);
>> +		if (mmmap_pte_range(domain, pmd, addr, next,
>> +				    phys_addr + addr, prot))
>> +			return -ENOMEM;
> 
> How about return the errcode of mmmap_pte_range() directly?

Yes. Fair enough.

> 
>> +	} while (pmd++, addr = next, addr != end);
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +mmmap_pud_range(struct dmar_domain *domain, p4d_t *p4d, unsigned long addr,
>> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
>> +{
>> +	unsigned long next;
>> +	pud_t *pud;
>> +
>> +	if (unlikely(p4d_none(*p4d)))
>> +		pgtable_populate(domain, p4d);
>> +
>> +	pud = pud_offset(p4d, addr);
>> +
>> +	phys_addr -= addr;
>> +	do {
>> +		next = pud_addr_end(addr, end);
>> +		if (mmmap_pmd_range(domain, pud, addr, next,
>> +				    phys_addr + addr, prot))
>> +			return -ENOMEM;
> 
> Same.
> 
>> +	} while (pud++, addr = next, addr != end);
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +mmmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd, unsigned long addr,
>> +		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
>> +{
>> +	unsigned long next;
>> +	p4d_t *p4d;
>> +
>> +	if (cpu_feature_enabled(X86_FEATURE_LA57) && unlikely(pgd_none(*pgd)))
>> +		pgtable_populate(domain, pgd);
>> +
>> +	p4d = p4d_offset(pgd, addr);
>> +
>> +	phys_addr -= addr;
>> +	do {
>> +		next = p4d_addr_end(addr, end);
>> +		if (mmmap_pud_range(domain, p4d, addr, next,
>> +				    phys_addr + addr, prot))
>> +			return -ENOMEM;
> 
> Same.
> 
>> +	} while (p4d++, addr = next, addr != end);
>> +
>> +	return 0;
>> +}
>> +
>> +int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
>> +		      unsigned long end, phys_addr_t phys_addr, int dma_prot)
>> +{
>> +	unsigned long next;
>> +	pgprot_t prot;
>> +	pgd_t *pgd;
>> +
>> +	trace_domain_mm_map(domain, addr, end, phys_addr);
>> +
>> +	/*
>> +	 * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
>> +	 * for a pte that requires write operation.
>> +	 */
>> +	prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
>> +	BUG_ON(addr >= end);
>> +
>> +	phys_addr -= addr;
>> +	pgd = pgd_offset_pgd(domain->pgd, addr);
>> +	do {
>> +		next = pgd_addr_end(addr, end);
>> +		if (mmmap_p4d_range(domain, pgd, addr, next,
>> +				    phys_addr + addr, prot))
>> +			return -ENOMEM;
> 
> Same.
> 
>> +	} while (pgd++, addr = next, addr != end);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * mmunmap: Unmap an existing mapping between a range of IO virtual address
>> + *          and physical addresses.
>> + */
>> +static struct page *
>> +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
>> +		  unsigned long addr, unsigned long end,
>> +		  struct page *freelist, bool reclaim)
>> +{
>> +	int i;
>> +	unsigned long start;
>> +	pte_t *pte, *first_pte;
>> +
>> +	start = addr;
>> +	pte = pte_offset_kernel(pmd, addr);
>> +	first_pte = pte;
>> +	do {
>> +		set_pte(pte, __pte(0));
>> +	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +
>> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
>> +
>> +	/* Add page to free list if all entries are empty. */
>> +	if (reclaim) {
> 
> Shouldn't we know whether to reclaim if with (addr, end) specified as
> long as they cover the whole range of this PMD?

Current policy is that we don't reclaim any pages until the whole page
table is torn down. The gain is that we no longer have to take a
spinlock when mapping/unmapping a pmd entry.

> 
> Also I noticed that this value right now is passed in from the very
> top of the unmap() call.  I didn't really catch the point of that...
> 

The caller decides whether to reclaim the pages. The current policy is
to reclaim pages only when tearing down a page table.

> I'll have similar question to below a few places but I'll skip to
> comment after I understand this one.
> 
>> +		struct page *pte_page;
>> +
>> +		pte = (pte_t *)pmd_page_vaddr(*pmd);
>> +		for (i = 0; i < PTRS_PER_PTE; i++)
>> +			if (!pte || !pte_none(pte[i]))
>> +				goto pte_out;
>> +
>> +		pte_page = pmd_page(*pmd);
>> +		pte_page->freelist = freelist;
>> +		freelist = pte_page;
>> +		pmd_clear(pmd);
>> +		domain_flush_cache(domain, pmd, sizeof(pmd_t));
>> +	}
>> +
>> +pte_out:
>> +	return freelist;
>> +}
> 
> Regards,
> 

Best regards,
Baolu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-26  2:35     ` Lu Baolu
@ 2019-09-26  3:49       ` Peter Xu
  2019-09-27  2:27         ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-26  3:49 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

On Thu, Sep 26, 2019 at 10:35:24AM +0800, Lu Baolu wrote:

[...]

> > > @@ -0,0 +1,342 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/**
> > > + * intel-pgtable.c - Intel IOMMU page table manipulation library
> > 
> > Could this be a bit misleading?  Normally I'll use "IOMMU page table"
> > to refer to the 2nd level page table only, and I'm always
> > understanding it as "the new IOMMU will understand MMU page table as
> > the 1st level".  At least mention "IOMMU 1st level page table"?
> > 
> 
> This file is a placeholder for all code that manipulates IOMMU page
> tables (both first level and second level). Instead of putting
> everything in intel-iommu.c, let's keep the code structured so that
> it's easier to read and maintain. That is the motivation for this
> file.

I see.

> 
> > > + *
> > > + * Copyright (C) 2019 Intel Corporation
> > > + *
> > > + * Author: Lu Baolu <baolu.lu@linux.intel.com>
> > > + */
> > > +
> > > +#define pr_fmt(fmt)     "DMAR: " fmt
> > > +#include <linux/vmalloc.h>
> > > +#include <linux/mm.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/io.h>
> > > +#include <linux/export.h>
> > > +#include <linux/intel-iommu.h>
> > > +#include <asm/cacheflush.h>
> > > +#include <asm/pgtable.h>
> > > +#include <asm/pgalloc.h>
> > > +#include <trace/events/intel_iommu.h>
> > > +
> > > +#ifdef CONFIG_X86
> > > +/*
> > > + * mmmap: Map a range of IO virtual address to physical addresses.
> > 
> > "... to physical addresses using MMU page table"?
> > 
> > Might be clearer?
> 
> Yes.
> 
> > 
> > > + */
> > > +#define pgtable_populate(domain, nm)					\
> > > +do {									\
> > > +	void *__new = alloc_pgtable_page(domain->nid);			\
> > > +	if (!__new)							\
> > > +		return -ENOMEM;						\
> > > +	smp_wmb();							\
> > 
> > Could I ask what's this wmb used for?
> 
> Sure. This is answered by a comment in __pte_alloc() in mm/memory.c. Let
> me post it here.
> 
>         /*
>          * Ensure all pte setup (eg. pte page lock and page clearing) are
>          * visible before the pte is made visible to other CPUs by being
>          * put into page tables.
>          *
>          * The other side of the story is the pointer chasing in the page
>          * table walking code (when walking the page table without locking;
>          * ie. most of the time). Fortunately, these data accesses consist
>          * of a chain of data-dependent loads, meaning most CPUs (alpha
>          * being the notable exception) will already guarantee loads are
>          * seen in-order. See the alpha page table accessors for the
>          * smp_read_barrier_depends() barriers in page table walking code.
>          */
>         smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

Ok.  I don't fully understand the rationale behind it, but the comment
seems to make sense...  Could you add a comment above that line, like
"please refer to the comment in __pte_alloc()"?

> 
> > 
> > > +	spin_lock(&(domain)->page_table_lock);				\
> > 
> > Is this intended to lock here instead of taking the lock during the
> > whole page table walk?  Is it safe?
> > 
> > Taking the example where nm==PTE: when we reach here how do we
> > guarantee that the PMD page that has this PTE is still valid?
> 
> We will always keep the non-leaf pages in the table,

I see.  Though, could I ask why?  It seems to me that the existing 2nd
level page table does not keep these on unmap, and it doesn't even use
locking at all, by leveraging cmpxchg()?

> hence we only need
> a spin lock to serialize concurrent attempts to populate a pde entry.
> As for ptes, we can assume only a single thread accesses them at a
> time, because different mappings use different IOVAs.

Ah yes, sorry, nm will never be pte here... so do you mean the upper
layer, e.g., the iova allocator, will make sure the ranges to be mapped
never overlap with each other, so setting PTEs does not need a lock?

> 
> > 
> > > +	if (nm ## _present(*nm)) {					\
> > > +		free_pgtable_page(__new);				\
> > > +	} else {							\
> > > +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
> > 
> > It seems to me that PV could trap calls to set_pte().  Then these
> > could also be trapped by e.g. Xen?  Are these traps needed?  Is there
> > side effect?  I'm totally not familiar with this, but just ask aloud...
> 
> Good catch. But I don't think a vIOMMU could get a chance to run in a PV
> environment. Am I missing something?

I don't know... Is there reason to not allow a Xen guest to use 1st
level mapping?

While on the other side... If the PV interface will never be used,
then could native_set_##nm() be used directly?

[...]

> > > +static struct page *
> > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> > > +		  unsigned long addr, unsigned long end,
> > > +		  struct page *freelist, bool reclaim)
> > > +{
> > > +	int i;
> > > +	unsigned long start;
> > > +	pte_t *pte, *first_pte;
> > > +
> > > +	start = addr;
> > > +	pte = pte_offset_kernel(pmd, addr);
> > > +	first_pte = pte;
> > > +	do {
> > > +		set_pte(pte, __pte(0));
> > > +	} while (pte++, addr += PAGE_SIZE, addr != end);
> > > +
> > > +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> > > +
> > > +	/* Add page to free list if all entries are empty. */
> > > +	if (reclaim) {
> > 
> > Shouldn't we know whether to reclaim if with (addr, end) specified as
> > long as they cover the whole range of this PMD?
> 
> Current policy is that we don't reclaim any pages until the whole page
> table will be torn down.

Ah OK.  But I saw that you're passing in reclaim==!start_addr.
Shouldn't that erroneously trigger if one wants to unmap the 1st page
as well, even if not the whole address space?

> The gain is that we don't have to use a
> spinlock when map/unmap a pmd entry anymore.

So this question should also related to above on the locking - have
you thought about using the same way (IIUC) as the 2nd level page
table to use cmpxchg()?  AFAIU that does not need any lock?

For me it's perfectly fine to use a lock at least for initial version,
I just want to know the considerations behind in case I missed
anything important.

Thanks,

-- 
Peter Xu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-26  3:49       ` Peter Xu
@ 2019-09-27  2:27         ` Lu Baolu
  2019-09-27  5:34           ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-27  2:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

Hi Peter,

On 9/26/19 11:49 AM, Peter Xu wrote:
> On Thu, Sep 26, 2019 at 10:35:24AM +0800, Lu Baolu wrote:
> 
> [...]
> 
>>>> @@ -0,0 +1,342 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/**
>>>> + * intel-pgtable.c - Intel IOMMU page table manipulation library
>>>
>>> Could this be a bit misleading?  Normally I'll use "IOMMU page table"
>>> to refer to the 2nd level page table only, and I'm always
>>> understanding it as "the new IOMMU will understand MMU page table as
>>> the 1st level".  At least mention "IOMMU 1st level page table"?
>>>
>>
>> This file is a placeholder for all code that manipulates IOMMU page
>> tables (both first level and second level). Instead of putting
>> everything in intel-iommu.c, let's keep the code structured so that
>> it's easier to read and maintain. That is the motivation for this
>> file.
> 
> I see.
> 
>>
>>>> + *
>>>> + * Copyright (C) 2019 Intel Corporation
>>>> + *
>>>> + * Author: Lu Baolu <baolu.lu@linux.intel.com>
>>>> + */
>>>> +
>>>> +#define pr_fmt(fmt)     "DMAR: " fmt
>>>> +#include <linux/vmalloc.h>
>>>> +#include <linux/mm.h>
>>>> +#include <linux/sched.h>
>>>> +#include <linux/io.h>
>>>> +#include <linux/export.h>
>>>> +#include <linux/intel-iommu.h>
>>>> +#include <asm/cacheflush.h>
>>>> +#include <asm/pgtable.h>
>>>> +#include <asm/pgalloc.h>
>>>> +#include <trace/events/intel_iommu.h>
>>>> +
>>>> +#ifdef CONFIG_X86
>>>> +/*
>>>> + * mmmap: Map a range of IO virtual address to physical addresses.
>>>
>>> "... to physical addresses using MMU page table"?
>>>
>>> Might be clearer?
>>
>> Yes.
>>
>>>
>>>> + */
>>>> +#define pgtable_populate(domain, nm)					\
>>>> +do {									\
>>>> +	void *__new = alloc_pgtable_page(domain->nid);			\
>>>> +	if (!__new)							\
>>>> +		return -ENOMEM;						\
>>>> +	smp_wmb();							\
>>>
>>> Could I ask what's this wmb used for?
>>
>> Sure. This is answered by a comment in __pte_alloc() in mm/memory.c. Let
>> me post it here.
>>
>>          /*
>>           * Ensure all pte setup (eg. pte page lock and page clearing) are
>>           * visible before the pte is made visible to other CPUs by being
>>           * put into page tables.
>>           *
>>           * The other side of the story is the pointer chasing in the page
>>           * table walking code (when walking the page table without locking;
>>           * ie. most of the time). Fortunately, these data accesses consist
>>           * of a chain of data-dependent loads, meaning most CPUs (alpha
>>           * being the notable exception) will already guarantee loads are
>>           * seen in-order. See the alpha page table accessors for the
>>           * smp_read_barrier_depends() barriers in page table walking code.
>>           */
>>          smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> 
> Ok.  I don't fully understand the rationale behind it, but the comment
> seems to make sense...  Could you add a comment above that line, like
> "please refer to the comment in __pte_alloc()"?

Yes.

> 
>>
>>>
>>>> +	spin_lock(&(domain)->page_table_lock);				\
>>>
>>> Is this intended to lock here instead of taking the lock during the
>>> whole page table walk?  Is it safe?
>>>
>>> Taking the example where nm==PTE: when we reach here how do we
>>> guarantee that the PMD page that has this PTE is still valid?
>>
>> We will always keep the non-leaf pages in the table,
> 
> I see.  Though, could I ask why?  It seems to me that the existing 2nd
> level page table does not keep these on unmap, and it doesn't even use
> locking at all, by leveraging cmpxchg()?

I still need some time to understand how cmpxchg() solves the race issue
when reclaiming pages. For example:

Thread A				Thread B
-A1: check all PTE's empty		-B1: up-level PDE valid
-A2: clear the up-level PDE
-A3: reclaim the page			-B2: populate the PTEs

Both (A1,A2) and (B1,B2) should be atomic. Otherwise, a race could happen.

Actually, the iova allocator always packs IOVA ranges close to the top
of the address space. This results in requiring a minimal number of
pages to map the allocated IOVA ranges, which makes memory consumption
by IOMMU page tables tolerable. Hence, we don't need to reclaim the
pages until the whole page table is about to be torn down. The real
data on my test machine also confirms this.

> 
>> hence we only need
>> a spin lock to serialize concurrent attempts to populate a pde entry.
>> As for ptes, we can assume only a single thread accesses them at a
>> time, because different mappings use different IOVAs.
> 
> Ah yes, sorry, nm will never be pte here... so do you mean the upper
> layer, e.g., the iova allocator, will make sure the ranges to be mapped
> never overlap with each other, so setting PTEs does not need a lock?

Yes.

> 
>>
>>>
>>>> +	if (nm ## _present(*nm)) {					\
>>>> +		free_pgtable_page(__new);				\
>>>> +	} else {							\
>>>> +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
>>>
>>> It seems to me that PV could trap calls to set_pte().  Then these
>>> could also be trapped by e.g. Xen?  Are these traps needed?  Is there
>>> side effect?  I'm totally not familiar with this, but just ask aloud...
>>
>> Good catch. But I don't think a vIOMMU could get a chance to run in a PV
>> environment. Am I missing something?
> 
> I don't know... Is there reason to not allow a Xen guest to use 1st
> level mapping?

I was thinking that a PV driver would be used in a PV environment, so
the vIOMMU driver (which is for full virtualization) would never get a
chance to run in a PV environment.

> 
> While on the other side... If the PV interface will never be used,
> then could native_set_##nm() be used directly?

But, anyway, as you pointed out, native_set_##nm() looks better.

> 
> [...]
> 
>>>> +static struct page *
>>>> +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
>>>> +		  unsigned long addr, unsigned long end,
>>>> +		  struct page *freelist, bool reclaim)
>>>> +{
>>>> +	int i;
>>>> +	unsigned long start;
>>>> +	pte_t *pte, *first_pte;
>>>> +
>>>> +	start = addr;
>>>> +	pte = pte_offset_kernel(pmd, addr);
>>>> +	first_pte = pte;
>>>> +	do {
>>>> +		set_pte(pte, __pte(0));
>>>> +	} while (pte++, addr += PAGE_SIZE, addr != end);
>>>> +
>>>> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
>>>> +
>>>> +	/* Add page to free list if all entries are empty. */
>>>> +	if (reclaim) {
>>>
>>> Shouldn't we know whether to reclaim if with (addr, end) specified as
>>> long as they cover the whole range of this PMD?
>>
>> Current policy is that we don't reclaim any pages until the whole page
>> table will be torn down.
> 
> Ah OK.  But I saw that you're passing in reclaim==!start_addr.
> Shouldn't that erroneously trigger if one wants to unmap the 1st page
> as well, even if not the whole address space?

IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no
means to check whether an IOVA is valid.

> 
>> The gain is that we don't have to use a
>> spinlock when map/unmap a pmd entry anymore.
> 
> So this question should also related to above on the locking - have
> you thought about using the same way (IIUC) as the 2nd level page
> table to use cmpxchg()?  AFAIU that does not need any lock?
> 
> For me it's perfectly fine to use a lock at least for initial version,
> I just want to know the considerations behind in case I missed
> anything important.

I agree that we can use cmpxchg() to replace spinlock when populating
a page directory entry.

Best regards,
Baolu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-27  2:27         ` Lu Baolu
@ 2019-09-27  5:34           ` Peter Xu
  2019-09-28  8:23             ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-27  5:34 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

Hi, Baolu,

On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
> > > > > +	spin_lock(&(domain)->page_table_lock);				\
> > > > 
> > > > Is this intended to lock here instead of taking the lock during the
> > > > whole page table walk?  Is it safe?
> > > > 
> > > > Taking the example where nm==PTE: when we reach here how do we
> > > > guarantee that the PMD page that has this PTE is still valid?
> > > 
> > > We will always keep the non-leaf pages in the table,
> > 
> > I see.  Though, could I ask why?  It seems to me that the existing 2nd
> > level page table does not keep these on unmap, and it doesn't even use
> > locking at all, by leveraging cmpxchg()?
> 
> I still need some time to understand how cmpxchg() solves the race issue
> when reclaiming pages. For example:
> 
> Thread A				Thread B
> -A1: check all PTE's empty		-B1: up-level PDE valid
> -A2: clear the up-level PDE
> -A3: reclaim the page			-B2: populate the PTEs
> 
> Both (A1,A2) and (B1,B2) should be atomic. Otherwise, a race could happen.

I'm not sure of this, but IMHO it is similarly because we need to
allocate the iova ranges from iova allocator first, so thread A (who's
going to unmap pages) and thread B (who's going to map new pages)
should never have overlapping regions when running concurrently.  I'm
referring to intel_unmap() in which we won't free the iova region
before domain_unmap() completes (which should cover the whole process
of A1-A3) so the same iova range to be unmapped won't be allocated to
any new pages in some other thread.

There's also a hint in domain_unmap():

  /* we don't need lock here; nobody else touches the iova range */

> 
> Actually, the iova allocator always packs IOVA ranges close to the top
> of the address space. This results in requiring a minimal number of
> pages to map the allocated IOVA ranges, which makes memory consumption
> by IOMMU page tables tolerable. Hence, we don't need to reclaim the
> pages until the whole page table is about to be torn down. The real
> data on my test machine also confirms this.

Do you mean you have run the code with a 1st-level-supported IOMMU
hardware?  IMHO any data point would be good to be in the cover letter
as reference.

[...]

> > > > > +static struct page *
> > > > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> > > > > +		  unsigned long addr, unsigned long end,
> > > > > +		  struct page *freelist, bool reclaim)
> > > > > +{
> > > > > +	int i;
> > > > > +	unsigned long start;
> > > > > +	pte_t *pte, *first_pte;
> > > > > +
> > > > > +	start = addr;
> > > > > +	pte = pte_offset_kernel(pmd, addr);
> > > > > +	first_pte = pte;
> > > > > +	do {
> > > > > +		set_pte(pte, __pte(0));
> > > > > +	} while (pte++, addr += PAGE_SIZE, addr != end);
> > > > > +
> > > > > +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> > > > > +
> > > > > +	/* Add page to free list if all entries are empty. */
> > > > > +	if (reclaim) {
> > > > 
> > > > Shouldn't we know whether to reclaim if with (addr, end) specified as
> > > > long as they cover the whole range of this PMD?
> > > 
> > > Current policy is that we don't reclaim any pages until the whole page
> > > table will be torn down.
> > 
> > Ah OK.  But I saw that you're passing in reclaim==!start_addr.
> > Shouldn't that erroneously trigger if one wants to unmap the 1st page
> > as well, even if not the whole address space?
> 
> IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no
> means to check whether an IOVA is valid.

Is this an assumption of the allocator?  Could that change in the future?

IMHO that's not necessary if so, after all it's as simple as replacing
(!start_addr) with (start == 0 && end == END).  I see that in
domain_unmap() it has a similar check when freeing pgd:

  if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw))

Thanks,

-- 
Peter Xu

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-27  5:34           ` Peter Xu
@ 2019-09-28  8:23             ` Lu Baolu
  2019-09-29  5:25               ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Lu Baolu @ 2019-09-28  8:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

Hi Peter,

On 9/27/19 1:34 PM, Peter Xu wrote:
> Hi, Baolu,
> 
> On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
>>>>>> +	spin_lock(&(domain)->page_table_lock);				\
>>>>>
>>>>> Is this intended to lock here instead of taking the lock during the
>>>>> whole page table walk?  Is it safe?
>>>>>
>>>>> Taking the example where nm==PTE: when we reach here how do we
>>>>> guarantee that the PMD page that has this PTE is still valid?
>>>>
>>>> We will always keep the non-leaf pages in the table,
>>>
>>> I see.  Though, could I ask why?  It seems to me that the existing 2nd
>>> level page table does not keep these on unmap, and it doesn't even use
>>> locking at all, by leveraging cmpxchg()?
>>
>> I still need some time to understand how cmpxchg() solves the race issue
>> when reclaiming pages. For example:
>>
>> Thread A				Thread B
>> -A1: check all PTE's empty		-B1: up-level PDE valid
>> -A2: clear the up-level PDE
>> -A3: reclaim the page			-B2: populate the PTEs
>>
>> Both (A1,A2) and (B1,B2) should be atomic. Otherwise, a race could happen.
> 
> I'm not sure of this, but IMHO it is similarly because we need to
> allocate the iova ranges from iova allocator first, so thread A (who's
> going to unmap pages) and thread B (who's going to map new pages)
> should never have overlapping regions when running concurrently.  I'm

Although they don't overlap, they might share the same pmd entry. If A
clears the pmd entry while B goes ahead populating the ptes, it
will crash.

> referring to intel_unmap() in which we won't free the iova region
> before domain_unmap() completes (which should cover the whole process
> of A1-A3) so the same iova range to be unmapped won't be allocated to
> any new pages in some other thread.
> 
> There's also a hint in domain_unmap():
> 
>    /* we don't need lock here; nobody else touches the iova range */
> 
>>
>> Actually, the iova allocator always packs IOVA ranges close to the top
>> of the address space. This results in requiring a minimal number of
>> pages to map the allocated IOVA ranges, which makes memory consumption
>> by IOMMU page tables tolerable. Hence, we don't need to reclaim the
>> pages until the whole page table is about to be torn down. The real
>> data on my test machine also confirms this.
> 
> Do you mean you have run the code with a 1st-level-supported IOMMU
> hardware?  IMHO any data point would be good to be in the cover letter
> as reference.

Yes. Sure! Let me do this starting from the next version.

> 
> [...]
> 
>>>>>> +static struct page *
>>>>>> +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
>>>>>> +		  unsigned long addr, unsigned long end,
>>>>>> +		  struct page *freelist, bool reclaim)
>>>>>> +{
>>>>>> +	int i;
>>>>>> +	unsigned long start;
>>>>>> +	pte_t *pte, *first_pte;
>>>>>> +
>>>>>> +	start = addr;
>>>>>> +	pte = pte_offset_kernel(pmd, addr);
>>>>>> +	first_pte = pte;
>>>>>> +	do {
>>>>>> +		set_pte(pte, __pte(0));
>>>>>> +	} while (pte++, addr += PAGE_SIZE, addr != end);
>>>>>> +
>>>>>> +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
>>>>>> +
>>>>>> +	/* Add page to free list if all entries are empty. */
>>>>>> +	if (reclaim) {
>>>>>
>>>>> Shouldn't we know whether to reclaim if with (addr, end) specified as
>>>>> long as they cover the whole range of this PMD?
>>>>
>>>> Current policy is that we don't reclaim any pages until the whole page
>>>> table will be torn down.
>>>
>>> Ah OK.  But I saw that you're passing in reclaim==!start_addr.
>>> Shouldn't that erroneously trigger if one wants to unmap the 1st page
>>> as well, even if not the whole address space?
>>
>> IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no
>> means to check whether an IOVA is valid.
> 
> Is this an assumption of the allocator?  Could that change in the future?

Yes. And I think it should stay as long as any consumer depends on this
optimization.

> 
> IMHO that assumption isn't necessary though; after all it's as simple
> as replacing (!start_addr) with (start == 0 && end == END).  I see that
> domain_unmap() has a similar check when freeing the pgd:
> 
>    if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw))
> 

Yours looks better. Thank you!

> Thanks,
> 

Best regards,
Baolu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-28  8:23             ` Lu Baolu
@ 2019-09-29  5:25               ` Peter Xu
  2019-10-08  2:20                 ` Lu Baolu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2019-09-29  5:25 UTC (permalink / raw)
  To: Lu Baolu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

On Sat, Sep 28, 2019 at 04:23:16PM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> On 9/27/19 1:34 PM, Peter Xu wrote:
> > Hi, Baolu,
> > 
> > On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
> > > > > > > +	spin_lock(&(domain)->page_table_lock);				\
> > > > > > 
> > > > > > Is this intended to lock here instead of taking the lock during the
> > > > > > whole page table walk?  Is it safe?
> > > > > > 
> > > > > > Taking the example where nm==PTE: when we reach here how do we
> > > > > > guarantee that the PMD page that has this PTE is still valid?
> > > > > 
> > > > > We will always keep the non-leaf pages in the table,
> > > > 
> > > > > I see.  Though, could I ask why?  It seems to me that the existing 2nd
> > > > > level page table does not keep these when unmapping, and it doesn't
> > > > > even use locking at all, leveraging cmpxchg()?
> > > 
> > > > I still need some time to understand how cmpxchg() solves the race issue
> > > > when reclaiming pages. For example:
> > > 
> > > Thread A				Thread B
> > > -A1: check all PTE's empty		-B1: up-level PDE valid
> > > -A2: clear the up-level PDE
> > > -A3: reclaim the page			-B2: populate the PTEs
> > > 
> > > Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.
> > 
> > > I'm not sure of this, but IMHO it is similarly because we need to
> > > allocate the IOVA ranges from the iova allocator first, so thread A (who's
> > > going to unmap pages) and thread B (who's going to map new pages)
> > > should never have overlapping regions if running concurrently.  I'm
> 
> Although they don't overlap, they might share the same pmd entry. If A
> clears the pmd entry while B goes ahead with populating the PTEs, it
> will crash.

My understanding is that if A does not own all the pages under that PMD
entry, then it will never free the page that was backing that PMD
entry.  Please refer to the code in dma_pte_clear_level() where it
has:

        /* If range covers entire pagetable, free it */
        if (start_pfn <= level_pfn &&
                last_pfn >= level_pfn + level_size(level) - 1) {
                ...
        } else {
                ...
        }

Note that when going into the else block, the PMD won't be freed; only
the PTEs under the PMD will be cleared.

In the case you mentioned above, IMHO it should go into that else
block.  Say, thread A must not cover the whole range of that PMD;
otherwise thread B would never have been allocated pages within the
range covered by that PMD.

Thanks,

-- 
Peter Xu


* Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
  2019-09-29  5:25               ` Peter Xu
@ 2019-10-08  2:20                 ` Lu Baolu
  0 siblings, 0 replies; 38+ messages in thread
From: Lu Baolu @ 2019-10-08  2:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: kevin.tian, Yi Sun, ashok.raj, kvm, sanjay.k.kumar, iommu,
	linux-kernel, Alex Williamson, David Woodhouse, yi.y.sun

Hi,

On 9/29/19 1:25 PM, Peter Xu wrote:
> On Sat, Sep 28, 2019 at 04:23:16PM +0800, Lu Baolu wrote:
>> Hi Peter,
>>
>> On 9/27/19 1:34 PM, Peter Xu wrote:
>>> Hi, Baolu,
>>>
>>> On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
>>>>>>>> +	spin_lock(&(domain)->page_table_lock);				\
>>>>>>>
>>>>>>> Is this intended to lock here instead of taking the lock during the
>>>>>>> whole page table walk?  Is it safe?
>>>>>>>
>>>>>>> Taking the example where nm==PTE: when we reach here how do we
>>>>>>> guarantee that the PMD page that has this PTE is still valid?
>>>>>>
>>>>>> We will always keep the non-leaf pages in the table,
>>>>>
>>>>> I see.  Though, could I ask why?  It seems to me that the existing 2nd
>>>>> level page table does not keep these when unmapping, and it doesn't
>>>>> even use locking at all, leveraging cmpxchg()?
>>>>
>>>> I still need some time to understand how cmpxchg() solves the race issue
>>>> when reclaiming pages. For example:
>>>>
>>>> Thread A				Thread B
>>>> -A1: check all PTE's empty		-B1: up-level PDE valid
>>>> -A2: clear the up-level PDE
>>>> -A3: reclaim the page			-B2: populate the PTEs
>>>>
>>>> Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.
>>>
>>> I'm not sure of this, but IMHO it is similarly because we need to
>>> allocate the IOVA ranges from the iova allocator first, so thread A (who's
>>> going to unmap pages) and thread B (who's going to map new pages)
>>> should never have overlapping regions if running concurrently.  I'm
>>
>> Although they don't overlap, they might share the same pmd entry. If A
>> clears the pmd entry while B goes ahead with populating the PTEs, it
>> will crash.
> 
> My understanding is that if A does not own all the pages under that PMD
> entry, then it will never free the page that was backing that PMD
> entry.  Please refer to the code in dma_pte_clear_level() where it
> has:
> 
>          /* If range covers entire pagetable, free it */
>          if (start_pfn <= level_pfn &&
>                  last_pfn >= level_pfn + level_size(level) - 1) {
>                  ...
>          } else {
>                  ...
>          }
> 
> Note that when going into the else block, the PMD won't be freed; only
> the PTEs under the PMD will be cleared.

Exactly! Thanks for pointing this out.

I will do the same thing in v2.

> 
> In the case you mentioned above, IMHO it should go into that else
> block.  Say, thread A must not cover the whole range of that PMD;
> otherwise thread B would never have been allocated pages within the
> range covered by that PMD.
> 
> Thanks,
> 

Best regards,
Baolu


Thread overview: 38+ messages
2019-09-23 12:24 [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Lu Baolu
2019-09-23 12:24 ` [RFC PATCH 1/4] iommu/vt-d: Move domain_flush_cache helper into header Lu Baolu
2019-09-23 12:24 ` [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces Lu Baolu
2019-09-23 20:31   ` Raj, Ashok
2019-09-24  1:38     ` Lu Baolu
2019-09-25  4:30       ` Peter Xu
2019-09-25  4:38         ` Tian, Kevin
2019-09-25  5:24           ` Peter Xu
2019-09-25  6:52             ` Lu Baolu
2019-09-25  7:32               ` Tian, Kevin
2019-09-25  8:35                 ` Peter Xu
2019-09-26  1:42                 ` Lu Baolu
2019-09-25  5:21   ` Peter Xu
2019-09-26  2:35     ` Lu Baolu
2019-09-26  3:49       ` Peter Xu
2019-09-27  2:27         ` Lu Baolu
2019-09-27  5:34           ` Peter Xu
2019-09-28  8:23             ` Lu Baolu
2019-09-29  5:25               ` Peter Xu
2019-10-08  2:20                 ` Lu Baolu
2019-09-23 12:24 ` [RFC PATCH 3/4] iommu/vt-d: Map/unmap domain with mmmap/mmunmap Lu Baolu
2019-09-25  5:00   ` Tian, Kevin
2019-09-25  7:06     ` Lu Baolu
2019-09-23 12:24 ` [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table Lu Baolu
2019-09-25  6:50   ` Peter Xu
2019-09-25  7:35     ` Tian, Kevin
2019-09-23 19:27 ` [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest Jacob Pan
2019-09-23 20:25   ` Raj, Ashok
2019-09-24  4:40     ` Lu Baolu
2019-09-24  7:00     ` Tian, Kevin
2019-09-25  2:48       ` Lu Baolu
2019-09-25  6:56         ` Peter Xu
2019-09-25  7:21           ` Tian, Kevin
2019-09-25  7:45             ` Peter Xu
2019-09-25  8:02               ` Tian, Kevin
2019-09-25  8:52                 ` Peter Xu
2019-09-26  1:37                   ` Lu Baolu
2019-09-24  4:27   ` Lu Baolu
