* [RFC 0/3] iommu/intel: Free empty page tables on unmaps
@ 2023-12-21  3:19 Pasha Tatashin
  2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  3:19 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

This series frees empty page tables on unmaps. It is intended to be a
low-overhead feature.

A reader-writer lock is used to synchronize page table access; most of
the time it is held as a reader. It is held as a writer only for short
periods, when unmapping a page that is larger than the current IOVA
request. In all other cases the lock is taken read-only.

page->refcount is used to track the number of entries in each page
table.

Microbenchmark data using iova_stress[1]:

Base:
yqbtg12:/home# ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~  32783M time:   22.297s

Fix:
yqbtg12:/home# ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~      0M time:   23.388s

The test maps/unmaps 4K pages and cycles through the IOVA space.
Base uses 32G of memory, and the test completes in 22.3s.
Fix uses 0G of memory, and the test completes in 23.4s.

I believe the proposed fix is a good compromise in terms of complexity
and scalability. A more scalable solution would be to use a separate
reader/writer lock per page table, with the page->private field storing
the lock itself.

However, since the IOMMU already provides some protection (i.e. nobody
else touches the IOVA space of the requested map/unmap), we can avoid
the extra complexity, rely on a single RW lock for the whole page
table, and stay in reader mode most of the time.
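
For reference, this is the rough shape of the unmap-side synchronization
(a simplified sketch distilled from patches 2 and 3; the helper name is
made up and error handling is omitted):

/*
 * Called under read_lock(&domain->pgd_lock) while walking and clearing
 * PTEs; upgrades to the write lock only when a child page table looks
 * empty (refcount back to its post-allocation value of 1).
 */
static bool try_free_empty_table(struct dmar_domain *domain,
				 struct dma_pte *parent_pte,
				 struct page *npage,
				 struct list_head *freelist)
{
	bool freed = false;

	if (page_count(npage) != 1)		/* table still has entries */
		return false;

	page_ref_inc(npage);			/* keep it alive across the upgrade */
	read_unlock(&domain->pgd_lock);
	write_lock(&domain->pgd_lock);
	if (page_count(npage) == 2) {		/* still empty: safe to unlink */
		dma_clear_pte(parent_pte);	/* unlink from the parent table */
		list_add_tail(&npage->lru, freelist);
		freed = true;
	}
	write_unlock(&domain->pgd_lock);
	read_lock(&domain->pgd_lock);
	page_ref_dec(npage);

	return freed;
}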

[1] https://github.com/soleen/iova_stress

Pasha Tatashin (3):
  iommu/intel: Use page->refcount to count number of entries in IOMMU
  iommu/intel: synchronize page table map and unmap operations
  iommu/intel: free empty page tables on unmaps

 drivers/iommu/intel/iommu.c | 153 ++++++++++++++++++++++++++++--------
 drivers/iommu/intel/iommu.h |  44 +++++++++--
 2 files changed, 158 insertions(+), 39 deletions(-)

-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
  2023-12-21  3:19 [RFC 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
@ 2023-12-21  3:19 ` Pasha Tatashin
  2023-12-25  2:59   ` Liu, Jingqi
  2023-12-25 11:03   ` kernel test robot
  2023-12-21  3:19 ` [RFC 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  3:19 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

In order to be able to efficiently free empty page table levels, count the
number of entries in each page table my incremeanting and decremeanting
refcount every time a PTE is inserted or removed form the page table.

For this to work correctly, add two helper functions, dma_clear_pte()
and dma_set_pte(), where the counting is performed.

Also, modify the code so every page table entry is always updated using the
two new functions.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 40 +++++++++++++++++++++---------------
 drivers/iommu/intel/iommu.h | 41 +++++++++++++++++++++++++++++++------
 2 files changed, 58 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47d..4688ef797161 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -949,7 +949,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
 			if (domain->use_first_level)
 				pteval |= DMA_FL_PTE_XD | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;
 
-			if (cmpxchg64(&pte->val, 0ULL, pteval))
+			if (dma_set_pte(pte, pteval))
 				/* Someone else set it while we were thinking; use theirs. */
 				free_pgtable_page(tmp_page);
 			else
@@ -1021,7 +1021,8 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
 			continue;
 		}
 		do {
-			dma_clear_pte(pte);
+			if (dma_pte_present(pte))
+				dma_clear_pte(pte);
 			start_pfn += lvl_to_nr_pages(large_page);
 			pte++;
 		} while (start_pfn <= last_pfn && !first_pte_in_page(pte));
@@ -1062,7 +1063,8 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
 		 */
 		if (level < retain_level && !(start_pfn > level_pfn ||
 		      last_pfn < level_pfn + level_size(level) - 1)) {
-			dma_clear_pte(pte);
+			if (dma_pte_present(pte))
+				dma_clear_pte(pte);
 			domain_flush_cache(domain, pte, sizeof(*pte));
 			free_pgtable_page(level_pte);
 		}
@@ -1093,12 +1095,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	}
 }
 
-/* When a page at a given level is being unlinked from its parent, we don't
-   need to *modify* it at all. All we need to do is make a list of all the
-   pages which can be freed just as soon as we've flushed the IOTLB and we
-   know the hardware page-walk will no longer touch them.
-   The 'pte' argument is the *parent* PTE, pointing to the page that is to
-   be freed. */
+/*
+ * A given page at a given level is being unlinked from its parent.
+ * We need to make a list of all the pages which can be freed just as soon as
+ * we've flushed the IOTLB and we know the hardware page-walk will no longer
+ * touch them. The 'pte' argument is the *parent* PTE, pointing to the page
+ * that is to be freed.
+ */
 static void dma_pte_list_pagetables(struct dmar_domain *domain,
 				    int level, struct dma_pte *pte,
 				    struct list_head *freelist)
@@ -1106,17 +1109,20 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 	struct page *pg;
 
 	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
-	list_add_tail(&pg->lru, freelist);
-
-	if (level == 1)
-		return;
-
 	pte = page_address(pg);
+
 	do {
-		if (dma_pte_present(pte) && !dma_pte_superpage(pte))
-			dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+		if (dma_pte_present(pte)) {
+			if (level > 1 && !dma_pte_superpage(pte)) {
+				dma_pte_list_pagetables(domain, level - 1, pte,
+							freelist);
+			}
+			dma_clear_pte(pte);
+		}
 		pte++;
 	} while (!first_pte_in_page(pte));
+
+	list_add_tail(&pg->lru, freelist);
 }
 
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
@@ -2244,7 +2250,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 		/* We don't need lock here, nobody else
 		 * touches the iova range
 		 */
-		tmp = cmpxchg64_local(&pte->val, 0ULL, pteval);
+		tmp = dma_set_pte(pte, pteval);
 		if (tmp) {
 			static int dumps = 5;
 			pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index ce030c5b5772..f1ea508f45bd 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -802,11 +802,6 @@ struct dma_pte {
 	u64 val;
 };
 
-static inline void dma_clear_pte(struct dma_pte *pte)
-{
-	pte->val = 0;
-}
-
 static inline u64 dma_pte_addr(struct dma_pte *pte)
 {
 #ifdef CONFIG_64BIT
@@ -818,9 +813,43 @@ static inline u64 dma_pte_addr(struct dma_pte *pte)
 #endif
 }
 
+#define DMA_PTEVAL_PRESENT(pteval) (((pteval) & 3) != 0)
 static inline bool dma_pte_present(struct dma_pte *pte)
 {
-	return (pte->val & 3) != 0;
+	return DMA_PTEVAL_PRESENT(pte->val);
+}
+
+static inline void dma_clear_pte(struct dma_pte *pte)
+{
+	u64 old_pteval;
+
+	old_pteval = xchg(&pte->val, 0ULL);
+	if (DMA_PTEVAL_PRESENT(old_pteval)) {
+		struct page *pg = virt_to_page(pte);
+		int rc = page_ref_dec_return(pg);
+
+		WARN_ON_ONCE(rc > 512 || rc < 1);
+	} else {
+		/* Ensure that we cleared a valid entry from the page table */
+		WARN_ON(1);
+	}
+}
+
+static inline u64 dma_set_pte(struct dma_pte *pte, u64 pteval)
+{
+	u64 old_pteval;
+
+	/* Ensure we are about to set a valid entry in the page table */
+	WARN_ON(!DMA_PTEVAL_PRESENT(pteval));
+	old_pteval = cmpxchg64(&pte->val, 0ULL, pteval);
+	if (old_pteval == 0) {
+		struct page *pg = virt_to_page(pte);
+		int rc = page_ref_inc_return(pg);
+
+		WARN_ON_ONCE(rc > 513 || rc < 2);
+	}
+
+	return old_pteval;
 }
 
 static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 2/3] iommu/intel: synchronize page table map and unmap operations
  2023-12-21  3:19 [RFC 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
  2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
@ 2023-12-21  3:19 ` Pasha Tatashin
  2023-12-21  3:19 ` [RFC 3/3] iommu/intel: free empty page tables on unmaps Pasha Tatashin
  2023-12-21  4:16 ` [RFC 0/3] iommu/intel: Free " Matthew Wilcox
  3 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  3:19 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

Since we are going to update parent page table entries when lower
level page tables become empty and are added to the free list, we
need a way to synchronize these operations.

Use domain->pgd_lock to protect all map and unmap operations. This is
a reader/writer lock. Initially everything runs in read-only mode;
later, when freeing page tables on unmap is added, a writer section
will be introduced as well.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 21 +++++++++++++++++++--
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4688ef797161..733f25b277a3 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1082,11 +1082,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 				   unsigned long last_pfn,
 				   int retain_level)
 {
+	read_lock(&domain->pgd_lock);
 	dma_pte_clear_range(domain, start_pfn, last_pfn);
 
 	/* We don't need lock here; nobody else touches the iova range */
 	dma_pte_free_level(domain, agaw_to_level(domain->agaw), retain_level,
 			   domain->pgd, 0, start_pfn, last_pfn);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -1179,9 +1181,11 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	    WARN_ON(start_pfn > last_pfn))
 		return;
 
+	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
 			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2217,6 +2221,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 	pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | attr;
 
+	read_lock(&domain->pgd_lock);
 	while (nr_pages > 0) {
 		uint64_t tmp;
 
@@ -2226,8 +2231,10 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 			pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
 					     gfp);
-			if (!pte)
+			if (!pte) {
+				read_unlock(&domain->pgd_lock);
 				return -ENOMEM;
+			}
 			first_pte = pte;
 
 			lvl_pages = lvl_to_nr_pages(largepage_lvl);
@@ -2287,6 +2294,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			pte = NULL;
 		}
 	}
+	read_unlock(&domain->pgd_lock);
 
 	return 0;
 }
@@ -4013,6 +4021,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 	domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
 	if (!domain->pgd)
 		return -ENOMEM;
+	rwlock_init(&domain->pgd_lock);
 	domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
 	return 0;
 }
@@ -4247,11 +4256,15 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	/* Cope with horrid API which requires us to unmap more than the
 	   size argument if it happens to be a large-page mapping. */
 	if (unlikely(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
-				     &level, GFP_ATOMIC)))
+				     &level, GFP_ATOMIC))) {
+		read_unlock(&dmar_domain->pgd_lock);
 		return 0;
+	}
+	read_unlock(&dmar_domain->pgd_lock);
 
 	if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
 		size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4315,8 +4328,10 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 	int level = 0;
 	u64 phys = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
 			     GFP_ATOMIC);
+	read_unlock(&dmar_domain->pgd_lock);
 	if (pte && dma_pte_present(pte))
 		phys = dma_pte_addr(pte) +
 			(iova & (BIT_MASK(level_to_offset_bits(level) +
@@ -4919,8 +4934,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 		struct dma_pte *pte;
 		int lvl = 0;
 
+		read_lock(&dmar_domain->pgd_lock);
 		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
 				     GFP_ATOMIC);
+		read_unlock(&dmar_domain->pgd_lock);
 		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
 		if (!pte || !dma_pte_present(pte)) {
 			iova += pgsize;
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index f1ea508f45bd..cb0577ec5166 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -618,6 +618,9 @@ struct dmar_domain {
 		struct {
 			/* virtual address */
 			struct dma_pte	*pgd;
+
+			/* Synchronizes pgd map/unmap operations */
+			rwlock_t	pgd_lock;
 			/* max guest address width */
 			int		gaw;
 			/*
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 3/3] iommu/intel: free empty page tables on unmaps
  2023-12-21  3:19 [RFC 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
  2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
  2023-12-21  3:19 ` [RFC 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
@ 2023-12-21  3:19 ` Pasha Tatashin
  2023-12-21  4:16 ` [RFC 0/3] iommu/intel: Free " Matthew Wilcox
  3 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  3:19 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

When page tables become empty, add them to the freelist so that they
can also be freed.

This means that page tables outside of the immediate IOVA range might
be freed as well. Therefore, the writer lock is taken only in the case
where such page tables are going to be freed.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 92 +++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 733f25b277a3..141dc106fb01 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1130,7 +1130,7 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				struct dma_pte *pte, unsigned long pfn,
 				unsigned long start_pfn, unsigned long last_pfn,
-				struct list_head *freelist)
+				struct list_head *freelist, int *freed_level)
 {
 	struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1156,11 +1156,48 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				first_pte = pte;
 			last_pte = pte;
 		} else if (level > 1) {
+			struct dma_pte *npte = phys_to_virt(dma_pte_addr(pte));
+			struct page *npage = virt_to_page(npte);
+
 			/* Recurse down into a level that isn't *entirely* obsolete */
-			dma_pte_clear_level(domain, level - 1,
-					    phys_to_virt(dma_pte_addr(pte)),
+			dma_pte_clear_level(domain, level - 1, npte,
 					    level_pfn, start_pfn, last_pfn,
-					    freelist);
+					    freelist, freed_level);
+
+			/*
+			 * Free next level page table if it became empty.
+			 *
+			 * We are only holding the reader lock, and it is
+			 * possible that other threads are accessing the page
+			 * table as readers as well. We can free a page table
+			 * that is outside of the requested IOVA space only if
+			 * we grab the writer lock. Since we need to drop the
+			 * reader lock, we increment the refcount of npage so
+			 * that it (and the current page table) does not
+			 * disappear due to concurrent unmapping threads.
+			 *
+			 * Store the level of the largest page table that was
+			 * freed in freed_level, so the size of the IOTLB
+			 * flush can be determined.
+			 */
+			if (freed_level && page_count(npage) == 1) {
+				page_ref_inc(npage);
+				read_unlock(&domain->pgd_lock);
+				write_lock(&domain->pgd_lock);
+				if (page_count(npage) == 2) {
+					dma_clear_pte(pte);
+
+					if (!first_pte)
+						first_pte = pte;
+
+					last_pte = pte;
+					list_add_tail(&npage->lru, freelist);
+					*freed_level = level;
+				}
+				write_unlock(&domain->pgd_lock);
+				read_lock(&domain->pgd_lock);
+				page_ref_dec(npage);
+			}
 		}
 next:
 		pfn = level_pfn + level_size(level);
@@ -1175,7 +1212,8 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
    the page tables, and may have cached the intermediate levels. The
    pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-			 unsigned long last_pfn, struct list_head *freelist)
+			 unsigned long last_pfn, struct list_head *freelist,
+			 int *level)
 {
 	if (WARN_ON(!domain_pfn_supported(domain, last_pfn)) ||
 	    WARN_ON(start_pfn > last_pfn))
@@ -1184,7 +1222,8 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+			    domain->pgd, 0, start_pfn, last_pfn, freelist,
+			    level);
 	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
@@ -1524,11 +1563,11 @@ static void domain_flush_pasid_iotlb(struct intel_iommu *iommu,
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
 				  struct dmar_domain *domain,
-				  unsigned long pfn, unsigned int pages,
+				  unsigned long pfn, unsigned long pages,
 				  int ih, int map)
 {
-	unsigned int aligned_pages = __roundup_pow_of_two(pages);
-	unsigned int mask = ilog2(aligned_pages);
+	unsigned long aligned_pages = __roundup_pow_of_two(pages);
+	unsigned long mask = ilog2(aligned_pages);
 	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
 	u16 did = domain_id_iommu(domain, iommu);
 
@@ -1872,7 +1911,8 @@ static void domain_exit(struct dmar_domain *domain)
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
 
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+			     NULL);
 		put_pages_list(&freelist);
 	}
 
@@ -3579,7 +3619,8 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 			struct intel_iommu *iommu;
 			LIST_HEAD(freelist);
 
-			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
+			domain_unmap(si_domain, start_vpfn, last_vpfn,
+				     &freelist, NULL);
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
@@ -4253,6 +4294,7 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *gather)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	bool queued = iommu_iotlb_gather_queued(gather);
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
@@ -4272,7 +4314,16 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	start_pfn = iova >> VTD_PAGE_SHIFT;
 	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
 
-	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);
+	/*
+	 * Pass level only if !queued, which means we will do the iotlb
+	 * flush callback before freeing pages from the freelist.
+	 *
+	 * When level is passed, domain_unmap() will attempt to add empty
+	 * page tables to the freelist, and pass back the level number of
+	 * the highest page table that was added to the freelist.
+	 */
+	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist,
+		     queued ? NULL : &level);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
@@ -4281,8 +4332,21 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	 * We do not use page-selective IOTLB invalidation in flush queue,
 	 * so there is no need to track page and sync iotlb.
 	 */
-	if (!iommu_iotlb_gather_queued(gather))
-		iommu_iotlb_gather_add_page(domain, gather, iova, size);
+	if (!queued) {
+		size_t sz = size;
+
+		/*
+		 * Increase iova and sz for flushing if level was returned,
+		 * as it means we also are freeing some page tables.
+		 * as it means we are also freeing some page tables.
+		if (level) {
+			unsigned long pgsize = level_size(level) << VTD_PAGE_SHIFT;
+
+			iova = ALIGN_DOWN(iova, pgsize);
+			sz = ALIGN(size, pgsize);
+		}
+		iommu_iotlb_gather_add_page(domain, gather, iova, sz);
+	}
 
 	return size;
 }
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps
  2023-12-21  3:19 [RFC 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
                   ` (2 preceding siblings ...)
  2023-12-21  3:19 ` [RFC 3/3] iommu/intel: free empty page tables on unmaps Pasha Tatashin
@ 2023-12-21  4:16 ` Matthew Wilcox
  2023-12-21  5:13   ` Pasha Tatashin
  3 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2023-12-21  4:16 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> This series frees empty page tables on unmaps. It is intended to be a
> low-overhead feature.
> 
> A reader-writer lock is used to synchronize page table access; most of
> the time it is held as a reader. It is held as a writer only for short
> periods, when unmapping a page that is larger than the current IOVA
> request. In all other cases the lock is taken read-only.
> 
> page->refcount is used to track the number of entries in each page
> table.

Have I not put enough DANGER signs up around the page refcount?

 * If you want to use the refcount field, it must be used in such a way
 * that other CPUs temporarily incrementing and then decrementing the
 * refcount does not cause problems.  On receiving the page from
 * alloc_pages(), the refcount will be positive.

You can't use refcount for your purpose, and honestly I'm shocked you
haven't seen any of your WARNings trigger.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps
  2023-12-21  4:16 ` [RFC 0/3] iommu/intel: Free " Matthew Wilcox
@ 2023-12-21  5:13   ` Pasha Tatashin
  2023-12-21  5:42     ` Pasha Tatashin
  0 siblings, 1 reply; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  5:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > This series frees empty page tables on unmaps. It is intended to be a
> > low-overhead feature.
> >
> > A reader-writer lock is used to synchronize page table access; most of
> > the time it is held as a reader. It is held as a writer only for short
> > periods, when unmapping a page that is larger than the current IOVA
> > request. In all other cases the lock is taken read-only.
> >
> > page->refcount is used to track the number of entries in each page
> > table.
>
> Have I not put enough DANGER signs up around the page refcount?
>
>  * If you want to use the refcount field, it must be used in such a way
>  * that other CPUs temporarily incrementing and then decrementing the
>  * refcount does not cause problems.  On receiving the page from
>  * alloc_pages(), the refcount will be positive.
>
> You can't use refcount for your purpose, and honestly I'm shocked you
> haven't seen any of your WARNings trigger.

Hi Matthew,

Thank you for looking at this.

Could you please explain exactly why refcount can't be used like this?

After alloc_page() refcount is set to 1, we never reduce it to 0,
every new entry in a page table adds 1, so we get up-to 513, that is
why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
dma_set_pte(). When the refcount is 1, we know that the page table is
empty, and it can be added to a freelist for delayed freeing.

What is wrong with using the refcount as a scalable way of keeping
track of the number of entries in an IOMMU page table? Is there a
better way that I should use?

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps
  2023-12-21  5:13   ` Pasha Tatashin
@ 2023-12-21  5:42     ` Pasha Tatashin
  2023-12-21 14:06       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21  5:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > This series frees empty page tables on unmaps. It is intended to be a
> > > low-overhead feature.
> > >
> > > A reader-writer lock is used to synchronize page table access; most of
> > > the time it is held as a reader. It is held as a writer only for short
> > > periods, when unmapping a page that is larger than the current IOVA
> > > request. In all other cases the lock is taken read-only.
> > >
> > > page->refcount is used to track the number of entries in each page
> > > table.
> >
> > Have I not put enough DANGER signs up around the page refcount?
> >
> >  * If you want to use the refcount field, it must be used in such a way
> >  * that other CPUs temporarily incrementing and then decrementing the
> >  * refcount does not cause problems.  On receiving the page from
> >  * alloc_pages(), the refcount will be positive.
> >
> > You can't use refcount for your purpose, and honestly I'm shocked you
> > haven't seen any of your WARNings trigger.
>
> Hi Matthew,
>
> Thank you for looking at this.
>
> Could you please explain exactly why refcount can't be used like this?
>
> After alloc_page() refcount is set to 1, we never reduce it to 0,
> every new entry in a page table adds 1, so we get up-to 513, that is
> why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to

I guess what you mean is that other CPUs could temporarily
increase/decrease the refcount outside of IOMMU management; do you have
an example of why that would happen? I could remove the above warning,
and in the worst case we would miss an opportunity to free a page
table during unmap. That is not a big deal: it can be freed during
another map/unmap event. Still better than today, where we never free
them during unmaps.

Pasha

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps
  2023-12-21  5:42     ` Pasha Tatashin
@ 2023-12-21 14:06       ` Matthew Wilcox
  2023-12-21 14:58         ` Pasha Tatashin
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2023-12-21 14:06 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Thu, Dec 21, 2023 at 12:42:41AM -0500, Pasha Tatashin wrote:
> On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > > This series frees empty page tables on unmaps. It is intended to be a
> > > > low-overhead feature.
> > > >
> > > > A reader-writer lock is used to synchronize page table access; most of
> > > > the time it is held as a reader. It is held as a writer only for short
> > > > periods, when unmapping a page that is larger than the current IOVA
> > > > request. In all other cases the lock is taken read-only.
> > > >
> > > > page->refcount is used to track the number of entries in each page
> > > > table.
> > >
> > > Have I not put enough DANGER signs up around the page refcount?
> > >
> > >  * If you want to use the refcount field, it must be used in such a way
> > >  * that other CPUs temporarily incrementing and then decrementing the
> > >  * refcount does not cause problems.  On receiving the page from
> > >  * alloc_pages(), the refcount will be positive.
> > >
> > > You can't use refcount for your purpose, and honestly I'm shocked you
> > > haven't seen any of your WARNings trigger.
> >
> > Hi Matthew,
> >
> > Thank you for looking at this.
> >
> > Could you please explain exactly why refcount can't be used like this?
> >
> > After alloc_page() refcount is set to 1, we never reduce it to 0,
> > every new entry in a page table adds 1, so we get up-to 513, that is
> > why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
> 
> I guess what you mean is that other CPUs could temporarily
> increase/decrease the refcount outside of IOMMU management; do you have
> an example of why that would happen? I could remove the above warning,
> and in the worst case we would miss an opportunity to free a page
> table during unmap. That is not a big deal: it can be freed during
> another map/unmap event. Still better than today, where we never free
> them during unmaps.

Both GUP-fast and the page cache will find a page under RCU protection,
inc its refcount if not zero, check the page is still the one they were
looking for, and if not will dec the refcount again.  That means if a
page has been in the page cache or process page tables and you can't
guarantee that all CPUs have been through the requisite grace periods,
you might see the refcount increased.
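
Roughly, that pattern looks like this (an illustrative sketch only, not
any one kernel function; the xarray lookup and the function name are
stand-ins):

static struct page *speculative_get(struct xarray *xa, unsigned long index)
{
	struct page *page;

	rcu_read_lock();
	page = xa_load(xa, index);
	if (page && get_page_unless_zero(page)) {
		/* Re-check: the slot may have been reused meanwhile. */
		if (xa_load(xa, index) != page) {
			put_page(page);	/* the transient ref others may observe */
			page = NULL;
		}
	} else {
		page = NULL;
	}
	rcu_read_unlock();

	return page;
}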

I'm not prepared to make a guarantee that these are the only circumstances
under which you'll see a temporarily higher refcount than you expect.
Either currently or in the future.  If you use the refcount as anything
other than a refcount, you're living dangerously.  And if you think that
you'll be the one to do the last refcount put, you're not necessarily
correct (see the saga around __free_pages() which ended up as commit
e320d3012d25 fixed by 462a8e08e0e6 (which indicates the rare race does
actually happen)).

Now, it seems like from your further explanation that the consequence
of getting this wrong is simply that you fail to free the page early.
That seems OK, but I insist that you insert some comments explaining
what is going on and why it's safe so somebody auditing uses of refcount
doesn't have to reanalyse the whole thing for themself.  Or worse that
somebody working on the iommu sees this and thinks they can "improve"
on it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps
  2023-12-21 14:06       ` Matthew Wilcox
@ 2023-12-21 14:58         ` Pasha Tatashin
  0 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-21 14:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Thu, Dec 21, 2023 at 9:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 21, 2023 at 12:42:41AM -0500, Pasha Tatashin wrote:
> > On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > > > This series frees empty page tables on unmaps. It is intended to be a
> > > > > low-overhead feature.
> > > > >
> > > > > A reader-writer lock is used to synchronize page table access; most of
> > > > > the time it is held as a reader. It is held as a writer only for short
> > > > > periods, when unmapping a page that is larger than the current IOVA
> > > > > request. In all other cases the lock is taken read-only.
> > > > >
> > > > > page->refcount is used to track the number of entries in each page
> > > > > table.
> > > >
> > > > Have I not put enough DANGER signs up around the page refcount?
> > > >
> > > >  * If you want to use the refcount field, it must be used in such a way
> > > >  * that other CPUs temporarily incrementing and then decrementing the
> > > >  * refcount does not cause problems.  On receiving the page from
> > > >  * alloc_pages(), the refcount will be positive.
> > > >
> > > > You can't use refcount for your purpose, and honestly I'm shocked you
> > > > haven't seen any of your WARNings trigger.
> > >
> > > Hi Matthew,
> > >
> > > Thank you for looking at this.
> > >
> > > Could you please explain exactly why refcount can't be used like this?
> > >
> > > After alloc_page() refcount is set to 1, we never reduce it to 0,
> > > every new entry in a page table adds 1, so we get up-to 513, that is
> > > why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
> >
> > I guess what you mean is that other CPUs could temporarily
> > increase/decrease the refcount outside of IOMMU management; do you have
> > an example of why that would happen? I could remove the above warning,
> > and in the worst case we would miss an opportunity to free a page
> > table during unmap. That is not a big deal: it can be freed during
> > another map/unmap event. Still better than today, where we never free
> > them during unmaps.
>
> Both GUP-fast and the page cache will find a page under RCU protection,
> inc its refcount if not zero, check the page is still the one they were
> looking for, and if not will dec the refcount again.  That means if a
> page has been in the page cache or process page tables and you can't
> guarantee that all CPUs have been through the requisite grace periods,
> you might see the refcount increased.

Interesting scenario. It sounds like this could only happen for a
short period of time at the beginning of the life of a page in the
IOMMU page table.


> I'm not prepared to make a guarantee that these are the only circumstances
> under which you'll see a temporarily higher refcount than you expect.
> Either currently or in the future.  If you use the refcount as anything
> other than a refcount, you're living dangerously.  And if you think that
> you'll be the one to do the last refcount put, you're not necessarily
> correct (see the saga around __free_pages() which ended up as commit
> e320d3012d25 fixed by 462a8e08e0e6 (which indicates the rare race does
> actually happen)).
>
> Now, it seems like from your further explanation that the consequence
> of getting this wrong is simply that you fail to free the page early.
> That seems OK, but I insist that you insert some comments explaining
> what is going on and why it's safe so somebody auditing uses of refcount
> doesn't have to reanalyse the whole thing for themself.  Or worse that
> somebody working on the iommu sees this and thinks they can "improve"
> on it.

Yes, I can add detailed comments explaining how refcount is used here.

Alternatively, I was thinking of using mapcount:

From mm_types.h:
  * If your page will not be mapped to userspace, you can also use the
  * four bytes in the mapcount union, but you must call
  * page_mapcount_reset() before freeing it.

It sounds like we can safely use _mapcount for our needs, and do
page_mapcount_reset() before freeing pages.
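
Roughly (a hypothetical, untested sketch; the helper names are made up):

/*
 * Count page-table entries in page->_mapcount instead of the refcount.
 * _mapcount starts at -1, so an empty table reads -1 and a full
 * 512-entry second-level table reads 511.
 */
static inline void dma_pte_count_inc(struct dma_pte *pte)
{
	atomic_inc(&virt_to_page(pte)->_mapcount);
}

static inline void dma_pte_count_dec(struct dma_pte *pte)
{
	atomic_dec(&virt_to_page(pte)->_mapcount);
}

static inline bool dma_pgtable_empty(struct page *pg)
{
	return atomic_read(&pg->_mapcount) == -1;
}

static inline void dma_free_pgtable_page(struct page *pg)
{
	page_mapcount_reset(pg);	/* required before freeing the page */
	__free_page(pg);
}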

Pasha

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
  2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
@ 2023-12-25  2:59   ` Liu, Jingqi
  2023-12-28 15:01     ` Pasha Tatashin
  2023-12-25 11:03   ` kernel test robot
  1 sibling, 1 reply; 12+ messages in thread
From: Liu, Jingqi @ 2023-12-25  2:59 UTC (permalink / raw)
  To: Pasha Tatashin, akpm, linux-mm, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

On 12/21/2023 11:19 AM, Pasha Tatashin wrote:
> In order to be able to efficiently free empty page table levels, count the
> number of entries in each page table my incremeanting and decremeanting
s/incremeanting/incrementing
s/decremeanting/decrementing

> refcount every time a PTE is inserted or removed form the page table.
s/form/from

> 
> For this to work correctly, add two helper functions, dma_clear_pte()
> and dma_set_pte(), where the counting is performed.
> 
> Also, modify the code so every page table entry is always updated using the
> two new functions.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>   drivers/iommu/intel/iommu.c | 40 +++++++++++++++++++++---------------
>   drivers/iommu/intel/iommu.h | 41 +++++++++++++++++++++++++++++++------
>   2 files changed, 58 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 897159dba47d..4688ef797161 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -949,7 +949,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>   			if (domain->use_first_level)
>   				pteval |= DMA_FL_PTE_XD | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;
>   
> -			if (cmpxchg64(&pte->val, 0ULL, pteval))
> +			if (dma_set_pte(pte, pteval))
>   				/* Someone else set it while we were thinking; use theirs. */
>   				free_pgtable_page(tmp_page);
>   			else
> @@ -1021,7 +1021,8 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
>   			continue;
>   		}
>   		do {
> -			dma_clear_pte(pte);
> +			if (dma_pte_present(pte))
> +				dma_clear_pte(pte);
>   			start_pfn += lvl_to_nr_pages(large_page);
>   			pte++;
>   		} while (start_pfn <= last_pfn && !first_pte_in_page(pte));
> @@ -1062,7 +1063,8 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
>   		 */
>   		if (level < retain_level && !(start_pfn > level_pfn ||
>   		      last_pfn < level_pfn + level_size(level) - 1)) {
> -			dma_clear_pte(pte);
> +			if (dma_pte_present(pte))
> +				dma_clear_pte(pte);
>   			domain_flush_cache(domain, pte, sizeof(*pte));
>   			free_pgtable_page(level_pte);
>   		}
> @@ -1093,12 +1095,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
>   	}
>   }
>   
> -/* When a page at a given level is being unlinked from its parent, we don't
> -   need to *modify* it at all. All we need to do is make a list of all the
> -   pages which can be freed just as soon as we've flushed the IOTLB and we
> -   know the hardware page-walk will no longer touch them.
> -   The 'pte' argument is the *parent* PTE, pointing to the page that is to
> -   be freed. */
> +/*
> + * A given page at a given level is being unlinked from its parent.
> + * We need to make a list of all the pages which can be freed just as soon as
> + * we've flushed the IOTLB and we know the hardware page-walk will no longer
> + * touch them. The 'pte' argument is the *parent* PTE, pointing to the page
> + * that is to be freed.
> + */
>   static void dma_pte_list_pagetables(struct dmar_domain *domain,
>   				    int level, struct dma_pte *pte,
>   				    struct list_head *freelist)
> @@ -1106,17 +1109,20 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
>   	struct page *pg;
>   
>   	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
> -	list_add_tail(&pg->lru, freelist);
> -
> -	if (level == 1)
> -		return;
> -
>   	pte = page_address(pg);
> +
>   	do {
> -		if (dma_pte_present(pte) && !dma_pte_superpage(pte))
> -			dma_pte_list_pagetables(domain, level - 1, pte, freelist);
> +		if (dma_pte_present(pte)) {
> +			if (level > 1 && !dma_pte_superpage(pte)) {
> +				dma_pte_list_pagetables(domain, level - 1, pte,
> +							freelist);
> +			}
> +			dma_clear_pte(pte);
> +		}
>   		pte++;
>   	} while (!first_pte_in_page(pte));
> +
> +	list_add_tail(&pg->lru, freelist);
>   }
>   
How about calculating the page refcount decrement when the pages in the
freelist are actually freed, in iommu_free_pgtbl_pages()?
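
For example, something along these lines (a rough, untested sketch of
that idea; the function name is the one mentioned above):

static void iommu_free_pgtbl_pages(struct list_head *freelist)
{
	struct page *pg, *tmp;

	list_for_each_entry_safe(pg, tmp, freelist, lru) {
		list_del(&pg->lru);
		/*
		 * Drop all the per-entry references in one step, back to
		 * the value alloc_pages() left us with, then free the page.
		 */
		if (page_ref_count(pg) > 1)
			page_ref_sub(pg, page_ref_count(pg) - 1);
		__free_page(pg);
	}
}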

Thanks,
Jingqi


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
  2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
  2023-12-25  2:59   ` Liu, Jingqi
@ 2023-12-25 11:03   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2023-12-25 11:03 UTC (permalink / raw)
  To: Pasha Tatashin; +Cc: llvm, oe-kbuild-all

Hi Pasha,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on joro-iommu/next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.7-rc7 next-20231222]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pasha-Tatashin/iommu-intel-Use-page-refcount-to-count-number-of-entries-in-IOMMU/20231222-172957
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
patch link:    https://lore.kernel.org/r/20231221031915.619337-2-pasha.tatashin%40soleen.com
patch subject: [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20231225/202312251813.JIQrzgR1-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231225/202312251813.JIQrzgR1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312251813.JIQrzgR1-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/iommu/intel/dmar.c:33:
>> drivers/iommu/intel/iommu.h:823:15: error: invalid output size for constraint '+q'
           old_pteval = xchg(&pte->val, 0ULL);
                        ^
   include/linux/atomic/atomic-instrumented.h:4716:2: note: expanded from macro 'xchg'
           raw_xchg(__ai_ptr, __VA_ARGS__); \
           ^
   include/linux/atomic/atomic-arch-fallback.h:12:18: note: expanded from macro 'raw_xchg'
   #define raw_xchg arch_xchg
                    ^
   arch/x86/include/asm/cmpxchg.h:78:27: note: expanded from macro 'arch_xchg'
   #define arch_xchg(ptr, v)       __xchg_op((ptr), (v), xchg, "")
                                   ^
   arch/x86/include/asm/cmpxchg.h:48:19: note: expanded from macro '__xchg_op'
                                         : "+q" (__ret), "+m" (*(ptr))     \
                                                 ^
   1 error generated.


vim +823 drivers/iommu/intel/iommu.h

   818	
   819	static inline void dma_clear_pte(struct dma_pte *pte)
   820	{
   821		u64 old_pteval;
   822	
 > 823		old_pteval = xchg(&pte->val, 0ULL);
   824		if (DMA_PTEVAL_PRESENT(old_pteval)) {
   825			struct page *pg = virt_to_page(pte);
   826			int rc = page_ref_dec_return(pg);
   827	
   828			WARN_ON_ONCE(rc > 512 || rc < 1);
   829		} else {
   830			/* Ensure that we cleared a valid entry from the page table */
   831			WARN_ON(1);
   832		}
   833	}
   834	
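
One possible way around this on 32-bit builds (an untested sketch, not
part of the series): i386 has no 8-byte xchg(), but it does provide
cmpxchg8b, so the clear could be built on try_cmpxchg64() instead:

static inline u64 dma_pte_clear_val(struct dma_pte *pte)
{
	u64 old = READ_ONCE(pte->val);

	/* try_cmpxchg64() updates 'old' on failure, so simply retry. */
	while (!try_cmpxchg64(&pte->val, &old, 0ULL))
		;

	return old;
}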

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
  2023-12-25  2:59   ` Liu, Jingqi
@ 2023-12-28 15:01     ` Pasha Tatashin
  0 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2023-12-28 15:01 UTC (permalink / raw)
  To: Liu, Jingqi
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Sun, Dec 24, 2023 at 9:59 PM Liu, Jingqi <jingqi.liu@intel.com> wrote:
>
> On 12/21/2023 11:19 AM, Pasha Tatashin wrote:
> > In order to be able to efficiently free empty page table levels, count the
> > number of entries in each page table my incremeanting and decremeanting
> s/incremeanting/incrementing
> s/decremeanting/decrementing
>
> > refcount every time a PTE is inserted or removed form the page table.
> s/form/from
>
> >
> > For this to work correctly, add two helper functions, dma_clear_pte()
> > and dma_set_pte(), where the counting is performed.
> >
> > Also, modify the code so every page table entry is always updated using the
> > two new functions.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   drivers/iommu/intel/iommu.c | 40 +++++++++++++++++++++---------------
> >   drivers/iommu/intel/iommu.h | 41 +++++++++++++++++++++++++++++++------
> >   2 files changed, 58 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 897159dba47d..4688ef797161 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -949,7 +949,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
> >                       if (domain->use_first_level)
> >                               pteval |= DMA_FL_PTE_XD | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;
> >
> > -                     if (cmpxchg64(&pte->val, 0ULL, pteval))
> > +                     if (dma_set_pte(pte, pteval))
> >                               /* Someone else set it while we were thinking; use theirs. */
> >                               free_pgtable_page(tmp_page);
> >                       else
> > @@ -1021,7 +1021,8 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
> >                       continue;
> >               }
> >               do {
> > -                     dma_clear_pte(pte);
> > +                     if (dma_pte_present(pte))
> > +                             dma_clear_pte(pte);
> >                       start_pfn += lvl_to_nr_pages(large_page);
> >                       pte++;
> >               } while (start_pfn <= last_pfn && !first_pte_in_page(pte));
> > @@ -1062,7 +1063,8 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
> >                */
> >               if (level < retain_level && !(start_pfn > level_pfn ||
> >                     last_pfn < level_pfn + level_size(level) - 1)) {
> > -                     dma_clear_pte(pte);
> > +                     if (dma_pte_present(pte))
> > +                             dma_clear_pte(pte);
> >                       domain_flush_cache(domain, pte, sizeof(*pte));
> >                       free_pgtable_page(level_pte);
> >               }
> > @@ -1093,12 +1095,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
> >       }
> >   }
> >
> > -/* When a page at a given level is being unlinked from its parent, we don't
> > -   need to *modify* it at all. All we need to do is make a list of all the
> > -   pages which can be freed just as soon as we've flushed the IOTLB and we
> > -   know the hardware page-walk will no longer touch them.
> > -   The 'pte' argument is the *parent* PTE, pointing to the page that is to
> > -   be freed. */
> > +/*
> > + * A given page at a given level is being unlinked from its parent.
> > + * We need to make a list of all the pages which can be freed just as soon as
> > + * we've flushed the IOTLB and we know the hardware page-walk will no longer
> > + * touch them. The 'pte' argument is the *parent* PTE, pointing to the page
> > + * that is to be freed.
> > + */
> >   static void dma_pte_list_pagetables(struct dmar_domain *domain,
> >                                   int level, struct dma_pte *pte,
> >                                   struct list_head *freelist)
> > @@ -1106,17 +1109,20 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
> >       struct page *pg;
> >
> >       pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
> > -     list_add_tail(&pg->lru, freelist);
> > -
> > -     if (level == 1)
> > -             return;
> > -
> >       pte = page_address(pg);
> > +
> >       do {
> > -             if (dma_pte_present(pte) && !dma_pte_superpage(pte))
> > -                     dma_pte_list_pagetables(domain, level - 1, pte, freelist);
> > +             if (dma_pte_present(pte)) {
> > +                     if (level > 1 && !dma_pte_superpage(pte)) {
> > +                             dma_pte_list_pagetables(domain, level - 1, pte,
> > +                                                     freelist);
> > +                     }
> > +                     dma_clear_pte(pte);
> > +             }
> >               pte++;
> >       } while (!first_pte_in_page(pte));
> > +
> > +     list_add_tail(&pg->lru, freelist);
> >   }
> >
> How about calculating the page refcount decrement when the pages in the
> freelist are actually freed, in iommu_free_pgtbl_pages()?


Hi Jingqi,

Thank you for looking at this.

My understanding is that you are suggesting counting the number of
entries and subtracting them from the refcount only once, before
adding the page to the freelist. Is this correct?

We could do that, but having dma_clear_pte() called for each entry is
very beneficial, as it would let us extend IOMMU page tables with other
debug technologies, such as page_table_check, if every single entry in
the page table is added and removed via dma_set_pte() and
dma_clear_pte(). Since we are already clearing the PTE, there is no
reason not to change the refcount at the same time.

Pasha

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-12-28 15:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-21  3:19 [RFC 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
2023-12-21  3:19 ` [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU Pasha Tatashin
2023-12-25  2:59   ` Liu, Jingqi
2023-12-28 15:01     ` Pasha Tatashin
2023-12-25 11:03   ` kernel test robot
2023-12-21  3:19 ` [RFC 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
2023-12-21  3:19 ` [RFC 3/3] iommu/intel: free empty page tables on unmaps Pasha Tatashin
2023-12-21  4:16 ` [RFC 0/3] iommu/intel: Free " Matthew Wilcox
2023-12-21  5:13   ` Pasha Tatashin
2023-12-21  5:42     ` Pasha Tatashin
2023-12-21 14:06       ` Matthew Wilcox
2023-12-21 14:58         ` Pasha Tatashin
