linux-mm.kvack.org archive mirror
* [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
@ 2024-04-26  3:43 Pasha Tatashin
  2024-04-26  3:43 ` [RFC v2 1/3] iommu/intel: Use page->_mapcount to count number of entries in IOMMU Pasha Tatashin
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Pasha Tatashin @ 2024-04-26  3:43 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

Changelog
================================================================
v2: Use mapcount instead of refcount
    Synchronized with IOMMU Observability changes.
================================================================

This series frees empty page tables on unmaps. It is intended to be a
low-overhead feature.

A reader/writer lock is used to synchronize page table access, and most
of the time it is held as a reader. It is held as a writer for a short
period of time when freeing a page table that covers more than the
current IOVA request. In all other cases this lock is read-only.

page->_mapcount is used to track the number of entries in each page
table.
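
As a rough illustration of the counting scheme (not the exact patch code;
the helper names below are made up for this sketch, the real helpers are
dma_set_pte()/dma_clear_pte() in patch 1):

/* A page-table page starts with _mapcount == 0 (set at allocation). */
static inline void pgtable_count_inc(struct dma_pte *pte)
{
	/* called when a PTE in this table becomes present */
	atomic_inc(&virt_to_page(pte)->_mapcount);
}

static inline void pgtable_count_dec(struct dma_pte *pte)
{
	/* called when a PTE in this table is cleared */
	atomic_dec(&virt_to_page(pte)->_mapcount);
}

static inline bool pgtable_page_empty(struct page *pg)
{
	/* back to zero: the table holds no entries and may be freed */
	return atomic_read(&pg->_mapcount) == 0;
}

A table whose count returns to zero is unlinked from its parent and put on
the freelist, after page_mapcount_reset() restores the mapcount so the page
can be returned to the allocator.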

Microbenchmark data using iova_stress[1]:

Base:
$ ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~  32847M time:   36.074s

Fix:
$ ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~     27M time:   38.870s

The test maps/unmaps 4K pages and cycles through the IOVA space in a tight loop.
Base uses 32G of memory, and the test completes in 36.074s.
Fix uses ~0G (27M) of memory, and the test completes in 38.870s.

I believe the proposed fix is a good compromise in terms of complexity
and scalability. A more scalable solution would be to use a reader/writer
lock per page table, and use the page->private field to store the lock
itself.

However, since the IOMMU code already has some protection (i.e. no one
else touches the IOVA space of the requested map/unmap), we can avoid the
extra complexity, rely on a single per-domain RW lock, and be in reader
mode most of the time.
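
A minimal sketch of the resulting locking pattern (hypothetical function;
it assumes the per-domain rwlock_t pgd_lock introduced in patch 2):

static void unmap_and_free_tables(struct dmar_domain *domain,
				  unsigned long start_pfn,
				  unsigned long last_pfn)
{
	bool child_empty = false;

	read_lock(&domain->pgd_lock);	/* common case: reader */
	/* ... clear PTEs covering [start_pfn, last_pfn], note empty child ... */

	if (child_empty) {
		/*
		 * Freeing a table that covers IOVA space outside the request
		 * needs exclusive access, so upgrade to the writer lock.
		 */
		read_unlock(&domain->pgd_lock);
		write_lock(&domain->pgd_lock);
		/* ... re-check that the table is still empty, then free it ... */
		write_unlock(&domain->pgd_lock);
		read_lock(&domain->pgd_lock);
	}
	read_unlock(&domain->pgd_lock);
}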

[1] https://github.com/soleen/iova_stress

Pasha Tatashin (3):
  iommu/intel: Use page->_mapcount to count number of entries in IOMMU
  iommu/intel: synchronize page table map and unmap operations
  iommu/intel: free empty page tables on unmaps

 drivers/iommu/intel/iommu.c | 154 ++++++++++++++++++++++++++++--------
 drivers/iommu/intel/iommu.h |  42 ++++++++--
 drivers/iommu/iommu-pages.h |  30 +++++--
 3 files changed, 180 insertions(+), 46 deletions(-)

-- 
2.44.0.769.g3c40516874-goog




* [RFC v2 1/3] iommu/intel: Use page->_mapcount to count number of entries in IOMMU
  2024-04-26  3:43 [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
@ 2024-04-26  3:43 ` Pasha Tatashin
  2024-04-26  3:43 ` [RFC v2 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Pasha Tatashin @ 2024-04-26  3:43 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

In order to be able to efficiently free empty page table levels, count
the number of entries in each page table by incrementing and
decrementing the mapcount every time a PTE is inserted into or removed
from the page table.

For this to work correctly, add two helper functions, dma_clear_pte()
and dma_set_pte(), where the counting is performed.

Also, modify the code so every page table entry is always updated using
the two new functions.

Finally, before pages are freed, we must restore mapcount to its original
state by calling page_mapcount_reset().

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 42 ++++++++++++++++++++++---------------
 drivers/iommu/intel/iommu.h | 39 ++++++++++++++++++++++++++++------
 drivers/iommu/iommu-pages.h | 30 +++++++++++++++++++-------
 3 files changed, 81 insertions(+), 30 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7abe76f92a3c..1bfb6eccad05 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -862,7 +862,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
 			if (domain->use_first_level)
 				pteval |= DMA_FL_PTE_XD | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;
 
-			if (cmpxchg64(&pte->val, 0ULL, pteval))
+			if (dma_set_pte(pte, pteval))
 				/* Someone else set it while we were thinking; use theirs. */
 				iommu_free_page(tmp_page);
 			else
@@ -934,7 +934,8 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
 			continue;
 		}
 		do {
-			dma_clear_pte(pte);
+			if (dma_pte_present(pte))
+				dma_clear_pte(pte);
 			start_pfn += lvl_to_nr_pages(large_page);
 			pte++;
 		} while (start_pfn <= last_pfn && !first_pte_in_page(pte));
@@ -975,7 +976,8 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
 		 */
 		if (level < retain_level && !(start_pfn > level_pfn ||
 		      last_pfn < level_pfn + level_size(level) - 1)) {
-			dma_clear_pte(pte);
+			if (dma_pte_present(pte))
+				dma_clear_pte(pte);
 			domain_flush_cache(domain, pte, sizeof(*pte));
 			iommu_free_page(level_pte);
 		}
@@ -1006,12 +1008,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	}
 }
 
-/* When a page at a given level is being unlinked from its parent, we don't
-   need to *modify* it at all. All we need to do is make a list of all the
-   pages which can be freed just as soon as we've flushed the IOTLB and we
-   know the hardware page-walk will no longer touch them.
-   The 'pte' argument is the *parent* PTE, pointing to the page that is to
-   be freed. */
+/*
+ * A given page at a given level is being unlinked from its parent.
+ * We need to make a list of all the pages which can be freed just as soon as
+ * we've flushed the IOTLB and we know the hardware page-walk will no longer
+ * touch them. The 'pte' argument is the *parent* PTE, pointing to the page
+ * that is to be freed.
+ */
 static void dma_pte_list_pagetables(struct dmar_domain *domain,
 				    int level, struct dma_pte *pte,
 				    struct list_head *freelist)
@@ -1019,17 +1022,21 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 	struct page *pg;
 
 	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
-	list_add_tail(&pg->lru, freelist);
-
-	if (level == 1)
-		return;
-
 	pte = page_address(pg);
+
 	do {
-		if (dma_pte_present(pte) && !dma_pte_superpage(pte))
-			dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+		if (dma_pte_present(pte)) {
+			if (level > 1 && !dma_pte_superpage(pte)) {
+				dma_pte_list_pagetables(domain, level - 1, pte,
+							freelist);
+			}
+			dma_clear_pte(pte);
+		}
 		pte++;
 	} while (!first_pte_in_page(pte));
+
+	page_mapcount_reset(pg);
+	list_add_tail(&pg->lru, freelist);
 }
 
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
@@ -1093,6 +1100,7 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
 		struct page *pgd_page = virt_to_page(domain->pgd);
+		page_mapcount_reset(pgd_page);
 		list_add_tail(&pgd_page->lru, freelist);
 		domain->pgd = NULL;
 	}
@@ -2113,7 +2121,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 		/* We don't need lock here, nobody else
 		 * touches the iova range
 		 */
-		tmp = cmpxchg64_local(&pte->val, 0ULL, pteval);
+		tmp = dma_set_pte(pte, pteval);
 		if (tmp) {
 			static int dumps = 5;
 			pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 8d081d8c6f41..e5c1eb23897f 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -814,11 +814,6 @@ struct dma_pte {
 	u64 val;
 };
 
-static inline void dma_clear_pte(struct dma_pte *pte)
-{
-	pte->val = 0;
-}
-
 static inline u64 dma_pte_addr(struct dma_pte *pte)
 {
 #ifdef CONFIG_64BIT
@@ -830,9 +825,41 @@ static inline u64 dma_pte_addr(struct dma_pte *pte)
 #endif
 }
 
+#define DMA_PTEVAL_PRESENT(pteval) (((pteval) & 3) != 0)
 static inline bool dma_pte_present(struct dma_pte *pte)
 {
-	return (pte->val & 3) != 0;
+	return DMA_PTEVAL_PRESENT(pte->val);
+}
+
+static inline void dma_clear_pte(struct dma_pte *pte)
+{
+	u64 old_pteval;
+
+	old_pteval = xchg(&pte->val, 0ULL);
+	if (DMA_PTEVAL_PRESENT(old_pteval)) {
+		struct page *pg = virt_to_page(pte);
+
+		atomic_dec(&pg->_mapcount);
+	} else {
+		/* Ensure that we cleared a valid entry from the page table */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static inline u64 dma_set_pte(struct dma_pte *pte, u64 pteval)
+{
+	u64 old_pteval;
+
+	/* Ensure we are about to set a valid entry in the page table */
+	WARN_ON_ONCE(!DMA_PTEVAL_PRESENT(pteval));
+	old_pteval = cmpxchg64(&pte->val, 0ULL, pteval);
+	if (old_pteval == 0) {
+		struct page *pg = virt_to_page(pte);
+
+		atomic_inc(&pg->_mapcount);
+	}
+
+	return old_pteval;
 }
 
 static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
index 82ebf0033081..b8b332951944 100644
--- a/drivers/iommu/iommu-pages.h
+++ b/drivers/iommu/iommu-pages.h
@@ -119,7 +119,8 @@ static inline void *iommu_alloc_pages(gfp_t gfp, int order)
 }
 
 /**
- * iommu_alloc_page_node - allocate a zeroed page at specific NUMA node.
+ * iommu_alloc_page_node - allocate a zeroed page at specific NUMA node, and set
+ *                         mapcount in its struct page to 0.
  * @nid: memory NUMA node id
  * @gfp: buddy allocator flags
  *
@@ -127,18 +128,29 @@ static inline void *iommu_alloc_pages(gfp_t gfp, int order)
  */
 static inline void *iommu_alloc_page_node(int nid, gfp_t gfp)
 {
-	return iommu_alloc_pages_node(nid, gfp, 0);
+	void *virt = iommu_alloc_pages_node(nid, gfp, 0);
+
+	if (virt)
+		atomic_set(&(virt_to_page(virt))->_mapcount, 0);
+
+	return virt;
 }
 
 /**
- * iommu_alloc_page - allocate a zeroed page
+ * iommu_alloc_page - allocate a zeroed page, and set mapcount in its struct
+ *                    page to 0.
  * @gfp: buddy allocator flags
  *
  * returns the virtual address of the allocated page
  */
 static inline void *iommu_alloc_page(gfp_t gfp)
 {
-	return iommu_alloc_pages(gfp, 0);
+	void *virt = iommu_alloc_pages(gfp, 0);
+
+	if (virt)
+		atomic_set(&(virt_to_page(virt))->_mapcount, 0);
+
+	return virt;
 }
 
 /**
@@ -155,16 +167,19 @@ static inline void iommu_free_pages(void *virt, int order)
 }
 
 /**
- * iommu_free_page - free page
+ * iommu_free_page - free page, and reset mapcount
  * @virt: virtual address of the page to be freed.
  */
 static inline void iommu_free_page(void *virt)
 {
-	iommu_free_pages(virt, 0);
+	if (virt) {
+		page_mapcount_reset(virt_to_page(virt));
+		iommu_free_pages(virt, 0);
+	}
 }
 
 /**
- * iommu_put_pages_list - free a list of pages.
+ * iommu_put_pages_list - free a list of pages, and reset mapcount.
  * @page: the head of the lru list to be freed.
  *
  * There are no locking requirement for these pages, as they are going to be
@@ -177,6 +192,7 @@ static inline void iommu_put_pages_list(struct list_head *page)
 	while (!list_empty(page)) {
 		struct page *p = list_entry(page->prev, struct page, lru);
 
+		page_mapcount_reset(p);
 		list_del(&p->lru);
 		__iommu_free_account(p, 0);
 		put_page(p);
-- 
2.44.0.769.g3c40516874-goog




* [RFC v2 2/3] iommu/intel: synchronize page table map and unmap operations
  2024-04-26  3:43 [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
  2024-04-26  3:43 ` [RFC v2 1/3] iommu/intel: Use page->_mapcount to count number of entries in IOMMU Pasha Tatashin
@ 2024-04-26  3:43 ` Pasha Tatashin
  2024-04-29 14:56   ` Jason Gunthorpe
  2024-04-26  3:43 ` [RFC v2 3/3] iommu/intel: free empty page tables on unmaps Pasha Tatashin
  2024-04-26  6:42 ` [RFC v2 0/3] iommu/intel: Free " David Hildenbrand
  3 siblings, 1 reply; 10+ messages in thread
From: Pasha Tatashin @ 2024-04-26  3:43 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

Since we are going to update parent page table entries when lower-level
page tables become empty and are added to the free list, we need a way
to synchronize the operation.

Use domain->pgd_lock to protect all map and unmap operations. This is a
reader/writer lock. In this patch everything runs in read-only mode;
later, when freeing page tables on unmap is added, a writer section will
be added as well.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 21 +++++++++++++++++++--
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1bfb6eccad05..8c7e596728b5 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -995,11 +995,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 				   unsigned long last_pfn,
 				   int retain_level)
 {
+	read_lock(&domain->pgd_lock);
 	dma_pte_clear_range(domain, start_pfn, last_pfn);
 
 	/* We don't need lock here; nobody else touches the iova range */
 	dma_pte_free_level(domain, agaw_to_level(domain->agaw), retain_level,
 			   domain->pgd, 0, start_pfn, last_pfn);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -1093,9 +1095,11 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	    WARN_ON(start_pfn > last_pfn))
 		return;
 
+	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
 			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2088,6 +2092,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 	pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | attr;
 
+	read_lock(&domain->pgd_lock);
 	while (nr_pages > 0) {
 		uint64_t tmp;
 
@@ -2097,8 +2102,10 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 			pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
 					     gfp);
-			if (!pte)
+			if (!pte) {
+				read_unlock(&domain->pgd_lock);
 				return -ENOMEM;
+			}
 			first_pte = pte;
 
 			lvl_pages = lvl_to_nr_pages(largepage_lvl);
@@ -2158,6 +2165,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			pte = NULL;
 		}
 	}
+	read_unlock(&domain->pgd_lock);
 
 	return 0;
 }
@@ -3829,6 +3837,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 	domain->pgd = iommu_alloc_page_node(domain->nid, GFP_ATOMIC);
 	if (!domain->pgd)
 		return -ENOMEM;
+	rwlock_init(&domain->pgd_lock);
 	domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
 	return 0;
 }
@@ -4074,11 +4083,15 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	/* Cope with horrid API which requires us to unmap more than the
 	   size argument if it happens to be a large-page mapping. */
 	if (unlikely(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
-				     &level, GFP_ATOMIC)))
+				     &level, GFP_ATOMIC))) {
+		read_unlock(&dmar_domain->pgd_lock);
 		return 0;
+	}
+	read_unlock(&dmar_domain->pgd_lock);
 
 	if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
 		size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4145,8 +4158,10 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 	int level = 0;
 	u64 phys = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
 			     GFP_ATOMIC);
+	read_unlock(&dmar_domain->pgd_lock);
 	if (pte && dma_pte_present(pte))
 		phys = dma_pte_addr(pte) +
 			(iova & (BIT_MASK(level_to_offset_bits(level) +
@@ -4801,8 +4816,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 		struct dma_pte *pte;
 		int lvl = 0;
 
+		read_lock(&dmar_domain->pgd_lock);
 		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
 				     GFP_ATOMIC);
+		read_unlock(&dmar_domain->pgd_lock);
 		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
 		if (!pte || !dma_pte_present(pte)) {
 			iova += pgsize;
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index e5c1eb23897f..2f38b087ea4f 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -615,6 +615,9 @@ struct dmar_domain {
 		struct {
 			/* virtual address */
 			struct dma_pte	*pgd;
+
+			/* Synchronizes pgd map/unmap operations */
+			rwlock_t	pgd_lock;
 			/* max guest address width */
 			int		gaw;
 			/*
-- 
2.44.0.769.g3c40516874-goog




* [RFC v2 3/3] iommu/intel: free empty page tables on unmaps
  2024-04-26  3:43 [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
  2024-04-26  3:43 ` [RFC v2 1/3] iommu/intel: Use page->_mapcount to count number of entries in IOMMU Pasha Tatashin
  2024-04-26  3:43 ` [RFC v2 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
@ 2024-04-26  3:43 ` Pasha Tatashin
  2024-04-26  6:42 ` [RFC v2 0/3] iommu/intel: Free " David Hildenbrand
  3 siblings, 0 replies; 10+ messages in thread
From: Pasha Tatashin @ 2024-04-26  3:43 UTC (permalink / raw)
  To: akpm, linux-mm, pasha.tatashin, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

When page tables become empty, add them to the freelist so that they
can also be freed.

This means that page tables outside of the immediate IOVA range might be
freed as well; therefore, the writer lock is taken only in the case where
such page tables are going to be freed.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/iommu/intel/iommu.c | 91 +++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8c7e596728b5..2dedcd4f6060 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1044,7 +1044,7 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				struct dma_pte *pte, unsigned long pfn,
 				unsigned long start_pfn, unsigned long last_pfn,
-				struct list_head *freelist)
+				struct list_head *freelist, int *freed_level)
 {
 	struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1070,11 +1070,47 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				first_pte = pte;
 			last_pte = pte;
 		} else if (level > 1) {
+			struct dma_pte *npte = phys_to_virt(dma_pte_addr(pte));
+			struct page *npage = virt_to_page(npte);
+
 			/* Recurse down into a level that isn't *entirely* obsolete */
-			dma_pte_clear_level(domain, level - 1,
-					    phys_to_virt(dma_pte_addr(pte)),
+			dma_pte_clear_level(domain, level - 1, npte,
 					    level_pfn, start_pfn, last_pfn,
-					    freelist);
+					    freelist, freed_level);
+
+			/*
+			 * Free next level page table if it became empty.
+			 *
+			 * We are only holding the reader lock, and it is
+			 * possible that other threads are accessing the page
+			 * table as readers as well. A page table that is
+			 * outside of the requested IOVA space can only be
+			 * freed while holding the writer lock. Since we need
+			 * to drop the reader lock, increment the mapcount of
+			 * npage so that it (and the current page table) does
+			 * not disappear due to concurrent unmapping threads.
+			 *
+			 * Store the maximum level of the freed page tables
+			 * into freed_level, so that the size of the IOTLB
+			 * flush can be determined.
+			 */
+			if (freed_level && !atomic_read(&npage->_mapcount)) {
+				atomic_inc(&npage->_mapcount);
+				read_unlock(&domain->pgd_lock);
+				write_lock(&domain->pgd_lock);
+				atomic_dec(&npage->_mapcount);
+				if (!atomic_read(&npage->_mapcount)) {
+					dma_clear_pte(pte);
+					if (!first_pte)
+						first_pte = pte;
+					last_pte = pte;
+					page_mapcount_reset(npage);
+					list_add_tail(&npage->lru, freelist);
+					*freed_level = level;
+				}
+				write_unlock(&domain->pgd_lock);
+				read_lock(&domain->pgd_lock);
+			}
 		}
 next:
 		pfn = level_pfn + level_size(level);
@@ -1089,7 +1125,8 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
    the page tables, and may have cached the intermediate levels. The
    pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-			 unsigned long last_pfn, struct list_head *freelist)
+			 unsigned long last_pfn, struct list_head *freelist,
+			 int *level)
 {
 	if (WARN_ON(!domain_pfn_supported(domain, last_pfn)) ||
 	    WARN_ON(start_pfn > last_pfn))
@@ -1098,7 +1135,8 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+			    domain->pgd, 0, start_pfn, last_pfn, freelist,
+			    level);
 	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
@@ -1479,11 +1517,11 @@ static void __iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
 				  struct dmar_domain *domain,
-				  unsigned long pfn, unsigned int pages,
+				  unsigned long pfn, unsigned long pages,
 				  int ih, int map)
 {
-	unsigned int aligned_pages = __roundup_pow_of_two(pages);
-	unsigned int mask = ilog2(aligned_pages);
+	unsigned long aligned_pages = __roundup_pow_of_two(pages);
+	unsigned long mask = ilog2(aligned_pages);
 	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
 	u16 did = domain_id_iommu(domain, iommu);
 
@@ -1837,7 +1875,8 @@ static void domain_exit(struct dmar_domain *domain)
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
 
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+			     NULL);
 		iommu_put_pages_list(&freelist);
 	}
 
@@ -3419,7 +3458,8 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 			struct intel_iommu *iommu;
 			LIST_HEAD(freelist);
 
-			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
+			domain_unmap(si_domain, start_vpfn, last_vpfn,
+				     &freelist, NULL);
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
@@ -4080,6 +4120,7 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *gather)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	bool queued = iommu_iotlb_gather_queued(gather);
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
@@ -4099,7 +4140,16 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	start_pfn = iova >> VTD_PAGE_SHIFT;
 	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
 
-	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);
+	/*
+	 * Pass level only if !queued, which means we will do an IOTLB
+	 * flush callback before freeing pages from the freelist.
+	 *
+	 * When level is passed, domain_unmap() will attempt to add empty
+	 * page tables to the freelist, and pass back the level of the
+	 * highest page table that was added to the freelist.
+	 */
+	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist,
+		     queued ? NULL : &level);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
@@ -4108,8 +4158,21 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	 * We do not use page-selective IOTLB invalidation in flush queue,
 	 * so there is no need to track page and sync iotlb.
 	 */
-	if (!iommu_iotlb_gather_queued(gather))
-		iommu_iotlb_gather_add_page(domain, gather, iova, size);
+	if (!queued) {
+		size_t sz = size;
+
+		/*
+		 * Expand the flush range if level was returned, as it means
+		 * we are also freeing some page tables.
+		 */
+		if (level) {
+			unsigned long pgsize = level_size(level) << VTD_PAGE_SHIFT;
+
+			iova = ALIGN_DOWN(iova, pgsize);
+			sz = ALIGN(size, pgsize);
+		}
+		iommu_iotlb_gather_add_page(domain, gather, iova, sz);
+	}
 
 	return size;
 }
-- 
2.44.0.769.g3c40516874-goog




* Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
  2024-04-26  3:43 [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps Pasha Tatashin
                   ` (2 preceding siblings ...)
  2024-04-26  3:43 ` [RFC v2 3/3] iommu/intel: free empty page tables on unmaps Pasha Tatashin
@ 2024-04-26  6:42 ` David Hildenbrand
  2024-04-26 13:49   ` Pasha Tatashin
  3 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-04-26  6:42 UTC (permalink / raw)
  To: Pasha Tatashin, akpm, linux-mm, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu, Matthew Wilcox

On 26.04.24 05:43, Pasha Tatashin wrote:
> Changelog
> ================================================================
> v2: Use mapcount instead of refcount
>      Synchronized with IOMMU Observability changes.
> ================================================================
> 
> This series frees empty page tables on unmaps. It intends to be a
> low overhead feature.
> 
> The read-writer lock is used to synchronize page table, but most of
> time the lock is held is reader. It is held as a writer for short
> period of time when unmapping a page that is bigger than the current
> iova request. For all other cases this lock is read-only.
> 
> page->mapcount is used in order to track number of entries at each page
> table.

I'm wondering if this will conflict with page_type at some point? We're 
already converting other page table users to ptdesc. CCing Willy.

-- 
Cheers,

David / dhildenb




* Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
  2024-04-26  6:42 ` [RFC v2 0/3] iommu/intel: Free " David Hildenbrand
@ 2024-04-26 13:49   ` Pasha Tatashin
  2024-04-26 14:39     ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Pasha Tatashin @ 2024-04-26 13:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu, Matthew Wilcox

On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.04.24 05:43, Pasha Tatashin wrote:
> > Changelog
> > ================================================================
> > v2: Use mapcount instead of refcount
> >      Synchronized with IOMMU Observability changes.
> > ================================================================
> >
> > This series frees empty page tables on unmaps. It intends to be a
> > low overhead feature.
> >
> > The read-writer lock is used to synchronize page table, but most of
> > time the lock is held is reader. It is held as a writer for short
> > period of time when unmapping a page that is bigger than the current
> > iova request. For all other cases this lock is read-only.
> >
> > page->mapcount is used in order to track number of entries at each page
> > table.
>
> I'm wondering if this will conflict with page_type at some point? We're
> already converting other page table users to ptdesc. CCing Willy.

Hi David,

This conflicts with the following comment in mm_types.h:
 * If your page will not be mapped to userspace, you can also use the four
 * bytes in the mapcount union, but you must call page_mapcount_reset()
 * before freeing it.

Thank you,
Pasha



* Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
  2024-04-26 13:49   ` Pasha Tatashin
@ 2024-04-26 14:39     ` David Hildenbrand
  2024-04-26 19:39       ` Matthew Wilcox
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-04-26 14:39 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu, Matthew Wilcox

On 26.04.24 15:49, Pasha Tatashin wrote:
> On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.04.24 05:43, Pasha Tatashin wrote:
>>> Changelog
>>> ================================================================
>>> v2: Use mapcount instead of refcount
>>>       Synchronized with IOMMU Observability changes.
>>> ================================================================
>>>
>>> This series frees empty page tables on unmaps. It intends to be a
>>> low overhead feature.
>>>
>>> The read-writer lock is used to synchronize page table, but most of
>>> time the lock is held is reader. It is held as a writer for short
>>> period of time when unmapping a page that is bigger than the current
>>> iova request. For all other cases this lock is read-only.
>>>
>>> page->mapcount is used in order to track number of entries at each page
>>> table.
>>
>> I'm wondering if this will conflict with page_type at some point? We're
>> already converting other page table users to ptdesc. CCing Willy.
> 
> Hi David,

Hi!

> 
> This contradicts with the following comment in mm_types.h:
>   * If your page will not be mapped to userspace, you can also use the four
>   * bytes in the mapcount union, but you must call
> page_mapcount_reset()
>   * before freeing it.

I think the documentation is a bit outdated, because we now have page 
types that are: "For pages that are never mapped to userspace"

which includes

#define PG_table

(we should update that comment, because we're now also using it for 
hugetlb that can be mapped to user space, which is fine.)

Right now, using page->_mapcount would likely still be fine, as long as 
you cannot end up creating a value that would resemble a type (e.g., 
PG_offline could be bad).
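
(For reference, the overlap comes from _mapcount and page_type sharing the
same four bytes in struct page; roughly, paraphrasing include/linux/mm_types.h:)

	union {				/* this union is 4 bytes in size */
		atomic_t _mapcount;	/* userspace mapping count */
		unsigned int page_type;	/* e.g. PG_table, PG_offline, ... */
	};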

But staring at users of _mapcount and page_mapcount_reset() ... you'd be 
pretty much the only user of that.

mm/zsmalloc.c calls page_mapcount_reset(), and I am not completely sure 
why ... I can see it touch page->index but not page->_mapcount.


Hopefully Willy can comment.

-- 
Cheers,

David / dhildenb




* Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
  2024-04-26 14:39     ` David Hildenbrand
@ 2024-04-26 19:39       ` Matthew Wilcox
  2024-04-29 14:46         ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2024-04-26 19:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, akpm, linux-mm, linux-kernel, rientjes, dwmw2,
	baolu.lu, joro, will, robin.murphy, iommu

On Fri, Apr 26, 2024 at 04:39:05PM +0200, David Hildenbrand wrote:
> On 26.04.24 15:49, Pasha Tatashin wrote:
> > On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 26.04.24 05:43, Pasha Tatashin wrote:
> > > > Changelog
> > > > ================================================================
> > > > v2: Use mapcount instead of refcount
> > > >       Synchronized with IOMMU Observability changes.
> > > > ================================================================
> > > > 
> > > > This series frees empty page tables on unmaps. It intends to be a
> > > > low overhead feature.
> > > > 
> > > > The read-writer lock is used to synchronize page table, but most of
> > > > time the lock is held is reader. It is held as a writer for short
> > > > period of time when unmapping a page that is bigger than the current
> > > > iova request. For all other cases this lock is read-only.
> > > > 
> > > > page->mapcount is used in order to track number of entries at each page
> > > > table.
> > > 
> > > I'm wondering if this will conflict with page_type at some point? We're
> > > already converting other page table users to ptdesc. CCing Willy.
> > 
> > Hi David,
> 
> Hi!
> 
> > 
> > This contradicts with the following comment in mm_types.h:
> >   * If your page will not be mapped to userspace, you can also use the four
> >   * bytes in the mapcount union, but you must call
> > page_mapcount_reset()
> >   * before freeing it.
> 
> I think the documentation is a bit outdated, because we now have page types
> that are: "For pages that are never mapped to userspace"
> 
> which includes
> 
> #define PG_table
> 
> (we should update that comment, because we're now also using it for hugetlb
> that can be mapped to user space, which is fine.)
> 
> Right now, using page->_mapcount would likely still be fine, as long as you
> cannot end up creating a value that would resemble a type (e.g., PG_offline
> could be bad).
> 
> But staring at users of _mapcount and page_mapcount_reset() ... you'd be
> pretty much the only user of that.
> 
> mm/zsmalloc.c calls page_mapcount_reset(), and I am not completely sure why
> ... I can see it touch page->index but not page->_mapcount.
> 
> 
> Hopefully Willy can comment.

I feel like I have to say "no" to Pasha far too often ;-(

Agreed the documentation is out of date.

I think there's a lot of space in the struct page that can be used.
These are iommu page tables, not cpu page tables, so things are a bit
different for them.  But should they be converted to use ptdesc?  Maybe!

I'd suggest putting this into the union with pt_mm and pt_frag_refcount.
I think it could even go in the union with pt_list, but I think I'd
rather see it in the pt_mm union.
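
For illustration only (not something proposed in this thread): if these
tables were converted to ptdesc, the count could reuse the atomic_t
pt_frag_refcount that shares a union with pt_mm, e.g.:

	struct ptdesc *pt = page_ptdesc(pg);

	atomic_inc(&pt->pt_frag_refcount);	/* a PTE became present */
	atomic_dec(&pt->pt_frag_refcount);	/* a PTE was cleared */
	if (!atomic_read(&pt->pt_frag_refcount)) {
		/* table is empty and may be freed */
	}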



* Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
  2024-04-26 19:39       ` Matthew Wilcox
@ 2024-04-29 14:46         ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 14:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Pasha Tatashin, akpm, linux-mm, linux-kernel,
	rientjes, dwmw2, baolu.lu, joro, will, robin.murphy, iommu

On Fri, Apr 26, 2024 at 08:39:14PM +0100, Matthew Wilcox wrote:

> I think there's a lot of space in the struct page that can be used.
> These are iommu page tables, not cpu page tables, so things are a bit
> different for them.  But should they be converted to use ptdesc?  Maybe!

Definitely! Someday we will need more stuff in here..

Jason



* Re: [RFC v2 2/3] iommu/intel: synchronize page table map and unmap operations
  2024-04-26  3:43 ` [RFC v2 2/3] iommu/intel: synchronize page table map and unmap operations Pasha Tatashin
@ 2024-04-29 14:56   ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 14:56 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, linux-mm, linux-kernel, rientjes, dwmw2, baolu.lu, joro,
	will, robin.murphy, iommu

On Fri, Apr 26, 2024 at 03:43:22AM +0000, Pasha Tatashin wrote:
> Since, we are going to update  parent page table entries when lower
> level page tables become emtpy and we add them to the free list.
> We need a way to synchronize the operation.
> 
> Use domain->pgd_lock to protect all map and unmap operations.
> This is reader/writer lock. At the beginning everything is going to be
> read only mode, however, later, when free page table on unmap is added
> we will add a writer section as well.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/iommu/intel/iommu.c | 21 +++++++++++++++++++--
>  drivers/iommu/intel/iommu.h |  3 +++
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 1bfb6eccad05..8c7e596728b5 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -995,11 +995,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
>  				   unsigned long last_pfn,
>  				   int retain_level)
>  {
> +	read_lock(&domain->pgd_lock);

I think no to this.

This is a very performance-sensitive path for the DMA API; we really
do want to see a lockless RCU scheme to manage this overhead here.

This would be fine for a VFIO user, which I guess is your use case.

IMHO it is not a good idea to fiddle around the edges like this. We
need to get the iommu code to have shared algorithms for the radix
tree so we can actually implement something good here and share
it. Every driver has the same problem and needs the same complicated
fix.

I keep threatening to work on that but have yet to start..

Jason



