* [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings()
@ 2020-05-15 14:00 Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 1/7] mm: Add functions to track page directory modifications Joerg Roedel
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

Hi,

here is the updated version of this series with these
changes:

	- Removed sync_current_stack_to_mm() too.

	- Added Acked-by's from Andy Lutomirski

The previous versions can be found here:

	v1: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/

	v2: https://lore.kernel.org/lkml/20200513152137.32426-1-joro@8bytes.org/

The cover-letter of v1 has more details on the motivation
for this patch-set.

Please review.

Regards,

	Joerg

Joerg Roedel (7):
  mm: Add functions to track page directory modifications
  mm/vmalloc: Track which page-table levels were modified
  mm/ioremap: Track which page-table levels were modified
  x86/mm/64: Implement arch_sync_kernel_mappings()
  x86/mm/32: Implement arch_sync_kernel_mappings()
  mm: Remove vmalloc_sync_(un)mappings()
  x86/mm: Remove vmalloc faulting

 arch/x86/include/asm/pgtable-2level_types.h |   2 +
 arch/x86/include/asm/pgtable-3level_types.h |   2 +
 arch/x86/include/asm/pgtable_64_types.h     |   2 +
 arch/x86/include/asm/switch_to.h            |  23 ---
 arch/x86/kernel/setup_percpu.c              |   6 +-
 arch/x86/mm/fault.c                         | 176 +-------------------
 arch/x86/mm/init_64.c                       |   5 +
 arch/x86/mm/pti.c                           |   8 +-
 arch/x86/mm/tlb.c                           |  37 ----
 drivers/acpi/apei/ghes.c                    |   6 -
 include/asm-generic/5level-fixup.h          |   5 +-
 include/asm-generic/pgtable.h               |  23 +++
 include/linux/mm.h                          |  46 +++++
 include/linux/vmalloc.h                     |  18 +-
 kernel/notifier.c                           |   1 -
 kernel/trace/trace.c                        |  12 --
 lib/ioremap.c                               |  46 +++--
 mm/nommu.c                                  |  12 --
 mm/vmalloc.c                                | 109 +++++++-----
 19 files changed, 204 insertions(+), 335 deletions(-)

-- 
2.17.1



* [PATCH v3 1/7] mm: Add functions to track page directory modifications
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-06-05 10:08   ` Catalin Marinas
  2020-05-15 14:00 ` [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified Joerg Roedel
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Add page-table allocation functions which will keep track of changed
directory entries. They are needed for new PGD, P4D, PUD, and PMD
entries and will be used in vmalloc and ioremap code to decide whether
any changes in the kernel mappings need to be synchronized between
page-tables in the system.
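
All the new helpers follow the same pattern; as an illustration, the
PUD-level variant from the diff below allocates the lower-level table
as before and additionally records in the caller-provided mask which
level gained a new entry (comments added here for orientation):

	static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
					     unsigned long address,
					     pgtbl_mod_mask *mod_mask)
	{
		if (unlikely(p4d_none(*p4d))) {
			if (__pud_alloc(mm, p4d, address))
				return NULL;
			/* A new PUD page was installed into this P4D entry. */
			*mod_mask |= PGTBL_P4D_MODIFIED;
		}

		return pud_offset(p4d, address);
	}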

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 include/asm-generic/5level-fixup.h |  5 ++--
 include/asm-generic/pgtable.h      | 23 +++++++++++++++
 include/linux/mm.h                 | 46 ++++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index 4c74b1c1d13b..58046ddc08d0 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -17,8 +17,9 @@
 	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
 		NULL : pud_offset(p4d, address))
 
-#define p4d_alloc(mm, pgd, address)	(pgd)
-#define p4d_offset(pgd, start)		(pgd)
+#define p4d_alloc(mm, pgd, address)		(pgd)
+#define p4d_alloc_track(mm, pgd, address, mask)	(pgd)
+#define p4d_offset(pgd, start)			(pgd)
 
 #ifndef __ASSEMBLY__
 static inline int p4d_none(p4d_t p4d)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 329b8c8ca703..bf1418ae91a2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1209,6 +1209,29 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+/*
+ * Page Table Modification bits for pgtbl_mod_mask.
+ *
+ * These are used by the p?d_alloc_track*() set of functions and in the generic
+ * vmalloc/ioremap code to track at which page-table levels entries have been
+ * modified. Based on that the code can better decide when vmalloc and ioremap
+ * mapping changes need to be synchronized to other page-tables in the system.
+ */
+#define		__PGTBL_PGD_MODIFIED	0
+#define		__PGTBL_P4D_MODIFIED	1
+#define		__PGTBL_PUD_MODIFIED	2
+#define		__PGTBL_PMD_MODIFIED	3
+#define		__PGTBL_PTE_MODIFIED	4
+
+#define		PGTBL_PGD_MODIFIED	BIT(__PGTBL_PGD_MODIFIED)
+#define		PGTBL_P4D_MODIFIED	BIT(__PGTBL_P4D_MODIFIED)
+#define		PGTBL_PUD_MODIFIED	BIT(__PGTBL_PUD_MODIFIED)
+#define		PGTBL_PMD_MODIFIED	BIT(__PGTBL_PMD_MODIFIED)
+#define		PGTBL_PTE_MODIFIED	BIT(__PGTBL_PTE_MODIFIED)
+
+/* Page-Table Modification Mask */
+typedef unsigned int pgtbl_mod_mask;
+
 #endif /* !__ASSEMBLY__ */
 
 #ifndef io_remap_pfn_range
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..022fe682af9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,13 +2078,54 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
 	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
 		NULL : pud_offset(p4d, address);
 }
+
+static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
+				     unsigned long address,
+				     pgtbl_mod_mask *mod_mask)
+
+{
+	if (unlikely(pgd_none(*pgd))) {
+		if (__p4d_alloc(mm, pgd, address))
+			return NULL;
+		*mod_mask |= PGTBL_PGD_MODIFIED;
+	}
+
+	return p4d_offset(pgd, address);
+}
+
 #endif /* !__ARCH_HAS_5LEVEL_HACK */
 
+static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
+				     unsigned long address,
+				     pgtbl_mod_mask *mod_mask)
+{
+	if (unlikely(p4d_none(*p4d))) {
+		if (__pud_alloc(mm, p4d, address))
+			return NULL;
+		*mod_mask |= PGTBL_P4D_MODIFIED;
+	}
+
+	return pud_offset(p4d, address);
+}
+
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 {
 	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
 		NULL: pmd_offset(pud, address);
 }
+
+static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
+				     unsigned long address,
+				     pgtbl_mod_mask *mod_mask)
+{
+	if (unlikely(pud_none(*pud))) {
+		if (__pmd_alloc(mm, pud, address))
+			return NULL;
+		*mod_mask |= PGTBL_PUD_MODIFIED;
+	}
+
+	return pmd_offset(pud, address);
+}
 #endif /* CONFIG_MMU */
 
 #if USE_SPLIT_PTE_PTLOCKS
@@ -2200,6 +2241,11 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
 
+#define pte_alloc_kernel_track(pmd, address, mask)			\
+	((unlikely(pmd_none(*(pmd))) &&					\
+	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+		NULL: pte_offset_kernel(pmd, address))
+
 #if USE_SPLIT_PMD_PTLOCKS
 
 static struct page *pmd_to_page(pmd_t *pmd)
-- 
2.17.1



* [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 1/7] mm: Add functions to track page directory modifications Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 20:01   ` Andrew Morton
  2020-05-15 14:00 ` [PATCH v3 3/7] mm/ioremap: " Joerg Roedel
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Track at which page-table levels entries were modified by vmap/vunmap.
After the page-table has been modified, use that information to decide
whether the new arch_sync_kernel_mappings() needs to be called.
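
In condensed form (the full hunks are below), the resulting flow in
vmap_page_range_noflush() accumulates a pgtbl_mod_mask during the walk
and only calls arch_sync_kernel_mappings() when a level the
architecture cares about was written:

	static int vmap_page_range_noflush(unsigned long start, unsigned long end,
					   pgprot_t prot, struct page **pages)
	{
		unsigned long addr = start, next;
		pgd_t *pgd = pgd_offset_k(addr);
		pgtbl_mod_mask mask = 0;
		int err, nr = 0;

		do {
			next = pgd_addr_end(addr, end);
			err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
			if (err)
				return err;
		} while (pgd++, addr = next, addr != end);

		/* Only architectures that set ARCH_PAGE_TABLE_SYNC_MASK pay for this. */
		if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
			arch_sync_kernel_mappings(start, end);

		return nr;
	}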

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/vmalloc.h | 16 ++++++++
 mm/vmalloc.c            | 88 ++++++++++++++++++++++++++++++-----------
 2 files changed, 80 insertions(+), 24 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index a95d3cc74d79..c80bdb8a6b55 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -144,6 +144,22 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 void vmalloc_sync_mappings(void);
 void vmalloc_sync_unmappings(void);
 
+/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). The
+ * compiler is relied upon to optimize the calls out when
+ * ARCH_PAGE_TABLE_SYNC_MASK is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
 /*
  *	Lowlevel-APIs (not for driver use!)
  */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9a8227afa073..184f5a556cf7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -69,7 +69,8 @@ static void free_work(struct work_struct *w)
 
 /*** Page table manipulation functions ***/
 
-static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 
@@ -78,73 +79,104 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
 		pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
 		WARN_ON(!pte_none(ptent) && !pte_present(ptent));
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
 }
 
-static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	int cleared;
 
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_clear_huge(pmd))
+
+		cleared = pmd_clear_huge(pmd);
+		if (cleared || pmd_bad(*pmd))
+			*mask |= PGTBL_PMD_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		vunmap_pte_range(pmd, addr, next);
+		vunmap_pte_range(pmd, addr, next, mask);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
+	int cleared;
 
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_clear_huge(pud))
+
+		cleared = pud_clear_huge(pud);
+		if (cleared || pud_bad(*pud))
+			*mask |= PGTBL_PUD_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		vunmap_pmd_range(pud, addr, next);
+		vunmap_pmd_range(pud, addr, next, mask);
 	} while (pud++, addr = next, addr != end);
 }
 
-static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
+	int cleared;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
-		if (p4d_clear_huge(p4d))
+
+		cleared = p4d_clear_huge(p4d);
+		if (cleared || p4d_bad(*p4d))
+			*mask |= PGTBL_P4D_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		vunmap_pud_range(p4d, addr, next);
+		vunmap_pud_range(p4d, addr, next, mask);
 	} while (p4d++, addr = next, addr != end);
 }
 
-static void vunmap_page_range(unsigned long addr, unsigned long end)
+static void vunmap_page_range(unsigned long start, unsigned long end)
 {
 	pgd_t *pgd;
+	unsigned long addr = start;
 	unsigned long next;
+	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
+	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
+		if (pgd_bad(*pgd))
+			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		vunmap_p4d_range(pgd, addr, next);
+		vunmap_p4d_range(pgd, addr, next, &mask);
 	} while (pgd++, addr = next, addr != end);
+
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 
@@ -153,7 +185,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * callers keep track of where we're up to.
 	 */
 
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel_track(pmd, addr, mask);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -166,55 +198,59 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
 		(*nr)++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
 	if (!pmd)
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc_track(&init_mm, p4d, addr, mask);
 	if (!pud)
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -234,16 +270,20 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 	int err = 0;
 	int nr = 0;
+	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
 
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
+
 	return nr;
 }
 
-- 
2.17.1



* [PATCH v3 3/7] mm/ioremap: Track which page-table levels were modified
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 1/7] mm: Add functions to track page directory modifications Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 4/7] x86/mm/64: Implement arch_sync_kernel_mappings() Joerg Roedel
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Track at which page-table levels entries were modified by
ioremap_page_range(). After the page-table has been modified, use that
information to decide whether the new arch_sync_kernel_mappings()
needs to be called. The iounmap path re-uses vunmap(), which has
already been taken care of.
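
The PMD level shows the pattern; condensed from the hunk below,
installing a huge mapping writes the PMD entry itself, so that level is
marked as modified before the PTE-level walk is skipped:

	static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
			unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
			pgtbl_mod_mask *mask)
	{
		pmd_t *pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
		unsigned long next;

		if (!pmd)
			return -ENOMEM;
		do {
			next = pmd_addr_end(addr, end);

			if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
				/* Huge entry written directly at the PMD level. */
				*mask |= PGTBL_PMD_MODIFIED;
				continue;
			}

			if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
				return -ENOMEM;
		} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
		return 0;
	}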

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 lib/ioremap.c | 46 +++++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/lib/ioremap.c b/lib/ioremap.c
index 3f0e18543de8..ad485f08173b 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -61,13 +61,14 @@ static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 	u64 pfn;
 
 	pfn = phys_addr >> PAGE_SHIFT;
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel_track(pmd, addr, mask);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -75,6 +76,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
 
@@ -101,21 +103,24 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
 }
 
 static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
 	if (!pmd)
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
 
-		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot))
+		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+			*mask |= PGTBL_PMD_MODIFIED;
 			continue;
+		}
 
-		if (ioremap_pte_range(pmd, addr, next, phys_addr, prot))
+		if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
@@ -144,21 +149,24 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
 }
 
 static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc_track(&init_mm, p4d, addr, mask);
 	if (!pud)
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
 
-		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot))
+		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+			*mask |= PGTBL_PUD_MODIFIED;
 			continue;
+		}
 
-		if (ioremap_pmd_range(pud, addr, next, phys_addr, prot))
+		if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
@@ -187,21 +195,24 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
 }
 
 static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
 
-		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot))
+		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+			*mask |= PGTBL_P4D_MODIFIED;
 			continue;
+		}
 
-		if (ioremap_pud_range(p4d, addr, next, phys_addr, prot))
+		if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
@@ -214,6 +225,7 @@ int ioremap_page_range(unsigned long addr,
 	unsigned long start;
 	unsigned long next;
 	int err;
+	pgtbl_mod_mask mask = 0;
 
 	might_sleep();
 	BUG_ON(addr >= end);
@@ -222,13 +234,17 @@ int ioremap_page_range(unsigned long addr,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot);
+		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
+					&mask);
 		if (err)
 			break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
 	flush_cache_vmap(start, end);
 
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
+
 	return err;
 }
 
-- 
2.17.1



* [PATCH v3 4/7] x86/mm/64: Implement arch_sync_kernel_mappings()
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (2 preceding siblings ...)
  2020-05-15 14:00 ` [PATCH v3 3/7] mm/ioremap: " Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 5/7] x86/mm/32: " Joerg Roedel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.
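
Roughly speaking, the only changes that need to be propagated on x86-64
are writes to the shared top level: with 5-level paging a new P4D page
shows up as a modified PGD entry, and with 4-level paging (where the
P4D is folded into the PGD) a new PUD page shows up as a modified P4D
entry. The implementation therefore reduces to the two pieces below
(see the hunks for the exact context):

	#define ARCH_PAGE_TABLE_SYNC_MASK \
		(pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)

	void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
	{
		/* Copy the affected top-level entries into all PGDs in the system. */
		sync_global_pgds(start, end);
	}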

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 2 ++
 arch/x86/mm/init_64.c                   | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..8f63efb2a2cc 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -159,4 +159,6 @@ extern unsigned int ptrs_per_p4d;
 
 #define PGD_KERNEL_START	((PAGE_SIZE / 2) / sizeof(pgd_t))
 
+#define ARCH_PAGE_TABLE_SYNC_MASK	(pgtable_l5_enabled() ?	PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
 #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3b289c2f75cd..541af8e5bcd4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -217,6 +217,11 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		sync_global_pgds_l4(start, end);
 }
 
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+	sync_global_pgds(start, end);
+}
+
 /*
  * NOTE: This function is marked __ref because it calls __init function
  * (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
-- 
2.17.1



* [PATCH v3 5/7] x86/mm/32: Implement arch_sync_kernel_mappings()
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (3 preceding siblings ...)
  2020-05-15 14:00 ` [PATCH v3 4/7] x86/mm/64: Implement arch_sync_kernel_mappings() Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 6/7] mm: Remove vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.
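
On 32-bit the unit of synchronization is the PMD: with 2-level paging
the PMD is folded into the page directory, so a newly allocated PTE
page modifies a top-level entry, and with 3-level PAE the kernel PMDs
are either shared across all page-tables (nothing to sync) or, with
page-table isolation enabled, private to each one. The masks chosen by
the two headers below reflect that (condensed from the hunks):

	/* pgtable-2level_types.h: any new PTE page writes a top-level entry. */
	#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PMD_MODIFIED

	/* pgtable-3level_types.h: sync only when kernel PMDs are not shared. */
	#define ARCH_PAGE_TABLE_SYNC_MASK	(SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED)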

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/pgtable-2level_types.h |  2 ++
 arch/x86/include/asm/pgtable-3level_types.h |  2 ++
 arch/x86/mm/fault.c                         | 25 +++++++++++++--------
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 6deb6cd236e3..7f6ccff0ba72 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -20,6 +20,8 @@ typedef union {
 
 #define SHARED_KERNEL_PMD	0
 
+#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PMD_MODIFIED
+
 /*
  * traditional i386 two-level paging structure:
  */
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 33845d36897c..80fbb4a9ed87 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -27,6 +27,8 @@ typedef union {
 #define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))
 #endif
 
+#define ARCH_PAGE_TABLE_SYNC_MASK	(SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED)
+
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..edeb2adaf31f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -190,16 +190,13 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	return pmd_k;
 }
 
-static void vmalloc_sync(void)
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 {
-	unsigned long address;
-
-	if (SHARED_KERNEL_PMD)
-		return;
+	unsigned long addr;
 
-	for (address = VMALLOC_START & PMD_MASK;
-	     address >= TASK_SIZE_MAX && address < VMALLOC_END;
-	     address += PMD_SIZE) {
+	for (addr = start & PMD_MASK;
+	     addr >= TASK_SIZE_MAX && addr < VMALLOC_END;
+	     addr += PMD_SIZE) {
 		struct page *page;
 
 		spin_lock(&pgd_lock);
@@ -210,13 +207,23 @@ static void vmalloc_sync(void)
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 
 			spin_lock(pgt_lock);
-			vmalloc_sync_one(page_address(page), address);
+			vmalloc_sync_one(page_address(page), addr);
 			spin_unlock(pgt_lock);
 		}
 		spin_unlock(&pgd_lock);
 	}
 }
 
+static void vmalloc_sync(void)
+{
+	unsigned long address;
+
+	if (SHARED_KERNEL_PMD)
+		return;
+
+	arch_sync_kernel_mappings(VMALLOC_START, VMALLOC_END);
+}
+
 void vmalloc_sync_mappings(void)
 {
 	vmalloc_sync();
-- 
2.17.1



* [PATCH v3 6/7] mm: Remove vmalloc_sync_(un)mappings()
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (4 preceding siblings ...)
  2020-05-15 14:00 ` [PATCH v3 5/7] x86/mm/32: " Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 14:00 ` [PATCH v3 7/7] x86/mm: Remove vmalloc faulting Joerg Roedel
  2020-05-15 14:16 ` [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Peter Zijlstra
  7 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

These functions are not needed anymore because the vmalloc and ioremap
mappings are now synchronized when they are created or torn down.

Remove all callers and function definitions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c      | 37 -------------------------------------
 drivers/acpi/apei/ghes.c |  6 ------
 include/linux/vmalloc.h  |  2 --
 kernel/notifier.c        |  1 -
 kernel/trace/trace.c     | 12 ------------
 mm/nommu.c               | 12 ------------
 mm/vmalloc.c             | 21 ---------------------
 7 files changed, 91 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index edeb2adaf31f..255fc631b042 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -214,26 +214,6 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 	}
 }
 
-static void vmalloc_sync(void)
-{
-	unsigned long address;
-
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	arch_sync_kernel_mappings(VMALLOC_START, VMALLOC_END);
-}
-
-void vmalloc_sync_mappings(void)
-{
-	vmalloc_sync();
-}
-
-void vmalloc_sync_unmappings(void)
-{
-	vmalloc_sync();
-}
-
 /*
  * 32-bit:
  *
@@ -336,23 +316,6 @@ static void dump_pagetable(unsigned long address)
 
 #else /* CONFIG_X86_64: */
 
-void vmalloc_sync_mappings(void)
-{
-	/*
-	 * 64-bit mappings might allocate new p4d/pud pages
-	 * that need to be propagated to all tasks' PGDs.
-	 */
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
-}
-
-void vmalloc_sync_unmappings(void)
-{
-	/*
-	 * Unmappings never allocate or free p4d/pud pages.
-	 * No work is required here.
-	 */
-}
-
 /*
  * 64-bit:
  *
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 24c9642e8fc7..aabe9c5ee515 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -167,12 +167,6 @@ int ghes_estatus_pool_init(int num_ghes)
 	if (!addr)
 		goto err_pool_alloc;
 
-	/*
-	 * New allocation must be visible in all pgd before it can be found by
-	 * an NMI allocating from the pool.
-	 */
-	vmalloc_sync_mappings();
-
 	rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
 	if (rc)
 		goto err_pool_add;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c80bdb8a6b55..82c9a2ddcfa9 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -141,8 +141,6 @@ extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
-void vmalloc_sync_mappings(void);
-void vmalloc_sync_unmappings(void);
 
 /*
  * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 5989bbb93039..84c987dfbe03 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -519,7 +519,6 @@ NOKPROBE_SYMBOL(notify_die);
 
 int register_die_notifier(struct notifier_block *nb)
 {
-	vmalloc_sync_mappings();
 	return atomic_notifier_chain_register(&die_chain, nb);
 }
 EXPORT_SYMBOL_GPL(register_die_notifier);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 29615f15a820..f12e99b387b2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8527,18 +8527,6 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
 	allocate_snapshot = false;
 #endif
 
-	/*
-	 * Because of some magic with the way alloc_percpu() works on
-	 * x86_64, we need to synchronize the pgd of all the tables,
-	 * otherwise the trace events that happen in x86_64 page fault
-	 * handlers can't cope with accessing the chance that a
-	 * alloc_percpu()'d memory might be touched in the page fault trace
-	 * event. Oh, and we need to audit all other alloc_percpu() and vmalloc()
-	 * calls in tracing, because something might get triggered within a
-	 * page fault trace event!
-	 */
-	vmalloc_sync_mappings();
-
 	return 0;
 }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 318df4e236c9..b4267e1471f3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -369,18 +369,6 @@ void vm_unmap_aliases(void)
 }
 EXPORT_SYMBOL_GPL(vm_unmap_aliases);
 
-/*
- * Implement a stub for vmalloc_sync_[un]mapping() if the architecture
- * chose not to have one.
- */
-void __weak vmalloc_sync_mappings(void)
-{
-}
-
-void __weak vmalloc_sync_unmappings(void)
-{
-}
-
 struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes)
 {
 	BUG();
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 184f5a556cf7..901540e4773b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1332,12 +1332,6 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	if (unlikely(valist == NULL))
 		return false;
 
-	/*
-	 * First make sure the mappings are removed from all page-tables
-	 * before they are freed.
-	 */
-	vmalloc_sync_unmappings();
-
 	/*
 	 * TODO: to calculate a flush range without looping.
 	 * The list can be up to lazy_max_pages() elements.
@@ -3177,21 +3171,6 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
-/*
- * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
- * not to have one.
- *
- * The purpose of this function is to make sure the vmalloc area
- * mappings are identical in all page-tables in the system.
- */
-void __weak vmalloc_sync_mappings(void)
-{
-}
-
-void __weak vmalloc_sync_unmappings(void)
-{
-}
-
 static int f(pte_t *pte, unsigned long addr, void *data)
 {
 	pte_t ***p = data;
-- 
2.17.1



* [PATCH v3 7/7] x86/mm: Remove vmalloc faulting
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (5 preceding siblings ...)
  2020-05-15 14:00 ` [PATCH v3 6/7] mm: Remove vmalloc_sync_(un)mappings() Joerg Roedel
@ 2020-05-15 14:00 ` Joerg Roedel
  2020-05-15 14:16 ` [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Peter Zijlstra
  7 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-15 14:00 UTC (permalink / raw)
  To: x86
  Cc: hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, joro, linux-kernel,
	linux-acpi, linux-arch, linux-mm

From: Joerg Roedel <jroedel@suse.de>

Remove fault handling on vmalloc areas, as the vmalloc code now takes
care of synchronizing changes to all page-tables in the system.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/switch_to.h |  23 ------
 arch/x86/kernel/setup_percpu.c   |   6 +-
 arch/x86/mm/fault.c              | 134 -------------------------------
 arch/x86/mm/pti.c                |   8 +-
 arch/x86/mm/tlb.c                |  37 ---------
 5 files changed, 4 insertions(+), 204 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 0e059b73437b..9f69cc497f4b 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -12,27 +12,6 @@ struct task_struct *__switch_to_asm(struct task_struct *prev,
 __visible struct task_struct *__switch_to(struct task_struct *prev,
 					  struct task_struct *next);
 
-/* This runs runs on the previous thread's stack. */
-static inline void prepare_switch_to(struct task_struct *next)
-{
-#ifdef CONFIG_VMAP_STACK
-	/*
-	 * If we switch to a stack that has a top-level paging entry
-	 * that is not present in the current mm, the resulting #PF will
-	 * will be promoted to a double-fault and we'll panic.  Probe
-	 * the new stack now so that vmalloc_fault can fix up the page
-	 * tables if needed.  This can only happen if we use a stack
-	 * in vmap space.
-	 *
-	 * We assume that the stack is aligned so that it never spans
-	 * more than one top-level paging entry.
-	 *
-	 * To minimize cache pollution, just follow the stack pointer.
-	 */
-	READ_ONCE(*(unsigned char *)next->thread.sp);
-#endif
-}
-
 asmlinkage void ret_from_fork(void);
 
 /*
@@ -67,8 +46,6 @@ struct fork_frame {
 
 #define switch_to(prev, next, last)					\
 do {									\
-	prepare_switch_to(next);					\
-									\
 	((last) = __switch_to_asm((prev), (next)));			\
 } while (0)
 
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e6d7894ad127..fd945ce78554 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -287,9 +287,9 @@ void __init setup_per_cpu_areas(void)
 	/*
 	 * Sync back kernel address range again.  We already did this in
 	 * setup_arch(), but percpu data also needs to be available in
-	 * the smpboot asm.  We can't reliably pick up percpu mappings
-	 * using vmalloc_fault(), because exception dispatch needs
-	 * percpu data.
+	 * the smpboot asm and arch_sync_kernel_mappings() doesn't sync to
+	 * swapper_pg_dir on 32-bit. The per-cpu mappings need to be available
+	 * there too.
 	 *
 	 * FIXME: Can the later sync in setup_cpu_entry_areas() replace
 	 * this call?
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 255fc631b042..dffe8e4d3140 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -214,44 +214,6 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 	}
 }
 
-/*
- * 32-bit:
- *
- *   Handle a fault on the vmalloc or module mapping area
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
-	unsigned long pgd_paddr;
-	pmd_t *pmd_k;
-	pte_t *pte_k;
-
-	/* Make sure we are in vmalloc area: */
-	if (!(address >= VMALLOC_START && address < VMALLOC_END))
-		return -1;
-
-	/*
-	 * Synchronize this task's top level page-table
-	 * with the 'reference' page table.
-	 *
-	 * Do _not_ use "current" here. We might be inside
-	 * an interrupt in the middle of a task switch..
-	 */
-	pgd_paddr = read_cr3_pa();
-	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
-	if (!pmd_k)
-		return -1;
-
-	if (pmd_large(*pmd_k))
-		return 0;
-
-	pte_k = pte_offset_kernel(pmd_k, address);
-	if (!pte_present(*pte_k))
-		return -1;
-
-	return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
 /*
  * Did it hit the DOS screen memory VA from vm86 mode?
  */
@@ -316,79 +278,6 @@ static void dump_pagetable(unsigned long address)
 
 #else /* CONFIG_X86_64: */
 
-/*
- * 64-bit:
- *
- *   Handle a fault on the vmalloc area
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
-	pgd_t *pgd, *pgd_k;
-	p4d_t *p4d, *p4d_k;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-
-	/* Make sure we are in vmalloc area: */
-	if (!(address >= VMALLOC_START && address < VMALLOC_END))
-		return -1;
-
-	/*
-	 * Copy kernel mappings over when needed. This can also
-	 * happen within a race in page table update. In the later
-	 * case just flush:
-	 */
-	pgd = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address);
-	pgd_k = pgd_offset_k(address);
-	if (pgd_none(*pgd_k))
-		return -1;
-
-	if (pgtable_l5_enabled()) {
-		if (pgd_none(*pgd)) {
-			set_pgd(pgd, *pgd_k);
-			arch_flush_lazy_mmu_mode();
-		} else {
-			BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_k));
-		}
-	}
-
-	/* With 4-level paging, copying happens on the p4d level. */
-	p4d = p4d_offset(pgd, address);
-	p4d_k = p4d_offset(pgd_k, address);
-	if (p4d_none(*p4d_k))
-		return -1;
-
-	if (p4d_none(*p4d) && !pgtable_l5_enabled()) {
-		set_p4d(p4d, *p4d_k);
-		arch_flush_lazy_mmu_mode();
-	} else {
-		BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_k));
-	}
-
-	BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4);
-
-	pud = pud_offset(p4d, address);
-	if (pud_none(*pud))
-		return -1;
-
-	if (pud_large(*pud))
-		return 0;
-
-	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
-		return -1;
-
-	if (pmd_large(*pmd))
-		return 0;
-
-	pte = pte_offset_kernel(pmd, address);
-	if (!pte_present(*pte))
-		return -1;
-
-	return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
 #ifdef CONFIG_CPU_SUP_AMD
 static const char errata93_warning[] =
 KERN_ERR 
@@ -1227,29 +1116,6 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	 */
 	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
-	/*
-	 * We can fault-in kernel-space virtual memory on-demand. The
-	 * 'reference' page table is init_mm.pgd.
-	 *
-	 * NOTE! We MUST NOT take any locks for this case. We may
-	 * be in an interrupt or a critical region, and should
-	 * only copy the information from the master page table,
-	 * nothing more.
-	 *
-	 * Before doing this on-demand faulting, ensure that the
-	 * fault is not any of the following:
-	 * 1. A fault on a PTE with a reserved bit set.
-	 * 2. A fault caused by a user-mode access.  (Do not demand-
-	 *    fault kernel memory due to user-mode accesses).
-	 * 3. A fault caused by a page-level protection violation.
-	 *    (A demand fault would be on a non-present page which
-	 *     would have X86_PF_PROT==0).
-	 */
-	if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
-		if (vmalloc_fault(address) >= 0)
-			return;
-	}
-
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 843aa10a4cb6..da0fb17a1a36 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -448,13 +448,7 @@ static void __init pti_clone_user_shared(void)
 		 * the sp1 and sp2 slots.
 		 *
 		 * This is done for all possible CPUs during boot to ensure
-		 * that it's propagated to all mms.  If we were to add one of
-		 * these mappings during CPU hotplug, we would need to take
-		 * some measure to make sure that every mm that subsequently
-		 * ran on that CPU would have the relevant PGD entry in its
-		 * pagetables.  The usual vmalloc_fault() mechanism would not
-		 * work for page faults taken in entry_SYSCALL_64 before RSP
-		 * is set up.
+		 * that it's propagated to all mms.
 		 */
 
 		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 66f96f21a7b6..f3fe261e5936 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -161,34 +161,6 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
-static void sync_current_stack_to_mm(struct mm_struct *mm)
-{
-	unsigned long sp = current_stack_pointer;
-	pgd_t *pgd = pgd_offset(mm, sp);
-
-	if (pgtable_l5_enabled()) {
-		if (unlikely(pgd_none(*pgd))) {
-			pgd_t *pgd_ref = pgd_offset_k(sp);
-
-			set_pgd(pgd, *pgd_ref);
-		}
-	} else {
-		/*
-		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
-		 * the p4d.  This compiles to approximately the same code as
-		 * the 5-level case.
-		 */
-		p4d_t *p4d = p4d_offset(pgd, sp);
-
-		if (unlikely(p4d_none(*p4d))) {
-			pgd_t *pgd_ref = pgd_offset_k(sp);
-			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
-
-			set_p4d(p4d, *p4d_ref);
-		}
-	}
-}
-
 static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
@@ -377,15 +349,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 */
 		cond_ibpb(tsk);
 
-		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
-			/*
-			 * If our current stack is in vmalloc space and isn't
-			 * mapped in the new pgd, we'll double-fault.  Forcibly
-			 * map it.
-			 */
-			sync_current_stack_to_mm(next);
-		}
-
 		/*
 		 * Stop remote flushes for the previous mm.
 		 * Skip kernel threads; we never send init_mm TLB flushing IPIs,
-- 
2.17.1



* Re: [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings()
  2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
                   ` (6 preceding siblings ...)
  2020-05-15 14:00 ` [PATCH v3 7/7] x86/mm: Remove vmalloc faulting Joerg Roedel
@ 2020-05-15 14:16 ` Peter Zijlstra
  7 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2020-05-15 14:16 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Dave Hansen, Andy Lutomirski, rjw, Arnd Bergmann,
	Andrew Morton, Steven Rostedt, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Joerg Roedel, linux-kernel, linux-acpi,
	linux-arch, linux-mm

On Fri, May 15, 2020 at 04:00:16PM +0200, Joerg Roedel wrote:
> Joerg Roedel (7):
>   mm: Add functions to track page directory modifications
>   mm/vmalloc: Track which page-table levels were modified
>   mm/ioremap: Track which page-table levels were modified
>   x86/mm/64: Implement arch_sync_kernel_mappings()
>   x86/mm/32: Implement arch_sync_kernel_mappings()
>   mm: Remove vmalloc_sync_(un)mappings()
>   x86/mm: Remove vmalloc faulting
> 
>  arch/x86/include/asm/pgtable-2level_types.h |   2 +
>  arch/x86/include/asm/pgtable-3level_types.h |   2 +
>  arch/x86/include/asm/pgtable_64_types.h     |   2 +
>  arch/x86/include/asm/switch_to.h            |  23 ---
>  arch/x86/kernel/setup_percpu.c              |   6 +-
>  arch/x86/mm/fault.c                         | 176 +-------------------
>  arch/x86/mm/init_64.c                       |   5 +
>  arch/x86/mm/pti.c                           |   8 +-
>  arch/x86/mm/tlb.c                           |  37 ----
>  drivers/acpi/apei/ghes.c                    |   6 -
>  include/asm-generic/5level-fixup.h          |   5 +-
>  include/asm-generic/pgtable.h               |  23 +++
>  include/linux/mm.h                          |  46 +++++
>  include/linux/vmalloc.h                     |  18 +-
>  kernel/notifier.c                           |   1 -
>  kernel/trace/trace.c                        |  12 --
>  lib/ioremap.c                               |  46 +++--
>  mm/nommu.c                                  |  12 --
>  mm/vmalloc.c                                | 109 +++++++-----
>  19 files changed, 204 insertions(+), 335 deletions(-)

I'm thinking this improves the status-quo, so:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Like Andy, I think I'd like x86_64 to pre-populate, but that can easily
be done on top and should not hold this back.


* Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-15 14:00 ` [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified Joerg Roedel
@ 2020-05-15 20:01   ` Andrew Morton
  2020-05-16 12:56     ` Joerg Roedel
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2020-05-15 20:01 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Steven Rostedt, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Joerg Roedel, linux-kernel, linux-acpi,
	linux-arch, linux-mm, Christoph Hellwig

On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <joro@8bytes.org> wrote:

> Track at which levels in the page-table entries were modified by
> vmap/vunmap. After the page-table has been modified, use that
> information do decide whether the new arch_sync_kernel_mappings()
> needs to be called.

Lots of collisions here with Christoph's "decruft the vmalloc API" series
(http://lkml.kernel.org/r/20200414131348.444715-1-hch@lst.de).

I attempted to fix things up.

unmap_kernel_range_noflush() needed to be redone.

map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?


From: Joerg Roedel <jroedel@suse.de>
Subject: mm/vmalloc: Track which page-table levels were modified

Track at which page-table levels entries were modified by vmap/vunmap.
After the page-table has been modified, use that information to decide
whether the new arch_sync_kernel_mappings() needs to be called.

Link: http://lkml.kernel.org/r/20200515140023.25469-3-joro@8bytes.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |   16 ++++++
 mm/vmalloc.c            |   91 +++++++++++++++++++++++++++-----------
 2 files changed, 81 insertions(+), 26 deletions(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-track-which-page-table-levels-were-modified
+++ a/include/linux/vmalloc.h
@@ -134,6 +134,22 @@ void vmalloc_sync_mappings(void);
 void vmalloc_sync_unmappings(void);
 
 /*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). The
+ * compiler is relied upon to optimize the calls out when
+ * ARCH_PAGE_TABLE_SYNC_MASK is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
+/*
  *	Lowlevel-APIs (not for driver use!)
  */
 
--- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified
+++ a/mm/vmalloc.c
@@ -69,7 +69,8 @@ static void free_work(struct work_struct
 
 /*** Page table manipulation functions ***/
 
-static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 
@@ -78,59 +79,81 @@ static void vunmap_pte_range(pmd_t *pmd,
 		pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
 		WARN_ON(!pte_none(ptent) && !pte_present(ptent));
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
 }
 
-static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	int cleared;
 
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_clear_huge(pmd))
+
+		cleared = pmd_clear_huge(pmd);
+		if (cleared || pmd_bad(*pmd))
+			*mask |= PGTBL_PMD_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		vunmap_pte_range(pmd, addr, next);
+		vunmap_pte_range(pmd, addr, next, mask);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
+	int cleared;
 
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_clear_huge(pud))
+
+		cleared = pud_clear_huge(pud);
+		if (cleared || pud_bad(*pud))
+			*mask |= PGTBL_PUD_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		vunmap_pmd_range(pud, addr, next);
+		vunmap_pmd_range(pud, addr, next, mask);
 	} while (pud++, addr = next, addr != end);
 }
 
-static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			     pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
+	int cleared;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
-		if (p4d_clear_huge(p4d))
+
+		cleared = p4d_clear_huge(p4d);
+		if (cleared || p4d_bad(*p4d))
+			*mask |= PGTBL_P4D_MODIFIED;
+
+		if (cleared)
 			continue;
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		vunmap_pud_range(p4d, addr, next);
+		vunmap_pud_range(p4d, addr, next, mask);
 	} while (p4d++, addr = next, addr != end);
 }
 
 /**
  * unmap_kernel_range_noflush - unmap kernel VM area
- * @addr: start of the VM area to unmap
+ * @start: start of the VM area to unmap
  * @size: size of the VM area to unmap
  *
  * Unmap PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify
@@ -141,24 +164,33 @@ static void vunmap_p4d_range(pgd_t *pgd,
  * for calling flush_cache_vunmap() on to-be-mapped areas before calling this
  * function and flush_tlb_kernel_range() after.
  */
-void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
+void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 {
-	unsigned long end = addr + size;
+	unsigned long end = start + size;
 	unsigned long next;
 	pgd_t *pgd;
+	unsigned long addr = start;
+	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
+	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
+		if (pgd_bad(*pgd))
+			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		vunmap_p4d_range(pgd, addr, next);
+		vunmap_p4d_range(pgd, addr, next, &mask);
 	} while (pgd++, addr = next, addr != end);
+
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 
@@ -167,7 +199,7 @@ static int vmap_pte_range(pmd_t *pmd, un
 	 * callers keep track of where we're up to.
 	 */
 
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel_track(pmd, addr, mask);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -180,55 +212,59 @@ static int vmap_pte_range(pmd_t *pmd, un
 		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
 		(*nr)++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
 	if (!pmd)
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc_track(&init_mm, p4d, addr, mask);
 	if (!pud)
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+		pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -260,12 +296,15 @@ int map_kernel_range_noflush(unsigned lo
 	pgd_t *pgd;
 	int err = 0;
 	int nr = 0;
+	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+		if (pgd_bad(*pgd))
+			mask |= PGTBL_PGD_MODIFIED;
+		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
_
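
For context, a minimal sketch of how an architecture consumes the tracking
added above: it defines ARCH_PAGE_TABLE_SYNC_MASK to the PGTBL_*_MODIFIED
bits it must react to and provides arch_sync_kernel_mappings() for the
modified range. The body below is illustrative only and assumes a helper
along the lines of x86's sync_global_pgds(); the actual hooks are added by
the x86/mm patches later in this series.

	/* Illustrative sketch, not part of this patch. */
	#define ARCH_PAGE_TABLE_SYNC_MASK	(PGTBL_PGD_MODIFIED | PGTBL_P4D_MODIFIED)

	void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
	{
		/*
		 * Propagate the top-level kernel page-table entries touched
		 * in [start, end] to all page tables in the system, e.g. via
		 * a helper like x86-64's sync_global_pgds().
		 */
		sync_global_pgds(start, end);
	}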


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-15 20:01   ` Andrew Morton
@ 2020-05-16 12:56     ` Joerg Roedel
  2020-05-18 22:18       ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2020-05-16 12:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Christoph Hellwig

Hi Andrew,

On Fri, May 15, 2020 at 01:01:42PM -0700, Andrew Morton wrote:
> On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> Lots of collisions here with Christoph's "decruft the vmalloc API" series
> (http://lkml.kernel.org/r/20200414131348.444715-1-hch@lst.de).
> 
> I attempted to fix things up.
> 
> unmap_kernel_range_noflush() needed to be redone.
> 
> map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?

Yes, map_kernel_range_noflush() needs the arch_sync_kernel_mappings()
call as well.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-16 12:56     ` Joerg Roedel
@ 2020-05-18 22:18       ` Andrew Morton
  2020-05-19 15:25         ` Joerg Roedel
  2020-05-25 10:05         ` Joerg Roedel
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2020-05-18 22:18 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Christoph Hellwig

On Sat, 16 May 2020 14:56:41 +0200 Joerg Roedel <jroedel@suse.de> wrote:

> Hi Andrew,
> 
> On Fri, May 15, 2020 at 01:01:42PM -0700, Andrew Morton wrote:
> > On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > Lots of collisions here with Christoph's "decruft the vmalloc API" series
> > (http://lkml.kernel.org/r/20200414131348.444715-1-hch@lst.de).
> > 
> > I attempted to fix things up.
> > 
> > unmap_kernel_range_noflush() needed to be redone.
> > 
> > map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?
> 
> Yes, map_kernel_range_noflush() needs the arch_sync_kernel_mappings()
> call as well.
> 

This?

--- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified-fix
+++ a/mm/vmalloc.c
@@ -309,6 +309,9 @@ int map_kernel_range_noflush(unsigned lo
 			return err;
 	} while (pgd++, addr = next, addr != end);
 
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
+
 	return 0;
 }
 

It would be nice to get all this (ie, linux-next) retested before we
send it upstream, please.  
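
For reference, combining this hunk with the mask/start handling from the
earlier hunks gives roughly the following final shape of
map_kernel_range_noflush() -- a sketch that assumes the (addr, size, prot,
pages) signature from Christoph's vmalloc decruft series and that start
records the original address the way unmap_kernel_range_noflush() does:

	int map_kernel_range_noflush(unsigned long addr, unsigned long size,
				     pgprot_t prot, struct page **pages)
	{
		unsigned long start = addr;	/* unmodified base for the sync call */
		unsigned long end = addr + size;
		unsigned long next;
		pgd_t *pgd;
		int err = 0;
		int nr = 0;
		pgtbl_mod_mask mask = 0;

		BUG_ON(addr >= end);
		pgd = pgd_offset_k(addr);
		do {
			next = pgd_addr_end(addr, end);
			if (pgd_bad(*pgd))
				mask |= PGTBL_PGD_MODIFIED;
			err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
			if (err)
				return err;
		} while (pgd++, addr = next, addr != end);

		if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
			arch_sync_kernel_mappings(start, end);

		return 0;
	}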

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-18 22:18       ` Andrew Morton
@ 2020-05-19 15:25         ` Joerg Roedel
  2020-05-25 10:05         ` Joerg Roedel
  1 sibling, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-19 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Christoph Hellwig

On Mon, May 18, 2020 at 03:18:28PM -0700, Andrew Morton wrote:
> On Sat, 16 May 2020 14:56:41 +0200 Joerg Roedel <jroedel@suse.de> wrote:
> --- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified-fix
> +++ a/mm/vmalloc.c
> @@ -309,6 +309,9 @@ int map_kernel_range_noflush(unsigned lo
>  			return err;
>  	} while (pgd++, addr = next, addr != end);
>  
> +	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> +		arch_sync_kernel_mappings(start, end);
> +
>  	return 0;
>  }

Yes, this is the right call.

> It would be nice to get all this (ie, linux-next) retested before we
> send it upstream, please.

Will do and report back.


Thanks,

	Joerg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified
  2020-05-18 22:18       ` Andrew Morton
  2020-05-19 15:25         ` Joerg Roedel
@ 2020-05-25 10:05         ` Joerg Roedel
  1 sibling, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2020-05-25 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Christoph Hellwig

On Mon, May 18, 2020 at 03:18:28PM -0700, Andrew Morton wrote:
> It would be nice to get all this (ie, linux-next) retested before we
> send it upstream, please.

Tested these patches as in next-20200522, with all three of 2-, 3-, and
4-level paging. On 4-level (aka 64-bit) I also re-ran Steven's trace-cmd
reproducer; no issues found, and the trace-cmd problem is still fixed.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications
  2020-05-15 14:00 ` [PATCH v3 1/7] mm: Add functions to track page directory modifications Joerg Roedel
@ 2020-06-05 10:08   ` Catalin Marinas
  2020-06-05 11:46     ` Catalin Marinas
  0 siblings, 1 reply; 18+ messages in thread
From: Catalin Marinas @ 2020-06-05 10:08 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Will Deacon

Hi Joerg,

On Fri, May 15, 2020 at 04:00:17PM +0200, Joerg Roedel wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..022fe682af9e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2078,13 +2078,54 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
>  	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
>  		NULL : pud_offset(p4d, address);
>  }
> +
> +static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
> +				     unsigned long address,
> +				     pgtbl_mod_mask *mod_mask)
> +
> +{
> +	if (unlikely(pgd_none(*pgd))) {
> +		if (__p4d_alloc(mm, pgd, address))
> +			return NULL;
> +		*mod_mask |= PGTBL_PGD_MODIFIED;
> +	}
> +
> +	return p4d_offset(pgd, address);
> +}
> +
>  #endif /* !__ARCH_HAS_5LEVEL_HACK */
>  
> +static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
> +				     unsigned long address,
> +				     pgtbl_mod_mask *mod_mask)
> +{
> +	if (unlikely(p4d_none(*p4d))) {
> +		if (__pud_alloc(mm, p4d, address))
> +			return NULL;
> +		*mod_mask |= PGTBL_P4D_MODIFIED;
> +	}
> +
> +	return pud_offset(p4d, address);
> +}

This patch causes a kernel panic on arm64 (and possibly powerpc, I
haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
checks for the empty p4d with pgd_none() instead of p4d_none().
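
For illustration, a minimal sketch of why the new helper never allocates
the pud under __ARCH_HAS_5LEVEL_HACK (the exact 5level-fixup.h macro bodies
may differ slightly; the point is that the p4d level is folded into the pgd
there):

	/* With __ARCH_HAS_5LEVEL_HACK the p4d level is folded away, roughly: */
	#define p4d_none(p4d)		0	/* a p4d never looks empty */

	/* so in pud_alloc_track() ... */
		if (unlikely(p4d_none(*p4d))) {		/* always false */
			if (__pud_alloc(mm, p4d, address))
				return NULL;
			*mod_mask |= PGTBL_P4D_MODIFIED;
		}
	/*
	 * ... __pud_alloc() is never called for a missing pud, while the
	 * 5level-fixup pud_alloc() tests pgd_none(*p4d) and does allocate.
	 * The missing table is what panics the kernel during boot.
	 */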

The patch below fixes it:

-----------------------8<-----------------------------------
From 8fb5f4c1e0983bb501c8ec71f9bb9dd344b80e87 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Fri, 5 Jun 2020 10:56:03 +0100
Subject: [PATCH] mm: Use pgd_none() in pud_alloc_track() with
 __ARCH_HAS_5LEVEL_HACK

Commit d8626138009b ("mm: add functions to track page directory
modifications") introduced the pud_alloc_track() function checking for
an empty p4d using p4d_none(). However, when __ARCH_HAS_5LEVEL_HACK is
defined, the pud_alloc() counterpart checks for an empty p4d using
pgd_none(). Since p4d_none() is always 0 in this case, no pud would be
allocated and the kernel panics during boot on arm64 (at least).

Until all architectures are moved away from the 5level-fixup.h, define a
pud_alloc_track() that matches the __ARCH_HAS_5LEVEL_HACK pud_alloc().

Fixes: d8626138009b ("mm: add functions to track page directory modifications")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fda41eb7f1c8..9d3761a1fad5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2106,8 +2106,6 @@ static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
 	return p4d_offset(pgd, address);
 }
 
-#endif /* !__ARCH_HAS_5LEVEL_HACK */
-
 static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
 				     unsigned long address,
 				     pgtbl_mod_mask *mod_mask)
@@ -2121,6 +2119,23 @@ static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
 	return pud_offset(p4d, address);
 }
 
+#else /* __ARCH_HAS_5LEVEL_HACK */
+
+static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
+				     unsigned long address,
+				     pgtbl_mod_mask *mod_mask)
+{
+	if (unlikely(pgd_none(*p4d))) {
+		if (__pud_alloc(mm, p4d, address))
+			return NULL;
+		*mod_mask |= PGTBL_P4D_MODIFIED;
+	}
+
+	return pud_offset(p4d, address);
+}
+
+#endif /* !__ARCH_HAS_5LEVEL_HACK */
+
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 {
 	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications
  2020-06-05 10:08   ` Catalin Marinas
@ 2020-06-05 11:46     ` Catalin Marinas
  2020-06-06  1:01       ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Catalin Marinas @ 2020-06-05 11:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Dave Hansen, Andy Lutomirski, Peter Zijlstra, rjw,
	Arnd Bergmann, Andrew Morton, Steven Rostedt, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Joerg Roedel, linux-kernel,
	linux-acpi, linux-arch, linux-mm, Will Deacon

On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> This patch causes a kernel panic on arm64 (and possibly powerpc, I
> haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> checks for the empty p4d with pgd_none() instead of p4d_none().

Ah, should have checked the list first
(https://lore.kernel.org/linux-mm/20200604074446.23944-1-joro@8bytes.org/).

Please ignore my patch then.

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications
  2020-06-05 11:46     ` Catalin Marinas
@ 2020-06-06  1:01       ` Andrew Morton
  2020-06-06 12:50         ` Catalin Marinas
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2020-06-06  1:01 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Joerg Roedel,
	linux-kernel, linux-acpi, linux-arch, linux-mm, Will Deacon

On Fri, 5 Jun 2020 12:46:55 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> > This patch causes a kernel panic on arm64 (and possibly powerpc, I
> > haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> > checks for the empty p4d with pgd_none() instead of p4d_none().
> 
> Ah, should have checked the list first
> (https://lore.kernel.org/linux-mm/20200604074446.23944-1-joro@8bytes.org/).
> 

Sorry about that.

Can you confirm that the merge of Mike's 5level_hack zappage has fixed
the arm64 wontboot?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications
  2020-06-06  1:01       ` Andrew Morton
@ 2020-06-06 12:50         ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2020-06-06 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, x86, hpa, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, rjw, Arnd Bergmann, Steven Rostedt,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Joerg Roedel,
	linux-kernel, linux-acpi, linux-arch, linux-mm, Will Deacon

On Fri, Jun 05, 2020 at 06:01:16PM -0700, Andrew Morton wrote:
> On Fri, 5 Jun 2020 12:46:55 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> > > This patch causes a kernel panic on arm64 (and possibly powerpc, I
> > > haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> > > checks for the empty p4d with pgd_none() instead of p4d_none().
> > 
> > Ah, should have checked the list first
> > (https://lore.kernel.org/linux-mm/20200604074446.23944-1-joro@8bytes.org/).
> 
> Can you confirm that the merge of Mike's 5level_hack zappage has fixed
> the arm64 wontboot?

Yes, it has. Mainline is booting again on arm64.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-06-06 12:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-15 14:00 [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 1/7] mm: Add functions to track page directory modifications Joerg Roedel
2020-06-05 10:08   ` Catalin Marinas
2020-06-05 11:46     ` Catalin Marinas
2020-06-06  1:01       ` Andrew Morton
2020-06-06 12:50         ` Catalin Marinas
2020-05-15 14:00 ` [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified Joerg Roedel
2020-05-15 20:01   ` Andrew Morton
2020-05-16 12:56     ` Joerg Roedel
2020-05-18 22:18       ` Andrew Morton
2020-05-19 15:25         ` Joerg Roedel
2020-05-25 10:05         ` Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 3/7] mm/ioremap: " Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 4/7] x86/mm/64: Implement arch_sync_kernel_mappings() Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 5/7] x86/mm/32: " Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 6/7] mm: Remove vmalloc_sync_(un)mappings() Joerg Roedel
2020-05-15 14:00 ` [PATCH v3 7/7] x86/mm: Remove vmalloc faulting Joerg Roedel
2020-05-15 14:16 ` [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings() Peter Zijlstra
