linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free()
@ 2015-09-22  6:23 Ingo Molnar
  2015-09-22  6:23 ` [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

So this is the somewhat belated latest iteration of the series.
I (think I) fixed all correctness bugs in the code pointed out by Oleg.

The task list walk is still 'dumb', using for_each_process(), as none of
the call sites are performance critical.
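
For reference, the walk that the converted call sites share looks roughly
like this (a sketch distilled from the patches below - there is no actual
common helper yet):

	struct task_struct *g;

	rcu_read_lock();			/* Task list walk */
	spin_lock(&pgd_lock);

	for_each_process(g) {
		struct task_struct *p;

		/* Returns a locked task with a valid ->mm, or NULL: */
		p = find_lock_task_mm(g);
		if (!p)
			continue;

		/* ... inspect or update p->mm->pgd here ... */

		task_unlock(p);
	}

	spin_unlock(&pgd_lock);
	rcu_read_unlock();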

Oleg, can you see any problems with this code?

Background:

Waiman Long reported 'pgd_lock' contention on high CPU count systems, and
proposed moving pgd_lock onto a separate cacheline to eliminate false sharing
and to reduce some of the lock bouncing overhead.
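
A sketch of that alternative, as I understand the proposal (it only pads the
lock onto its own cacheline, without changing any locking rules):

	/* Keep pgd_lock on its own cacheline, to avoid false sharing: */
	spinlock_t pgd_lock ____cacheline_aligned_in_smp =
			__SPIN_LOCK_UNLOCKED(pgd_lock);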

I think we can do much better: this series eliminates the pgd_list and makes
pgd_alloc()/pgd_free() lockless.

Now the lockless initialization of the PGD has a key precondition, which the
initial part of the series implements:

 - no PGD clearing is allowed, only additions. This makes sense, as a single
   PGD entry covers 512 GB of RAM, so the 4K overhead per 0.5 TB of RAM
   mapped is minuscule (see the arithmetic below).
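
The arithmetic behind that claim, using the standard 4-level x86-64
constants:

	/*
	 * PGDIR_SHIFT == 39, so a single PGD entry maps 2^39 bytes == 512 GB.
	 * The (possibly all-empty) PUD table it points to costs:
	 *
	 *	PTRS_PER_PUD * sizeof(pud_t) == 512 * 8 == 4096 bytes
	 *
	 * i.e. one 4K page per 512 GB of address space kept mapped.
	 */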

The patches after that convert existing pgd_list users to walk the task list.

PGD locking is kept intact: coherency guarantees between the CPA, vmalloc,
hotplug, etc. code are unchanged.

The final patches eliminate the pgd_list and thus make pgd_alloc()/pgd_free()
lockless.

The patches have been boot tested on 64-bit and 32-bit x86 systems.

Architectures not making use of the new facility are unaffected.

Thanks,

	Ingo

===
Ingo Molnar (11):
  x86/mm/pat: Don't free PGD entries on memory unmap
  x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm/hotplug: Simplify sync_global_pgds()
  mm: Introduce arch_pgd_init_late()
  x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  x86/mm: Remove pgd_list use from vmalloc_sync_all()
  x86/mm/pat/32: Remove pgd_list use from the PAT code
  x86/mm: Make pgd_alloc()/pgd_free() lockless
  x86/mm: Remove pgd_list leftovers
  x86/mm: Simplify pgd_alloc()

 arch/Kconfig                      |   9 +++
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/pgtable.h    |   3 -
 arch/x86/include/asm/pgtable_64.h |   3 +-
 arch/x86/mm/fault.c               |  32 +++++++---
 arch/x86/mm/init_64.c             |  92 ++++++++++++--------------
 arch/x86/mm/pageattr.c            |  40 ++++++------
 arch/x86/mm/pgtable.c             | 131 +++++++++++++++++++-------------------
 arch/x86/xen/mmu.c                |  45 +++++++++++--
 fs/exec.c                         |   3 +
 include/linux/mm.h                |   6 ++
 kernel/fork.c                     |  16 +++++
 12 files changed, 227 insertions(+), 154 deletions(-)

--
2.1.4



* [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22 17:41   ` Linus Torvalds
  2015-09-22  6:23 ` [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

So when we unmap a kernel memory range, we also check whether the PUD
table has become completely clear - and if so, we free it and clear the
PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory mapped.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 2c44c0792301..b784ed7c9a7e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
 	pgd_t *pgd_entry = root + pgd_index(addr);
 
 	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
 }
 
 static int alloc_pte_page(pmd_t *pmd)
-- 
2.1.4



* [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
  2015-09-22  6:23 ` [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22 17:48   ` Linus Torvalds
  2015-09-22  6:23 ` [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - a list that closely tracks task
creation and destruction via fork()/exit().

But we want to remove this list, so that it does not have to be
maintained from fork()/exit(). So convert the memory hotplug code
to iterate over all PGDs in the system by walking the task list instead.

Also improve the comments a bit, to make this function easier
to understand.

Only lightly tested, as I don't have a memory hotplug setup.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 30564e2752d3..7129e7647a76 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/oom.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -160,8 +161,8 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
  */
 void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 {
@@ -169,29 +170,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 
 	for (address = start; address <= end; address += PGDIR_SIZE) {
 		const pgd_t *pgd_ref = pgd_offset_k(address);
-		struct page *page;
+		struct task_struct *g;
 
 		/*
-		 * When it is called after memory hot remove, pgd_none()
-		 * returns true. In this case (removed == 1), we must clear
-		 * the PGD entries in the local PGD level page.
+		 * When this function is called after memory hot remove,
+		 * pgd_none() already returns true, but only the reference
+		 * kernel PGD has been cleared, not the process PGDs.
+		 *
+		 * So clear the affected entries in every process PGD as well:
 		 */
 		if (pgd_none(*pgd_ref) && !removed)
 			continue;
 
+		rcu_read_lock(); /* Task list walk */
 		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+
+		for_each_process(g) {
+			struct task_struct *p;
+			struct mm_struct *mm;
 			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			p = find_lock_task_mm(g);
+			if (!p)
+				continue;
+
+			mm = p->mm;
+			pgd = mm->pgd;
+
+			/* The pgt_lock is only used by Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
-				BUG_ON(pgd_page_vaddr(*pgd)
-				       != pgd_page_vaddr(*pgd_ref));
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
 			if (removed) {
 				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,8 +214,10 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			}
 
 			spin_unlock(pgt_lock);
+			task_unlock(p);
 		}
 		spin_unlock(&pgd_lock);
+		rcu_read_unlock();
 	}
 }
 
-- 
2.1.4



* [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
  2015-09-22  6:23 ` [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
  2015-09-22  6:23 ` [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-10-06  3:35   ` Kamezawa Hiroyuki
  2016-02-12 19:04   ` Andy Lutomirski
  2015-09-22  6:23 ` [PATCH 04/11] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

So when memory hotplug removes a piece of physical memory from the
pagetable mappings, it also frees the underlying PUD table and clears
the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7129e7647a76..60b0cc3f2819 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -780,27 +780,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -992,7 +971,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -1003,13 +981,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.1.4



* [PATCH 04/11] x86/mm/hotplug: Simplify sync_global_pgds()
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (2 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22  6:23 ` [PATCH 05/11] mm: Introduce arch_pgd_init_late() Ingo Molnar
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

Now that the memory hotplug code does not remove PGD entries anymore,
the only users of sync_global_pgds() use it after extending the
PGD.

So remove the 'removed' parameter and simplify the call sites.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h |  3 +--
 arch/x86/mm/fault.c               |  2 +-
 arch/x86/mm/init_64.c             | 27 ++++++++-------------------
 3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
 	native_set_pgd(pgd, native_make_pgd(0));
 }
 
-extern void sync_global_pgds(unsigned long start, unsigned long end,
-			     int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);
 
 /*
  * Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..f890f5463ac1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -353,7 +353,7 @@ static void dump_pagetable(unsigned long address)
 
 void vmalloc_sync_all(void)
 {
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
 }
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 60b0cc3f2819..467c4f66ded9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -161,10 +161,10 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
  * matching PGD entries in the local PGD level page as well.
  */
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
 
@@ -172,14 +172,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 		const pgd_t *pgd_ref = pgd_offset_k(address);
 		struct task_struct *g;
 
-		/*
-		 * When this function is called after memory hot remove,
-		 * pgd_none() already returns true, but only the reference
-		 * kernel PGD has been cleared, not the process PGDs.
-		 *
-		 * So clear the affected entries in every process PGD as well:
-		 */
-		if (pgd_none(*pgd_ref) && !removed)
+		/* Only sync (potentially) newly added PGD entries: */
+		if (pgd_none(*pgd_ref))
 			continue;
 
 		rcu_read_lock(); /* Task list walk */
@@ -205,13 +199,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
 				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
-			if (removed) {
-				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
-					pgd_clear(pgd);
-			} else {
-				if (pgd_none(*pgd))
-					set_pgd(pgd, *pgd_ref);
-			}
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
 			task_unlock(p);
@@ -646,7 +635,7 @@ kernel_physical_mapping_init(unsigned long start,
 	}
 
 	if (pgd_changed)
-		sync_global_pgds(addr, end - 1, 0);
+		sync_global_pgds(addr, end - 1);
 
 	__flush_tlb_all();
 
@@ -1286,7 +1275,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 	else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
-		sync_global_pgds(start, end - 1, 0);
+		sync_global_pgds(start, end - 1);
 	return err;
 }
 
-- 
2.1.4



* [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (3 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 04/11] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22 17:55   ` Linus Torvalds
  2015-09-22  6:23 ` [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().

The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.

Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.
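
For reviewers, here is the ordering argument the callback relies on, in
sketch form (this is just a summary of the barrier comments in the patch
below, not new code):

	kernel PGD writer			new task (fork()/exec())
	(vmalloc/CPA/hotplug code):

	  set_pgd(init_mm entry);		  tsk->mm = mm;
	  spin_lock(&pgd_lock);			  smp_mb();
	    (full barrier on x86)
	  walk the task list and fix		  copy all non-empty kernel
	  up every visible mm->pgd;		  entries into mm->pgd;

Whatever the interleaving, either the writer's task list walk sees the new
MM and fixes up its PGD, or the new task's copy loop sees the writer's
already updated reference PGD entry. Since PGD entries are only ever added,
never cleared, a non-zero entry can be copied without locking.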

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/Kconfig          |  9 ++++++++
 arch/x86/Kconfig      |  1 +
 arch/x86/mm/init_64.c | 12 +++++++++++
 arch/x86/mm/pgtable.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exec.c             |  3 +++
 include/linux/mm.h    |  6 ++++++
 kernel/fork.c         | 16 ++++++++++++++
 7 files changed, 106 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4e949e58b192..671810ce6fe0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -503,6 +503,15 @@ config PGTABLE_LEVELS
 	int
 	default 2
 
+config ARCH_HAS_PGD_INIT_LATE
+	bool
+	help
+	  Architectures that want a late PGD initialization can define
+	  the arch_pgd_init_late() callback and it will be called
+	  by the generic new task (fork()) code after a new task has
+	  been made visible on the task list, but before it has been
+	  first scheduled.
+
 config ARCH_HAS_ELF_RANDOMIZE
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 328c8352480c..3e97b6cfdb60 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_HAS_PGD_INIT_LATE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_SG_CHAIN
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 467c4f66ded9..429362f8d6ca 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -177,6 +177,18 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 			continue;
 
 		rcu_read_lock(); /* Task list walk */
+
+		/*
+		 * Since this is x86, this spin_lock() is also a full memory barrier that
+		 * is required for correct operation of the lockless reading of PGDs
+		 * in arch_pgd_init_late(). If you ever move this code to another
+		 * architecture or to generic code you need to make sure this is
+		 * an:
+		 *
+		 *	smp_mb();
+		 *
+		 * before looking at PGDs in the loop below.
+		 */
 		spin_lock(&pgd_lock);
 
 		for_each_process(g) {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..c7038b6e51bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,65 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	return NULL;
 }
 
+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
+ * PGD. The reference PGD (swapper_pg_dir) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm)
+{
+	/*
+	 * This function is called after a new MM has been made visible
+	 * in fork() or exec() via:
+	 *
+	 *   tsk->mm = mm;
+	 *
+	 * This barrier makes sure the MM is visible to new RCU
+	 * walkers before we read and initialize the pagetables below,
+	 * so that we don't miss updates:
+	 */
+	smp_mb();
+
+	/*
+	 * If the pgd points to a shared pagetable level (either the
+	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
+	 * references from swapper_pg_dir:
+	 */
+	if ( CONFIG_PGTABLE_LEVELS == 2 ||
+	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+	     CONFIG_PGTABLE_LEVELS == 4) {
+
+		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =        mm->pgd + KERNEL_PGD_BOUNDARY;
+		int i;
+
+		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+			/*
+			 * This is lock-less, so it can race with PGD updates
+			 * coming from vmalloc() or CPA methods, but it's safe,
+			 * because:
+			 *
+			 * 1) this PGD is not in use yet, we have still not
+			 *    scheduled this task.
+			 * 2) we only ever extend PGD entries
+			 *
+			 * So if we observe a non-zero PGD entry we can copy it,
+			 * it won't change from under us. Parallel updates (new
+			 * allocations) will modify our (already visible) PGD:
+			 */
+			if (!pgd_none(*pgd_src))
+				set_pgd(pgd_dst, *pgd_src);
+		}
+	}
+}
+
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..0a77a6991d0e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -866,7 +866,10 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
+
 	tsk->mm = mm;
+	arch_pgd_init_late(mm);
+
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91c08f6f0dc9..8d008dfa9d73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
+#endif
+
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 7d5f0f118a63..4668f8902b19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1606,6 +1606,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * If we have a new PGD then initialize it:
+	 *
+	 * This method is called after a task has been made visible
+	 * on the task list already.
+	 *
+	 * Architectures that manage per task kernel pagetables
+	 * might use this callback to initialize them after they
+	 * are already visible to new updates.
+	 *
+	 * NOTE: any user-space parts of the PGD are already initialized
+	 *       and must not be clobbered.
+	 */
+	if (!(clone_flags & CLONE_VM))
+		arch_pgd_init_late(p->mm);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p, cgrp_ss_priv);
 	if (clone_flags & CLONE_THREAD)
-- 
2.1.4



* [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (4 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 05/11] mm: Introduce arch_pgd_init_late() Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22 17:58   ` Linus Torvalds
  2015-09-22  6:23 ` [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen, this was only build-tested.)

Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 9c479fe40459..96bb4a7a626d 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -45,6 +45,7 @@
 #include <linux/vmalloc.h>
 #include <linux/module.h>
 #include <linux/gfp.h>
+#include <linux/oom.h>
 #include <linux/memblock.h>
 #include <linux/seq_file.h>
 #include <linux/crash_dump.h>
@@ -854,18 +855,34 @@ static void xen_pgd_pin(struct mm_struct *mm)
  */
 void xen_mm_pin_all(void)
 {
-	struct page *page;
+	struct task_struct *g;
 
+	rcu_read_lock(); /* Task list walk */
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process(g) {
+		struct task_struct *p;
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		p = find_lock_task_mm(g);
+		if (!p)
+			continue;
+
+		mm = p->mm;
+		pgd = mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_pin(&init_mm, pgd);
 			SetPageSavePinned(page);
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -968,19 +985,35 @@ static void xen_pgd_unpin(struct mm_struct *mm)
  */
 void xen_mm_unpin_all(void)
 {
-	struct page *page;
+	struct task_struct *g;
 
+	rcu_read_lock(); /* Task list walk */
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process(g) {
+		struct task_struct *p;
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		p = find_lock_task_mm(g);
+		if (!p)
+			continue;
+
+		mm = p->mm;
+		pgd = mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (PageSavePinned(page)) {
 			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_unpin(&init_mm, pgd);
 			ClearPageSavePinned(page);
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
-- 
2.1.4



* [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (5 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22 17:59   ` Linus Torvalds
  2015-09-22  6:23 ` [PATCH 08/11] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs in certain rare cases,
like register_die_notifier().

This use seems to be somewhat questionable, as most other vmalloc
page table fixups are vmalloc_fault() driven, but nevertheless
it's there and it's using the pgd_list.

But we don't need the global list, as we can walk the task list
under RCU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f890f5463ac1..9322d5ad3811 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/oom.h>			/* find_lock_task_mm(), ...	*/
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -237,24 +238,38 @@ void vmalloc_sync_all(void)
 	for (address = VMALLOC_START & PMD_MASK;
 	     address >= TASK_SIZE && address < FIXADDR_TOP;
 	     address += PMD_SIZE) {
-		struct page *page;
 
+		struct task_struct *g;
+
+		rcu_read_lock(); /* Task list walk */
 		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+
+		for_each_process(g) {
+			struct task_struct *p;
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
-			pmd_t *ret;
+			pmd_t *pmd_ret;
+
+			p = find_lock_task_mm(g);
+			if (!p)
+				continue;
 
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			mm = p->mm;
 
+			/* The pgt_lock is only used on Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
-			ret = vmalloc_sync_one(page_address(page), address);
+			pmd_ret = vmalloc_sync_one(mm->pgd, address);
 			spin_unlock(pgt_lock);
 
-			if (!ret)
+			task_unlock(p);
+
+			if (!pmd_ret)
 				break;
 		}
+
 		spin_unlock(&pgd_lock);
+		rcu_read_unlock();
 	}
 }
 
-- 
2.1.4



* [PATCH 08/11] x86/mm/pat/32: Remove pgd_list use from the PAT code
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (6 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22  6:23 ` [PATCH 09/11] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.

This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.

(This code already holds the pgd_lock.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index b784ed7c9a7e..bc7533801014 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -12,6 +12,7 @@
 #include <linux/pfn.h>
 #include <linux/percpu.h>
 #include <linux/gfp.h>
+#include <linux/oom.h>
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 
@@ -438,18 +439,36 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	set_pte_atomic(kpte, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
-		struct page *page;
+		struct task_struct *g;
 
-		list_for_each_entry(page, &pgd_list, lru) {
+		rcu_read_lock(); /* Task list walk */
+
+		for_each_process(g) {
+			struct task_struct *p;
+			struct mm_struct *mm;
+			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			p = find_lock_task_mm(g);
+			if (!p)
+				continue;
+
+			mm = p->mm;
+			pgt_lock = &mm->page_table_lock;
+			spin_lock(pgt_lock);
+
+			pgd = mm->pgd + pgd_index(address);
 			pud = pud_offset(pgd, address);
 			pmd = pmd_offset(pud, address);
 			set_pte_atomic((pte_t *)pmd, pte);
+
+			spin_unlock(pgt_lock);
+
+			task_unlock(p);
 		}
+		rcu_read_unlock();
 	}
 #endif
 }
-- 
2.1.4



* [PATCH 09/11] x86/mm: Make pgd_alloc()/pgd_free() lockless
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (7 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 08/11] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22  6:23 ` [PATCH 10/11] x86/mm: Remove pgd_list leftovers Ingo Molnar
  2015-09-22  6:23 ` [PATCH 11/11] x86/mm: Simplify pgd_alloc() Ingo Molnar
  10 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform-specific code setting up the kernel pagetables.

The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.

The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.

So we can remove the pgd_list addition/removal from this code.

The new PGD is private while constructed, so it needs no extra
locking.
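
The resulting PGD lifecycle, in sketch form (a summary of this patch
combined with the earlier arch_pgd_init_late() patch):

	pgd = pgd_alloc(mm);		/* private - no pgd_lock needed    */
	...
	tsk->mm = mm;			/* MM visible to task-list walkers */
	arch_pgd_init_late(mm);		/* lockless copy of kernel entries */
	...
	pgd_free(mm, pgd);		/* no list removal, no pgd_lock    */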

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++------------------------
 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c7038b6e51bf..8a42d54f44ba 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
 	}
-
-	/* list required to sync kernel mapping updates */
-	if (!SHARED_KERNEL_PMD) {
-		pgd_set_mm(pgd, mm);
-		pgd_list_add(pgd);
-	}
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	spin_lock(&pgd_lock);
-	pgd_list_del(pgd);
-	spin_unlock(&pgd_lock);
 }
 
 /*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * Make sure that pre-populating the pmds is atomic with
-	 * respect to anything walking the pgd_list, so that they
-	 * never see a partially populated pgd.
+	 * No locking is needed here, as the PGD is still private,
+	 * so no code walking the task list and looking at mm->pgd
+	 * will be able to see it before it's fully constructed:
 	 */
-	spin_lock(&pgd_lock);
-
 	pgd_ctor(mm, pgd);
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
-	spin_unlock(&pgd_lock);
-
 	return pgd;
 
 out_free_pmds:
@@ -453,7 +433,6 @@ void arch_pgd_init_late(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
 	_pgd_free(pgd);
 }
-- 
2.1.4



* [PATCH 10/11] x86/mm: Remove pgd_list leftovers
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (8 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 09/11] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  2015-09-22  6:23 ` [PATCH 11/11] x86/mm: Simplify pgd_alloc() Ingo Molnar
  10 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

Nothing uses the pgd_list anymore - remove the list itself and its helpers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |  3 ---
 arch/x86/mm/fault.c            |  1 -
 arch/x86/mm/pgtable.c          | 26 --------------------------
 3 files changed, 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..8338c8175409 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 
 extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9322d5ad3811..546fbca9621d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -189,7 +189,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);
 
 #ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8a42d54f44ba..cb5b8cbcf96b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_del(&page->lru);
-}
-
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
-	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
-	virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
-	return (struct mm_struct *)page->index;
-}
-
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 {
 	/* If the pgd points to a shared pagetable level (either the
-- 
2.1.4



* [PATCH 11/11] x86/mm: Simplify pgd_alloc()
  2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (9 preceding siblings ...)
  2015-09-22  6:23 ` [PATCH 10/11] x86/mm: Remove pgd_list leftovers Ingo Molnar
@ 2015-09-22  6:23 ` Ingo Molnar
  10 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-22  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner

Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.

This is not necessary: it's enough to clear it, as the PGD will
then be properly updated by arch_pgd_init_late().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index cb5b8cbcf96b..ac72e60d5297 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
-	/* If the pgd points to a shared pagetable level (either the
-	   ptes in non-PAE, or shared PMD in PAE), then just copy the
-	   references from swapper_pg_dir. */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
-	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
-		clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
-				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
-				KERNEL_PGD_PTRS);
-	}
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * No locking is needed here, as the PGD is still private,
-	 * so no code walking the task list and looking at mm->pgd
-	 * will be able to see it before it's fully constructed:
+	 * Zero out the kernel portion here - we'll set it up in
+	 * arch_pgd_init_late(), once the PGD is already globally
+	 * visible via the task list, so that it cannot
+	 * miss any updates.
+	 *
+	 * We need to zero it here, to make sure arch_pgd_init_late()
+	 * can initialize them without locking.
 	 */
-	pgd_ctor(mm, pgd);
+	memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
 	return pgd;
-- 
2.1.4



* Re: [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap
  2015-09-22  6:23 ` [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
@ 2015-09-22 17:41   ` Linus Torvalds
  2015-09-22 18:03     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> This complicates PGD management, so don't do this. We can keep the
> PGD mapped and the PUD table all clear - it's only a single 4K page
> per 512 GB of memory mapped.

I'm ok with this just from a "it removes code" standpoint.  That said,
some of the other patches here make me go "hmm". I'll answer them
separately.

                  Linus


* Re: [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-09-22  6:23 ` [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
@ 2015-09-22 17:48   ` Linus Torvalds
  2015-09-23 11:44     ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 17:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> +
> +               for_each_process(g) {
> +                       struct task_struct *p;
> +                       struct mm_struct *mm;
>                         pgd_t *pgd;
>                         spinlock_t *pgt_lock;
>
> +                       p = find_lock_task_mm(g);
> +                       if (!p)
> +                               continue;
> +
> +                       mm = p->mm;

So quite frankly, this is *much* better than the earlier version that
walked over all threads.

However, this now becomes a pattern for the series, and that just makes me think

    "Why is this not a 'for_each_mm()' pattern helper?"

If it only showed up once, that would be one thing. But this
patch series makes it a thing. Which is why I wonder...

                      Linus


* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22  6:23 ` [PATCH 05/11] mm: Introduce arch_pgd_init_late() Ingo Molnar
@ 2015-09-22 17:55   ` Linus Torvalds
  2015-09-22 18:00     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 17:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> Add a late PGD init callback to places that allocate a new MM
> with a new PGD: copy_process() and exec().
>
> The purpose of this callback is to allow architectures to implement
> lockless initialization of task PGDs, to remove the scalability
> limit of pgd_list/pgd_lock.

Do we really need this?

Can't we just initialize the pgd when we allocate it, knowing that
it's not in sync, but just depend on the vmalloc fault to add in any
kernel entries that we might have missed?
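
[ For context: the lazy path referred to here is the existing 64-bit
  vmalloc_fault() handler, which does - roughly, simplified from
  arch/x86/mm/fault.c of this era:

	pgd = pgd_offset(current->active_mm, address);
	pgd_ref = pgd_offset_k(address);

	if (pgd_none(*pgd_ref))
		return -1;			/* a genuine fault */

	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);		/* lazily sync from init_mm */

  i.e. a missing kernel PGD entry gets filled in from the reference kernel
  PGD at page fault time. ]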

I liked the other patches in the series because they remove code and
simplify things. This patch I don't like.

There may be some reason we need it that I missed, and which makes me
go "Duh!" when you tell me. But please do tell me.

               Linus


* Re: [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-09-22  6:23 ` [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
@ 2015-09-22 17:58   ` Linus Torvalds
  2015-09-29  8:44     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> suspend/restore. It's a stop-all method that needs to iterate through
> all allocated pgds in the system to fix them up for Xen's use.

And _this_ is why I'd really like that "for_each_mm()" helper.

Yeah, yeah, maybe it would require syntax like

    for_each_mm (tsk, mm) {
        ...
    } end_for_each_mm(mm);

to do variable allocation things or cleanups (ie "end_for_each_mm()"
might drop the task lock etc), but wouldn't that still be better than
this complex boilerplate thing?
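
[ A minimal sketch of what such a helper pair could look like - purely
  hypothetical, no such helpers exist in the tree at this point:

	#define for_each_mm(g, p, mm)				\
		rcu_read_lock();				\
		for_each_process(g) {				\
			p = find_lock_task_mm(g);		\
			if (!p)					\
				continue;			\
			mm = p->mm;

	#define end_for_each_mm(p)				\
			task_unlock(p);				\
		}						\
		rcu_read_unlock();

  with the task_unlock()/rcu_read_unlock() cleanups folded into the closing
  macro, as suggested above. ]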

                    Linus


* Re: [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-09-22  6:23 ` [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
@ 2015-09-22 17:59   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 17:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> +
> +               for_each_process(g) {
> +                       struct task_struct *p;
> +                       struct mm_struct *mm;
> +
> +                       p = find_lock_task_mm(g);
> +                       if (!p)
> +                               continue;
...
> +                       task_unlock(p);

You know the drill by now..

                Linus


* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22 17:55   ` Linus Torvalds
@ 2015-09-22 18:00     ` Andy Lutomirski
  2015-09-22 18:26       ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-09-22 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 10:55 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> Add a late PGD init callback to places that allocate a new MM
>> with a new PGD: copy_process() and exec().
>>
>> The purpose of this callback is to allow architectures to implement
>> lockless initialization of task PGDs, to remove the scalability
>> limit of pgd_list/pgd_lock.
>
> Do we really need this?
>
> Can't we just initialize the pgd when we allocate it, knowing that
> it's not in sync, but just depend on the vmalloc fault to add in any
> kernel entries that we might have missed?

I really really hate the vmalloc fault thing.  It seems to work,
rather to my surprise.  It doesn't *deserve* to work, because of
things like the percpu TSS accesses in the entry code that happen
without a valid stack.

For all I know, there's a long history of this hitting on monster
non-SMAP systems that are all buggy and rootable but no one notices
because it's rare.  On SMAP with non-malicious userspace, it's an
instant double fault.  With malicious userspace, it's rootable
regardless of SMAP, but it's much harder with SMAP.

If we start every mm with a fully zeroed pgd (which is what I think
you're suggesting), then this starts affecting small systems in
addition to monster systems.

I'd really rather go in the other direction and completely eliminate
vmalloc faults.  We could do that by eagerly initializing all pgd, or
we could do it by tracking, per-pgd, how up-to-date it is and fixing
it up in switch_mm.  The latter is a bit nasty on SMP.
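
[ A rough sketch of that second variant - purely hypothetical, the pgd_gen
  field and the sync_kernel_pgd() helper below are made-up names:

	/* Bumped whenever the kernel half of init_mm.pgd is extended: */
	static atomic_t kernel_pgd_gen;

	/* In switch_mm(), before loading CR3: */
	if (unlikely(next->pgd_gen != atomic_read(&kernel_pgd_gen))) {
		sync_kernel_pgd(next);	/* copy over missing kernel entries */
		next->pgd_gen = atomic_read(&kernel_pgd_gen);
	}

  The SMP nastiness is that two CPUs can enter switch_mm() for the same mm
  concurrently, so the sync would need to be idempotent or locked. ]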

--Andy


* Re: [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap
  2015-09-22 17:41   ` Linus Torvalds
@ 2015-09-22 18:03     ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-09-22 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 10:41 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> This complicates PGD management, so don't do this. We can keep the
>> PGD mapped and the PUD table all clear - it's only a single 4K page
>> per 512 GB of memory mapped.
>
> I'm ok with this just from a "it removes code" standpoint.  That said,
> some of the other patches here make me go "hmm". I'll answer them
> separately.
>

If we want to get rid of vmalloc faults, then this patch makes it much
more obvious that it can be done without hurting performance.

>                   Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22 18:00     ` Andy Lutomirski
@ 2015-09-22 18:26       ` Linus Torvalds
  2015-09-22 18:37         ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 18:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 11:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I really really hate the vmalloc fault thing.  It seems to work,
> rather to my surprise.  It doesn't *deserve* to work, because of
> things like the percpu TSS accesses in the entry code that happen
> without a valid stack.

The thing is, I think you're misguided in your hatred.

The reason I say that is because I think we should just embrace the
fact that faults can and do happen in the kernel in very inconvenient
places, and not just in code we "control".

Even if you get rid of the vmalloc fault, you'll still have debug
faults, and you'll still have NMI's and horrible crazy machine check
faults.

I actually think the vmalloc fault is a good way to just let people
know "pretty much anything can trap, deal with it".

And I think trying to eliminate them is the wrong thing, because it
forces us to be so damn synchronized. This whole patch-series is a
prime example of why that is a bad, bad thing. We want to have _less_
synchronization.

                Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22 18:26       ` Linus Torvalds
@ 2015-09-22 18:37         ` Andy Lutomirski
  2015-09-22 18:44           ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-09-22 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 11:26 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Sep 22, 2015 at 11:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I really really hate the vmalloc fault thing.  It seems to work,
>> rather to my surprise.  It doesn't *deserve* to work, because of
>> things like the percpu TSS accesses in the entry code that happen
>> without a valid stack.
>
> The thing is, I think you're misguided in your hatred.
>
> The reason I say that is because I think we should just embrace the
> fact that faults can and do happen in the kernel in very inconvenient
> places, and not just in code we "control".
>
> Even if you get rid of the vmalloc fault, you'll still have debug
> faults, and you'll still have NMI's and horrible crazy machine check
> faults.
>
> I actually think the vmalloc fault is a good way to just let people
> know "pretty much anything can trap, deal with it".
>
> And I think trying to eliminate them is the wrong thing, because it
> forces us to be so damn synchronized. This whole patch-series is a
> prime example of why that is a bad, bad thing. We want to have _less_
> synchronization.

Sure, pretty much anything can trap, but we need to do *something* to
deal with it.

Debug faults can't happen with bad stacks any more (now that we honor
the kprobe blacklist), which means that debug faults could, in theory,
move off the IST stack.  The SYSENTER + debug mess doesn't have any
stack problem.

NMIs and MCEs are special, and we deal with that using IST and all
kinds of mess.

I don't think that anyone really wants to move #PF to IST, which means
that we simply cannot handle vmalloc faults that happen when switching
stacks after SYSCALL, no matter what fanciness we shove into the
page_fault asm.  If we move #PF to IST, then we have to worry about
page_fault -> nmi -> page_fault, which would be a clusterf*ck.

AMD gave us a pile of misguided architectural turds, and we have to
deal with it.  My preference is to simplify dealing with it by getting
rid of vmalloc faults so that we can at least reliably touch percpu
memory without faulting.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22 18:37         ` Andy Lutomirski
@ 2015-09-22 18:44           ` Linus Torvalds
  2015-09-22 18:52             ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-09-22 18:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 11:37 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> kinds of mess.
>
> I don't think that anyone really wants to move #PF to IST, which means
> that we simply cannot handle vmalloc faults that happen when switching
> stacks after SYSCALL, no matter what fanciness we shove into the
> page_fault asm.

But that's fine. The kernel stack is special.  So yes, we want to make
sure that the kernel stack is always mapped in the thread whose stack
it is.

But that's not a big and onerous guarantee to make. Not when the
*real* problem is "random vmalloc allocations made by other processes
that we are not in the least interested in, and we don't want to add
synchronization for".

                         Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
  2015-09-22 18:44           ` Linus Torvalds
@ 2015-09-22 18:52             ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-09-22 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Tue, Sep 22, 2015 at 11:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Sep 22, 2015 at 11:37 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> kinds of mess.
>>
>> I don't think that anyone really wants to move #PF to IST, which means
>> that we simply cannot handle vmalloc faults that happen when switching
>> stacks after SYSCALL, no matter what fanciness we shove into the
>> page_fault asm.
>
> But that's fine. The kernel stack is special.  So yes, we want to make
> sure that the kernel stack is always mapped in the thread whose stack
> it is.
>
> But that's not a big and onerous guarantee to make. Not when the
> *real* problem is "random vmalloc allocations made by other processes
> that we are not in the least interested in, and we don't want to add
> synchronization for".
>

It's the kernel stack, the TSS (for sp0) and rsp_scratch at least.
But yes, that's not that onerous, and it's never lazily initialized
elsewhere.

How about this (long-term, not right now): Never free pgd entries.
For each pgd, track the number of populated kernel entries.  Also
track the global (init_mm) number of existing kernel entries.  At
context switch time, if new_pgd has fewer entries than the total, sync
it.

This hits *at most* 256 times per thread, and otherwise it's just a
single unlikely branch.  It guarantees that we only ever take a
vmalloc fault when accessing maps that didn't exist when we last
context switched, which gets us all of the important percpu stuff and
the kernel stack, even if we schedule onto a cpu that didn't exist
when the mm was created.
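
In pseudocode, with made-up names (none of this exists today):

	/* global: bumped whenever init_mm gains a kernel PGD entry;
	 * never decremented, since PGD entries are never freed */
	static atomic_t kernel_pgds_populated;

	/* per-mm, in struct mm_struct: int synced_kernel_pgds; */

	/* called from switch_mm() before loading the new %cr3 */
	static inline void check_sync_kernel_pgds(struct mm_struct *next)
	{
		int n = atomic_read(&kernel_pgds_populated);

		if (unlikely(next->synced_kernel_pgds != n)) {
			/* made-up helper: copy entries from init_mm */
			copy_kernel_pgds(next);
			next->synced_kernel_pgds = n;
		}
	}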

--Andy

>                          Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-09-22 17:48   ` Linus Torvalds
@ 2015-09-23 11:44     ` Oleg Nesterov
  2015-09-29  8:42       ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2015-09-23 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Waiman Long,
	Thomas Gleixner

On 09/22, Linus Torvalds wrote:
>
> However, this now becomes a pattern for the series, and that just makes me think
>
>     "Why is this not a 'for_each_mm()' pattern helper?"

And we already have other users. And note that oom_kill_process() does _not_
follow this pattern and that is why it is buggy.

So this is funny, but I was thinking about almost the same thing, something like

	struct task_struct *next_task_with_mm(struct task_struct *p)
	{
		struct task_struct *t;

		p = p->group_leader;
		while ((p = next_task(p)) != &init_task) {
			if (p->flags & PF_KTHREAD)
				continue;

			t = find_lock_task_mm(p);
			if (t)
				return t;
		}

		return NULL;
	}

	#define for_each_task_lock_mm(p)				\
		for (p = &init_task; (p = next_task_with_mm(p)); task_unlock(p))


So that you can do

	for_each_task_lock_mm(p) {
		do_something_with(p->mm);

		if (some_condition()) {
			// UNFORTUNATELY you can't just do "break"
			task_unlock(p);
			break;
		}
	}

do you think it makes sense?


In fact it can't be simpler, we can move task_unlock() into next_task_with_mm(),
it can check ->mm != NULL or p != init_task.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-09-23 11:44     ` Oleg Nesterov
@ 2015-09-29  8:42       ` Ingo Molnar
  2015-09-29 16:51         ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-09-29  8:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Waiman Long,
	Thomas Gleixner


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 09/22, Linus Torvalds wrote:
> >
> > However, this now becomes a pattern for the series, and that just makes me think
> >
> >     "Why is this not a 'for_each_mm()' pattern helper?"
> 
> And we already have other users. And note that oom_kill_process() does _not_
> follow this pattern and that is why it is buggy.
> 
> So this is funny, but I was thinking about almost the same thing, something like
> 
> 	struct task_struct *next_task_with_mm(struct task_struct *p)
> 	{
> 		struct task_struct *t;
> 
> 		p = p->group_leader;
> 		while ((p = next_task(p)) != &init_task) {
> 			if (p->flags & PF_KTHREAD)
> 				continue;
> 
> 			t = find_lock_task_mm(p);
> 			if (t)
> 				return t;
> 		}
> 
> 		return NULL;
> 	}
> 
> 	#define for_each_task_lock_mm(p)				\
> 		for (p = &init_task; (p = next_task_with_mm(p)); task_unlock(p))
> 
> 
> So that you can do
> 
> 	for_each_task_lock_mm(p) {
> 		do_something_with(p->mm);
> 
> 		if (some_condition()) {
> 			// UNFORTUNATELY you can't just do "break"
> 			task_unlock(p);
> 			break;
> 		}
> 	}
> 
> do you think it makes sense?

Sure, I'm inclined to use the above code from you.

> In fact it can't be simpler, we can move task_unlock() into next_task_with_mm(), 
> it can check ->mm != NULL or p != init_task.

s/can't/can ?

But even with that I'm not sure I can parse your suggestion. Got some (pseudo) code
perhaps?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-09-22 17:58   ` Linus Torvalds
@ 2015-09-29  8:44     ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-09-29  8:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Waiman Long,
	Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> > xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> > suspend/restore. It's a stop-all method that needs to iterate through
> > all allocated pgds in the system to fix them up for Xen's use.
> 
> And _this_ is why I'd really like that "for_each_mm()" helper.
> 
> Yeah, yeah, maybe it would require syntax like
> 
>     for_each_mm (tsk, mm) {
>         ...
>     } end_for_each_mm(mm);
> 
> to do variable allocation things or cleanups (ie "end_for_each_mm()" might drop 
> the task lock etc), but wouldn't that still be better than this complex 
> boilerplate thing?

Yeah, agreed absolutely.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-09-29  8:42       ` Ingo Molnar
@ 2015-09-29 16:51         ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2015-09-29 16:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Waiman Long,
	Thomas Gleixner

On 09/29, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > 	struct task_struct *next_task_with_mm(struct task_struct *p)
> > 	{
> > 		struct task_struct *t;
> >
> > 		p = p->group_leader;
> > 		while ((p = next_task(p)) != &init_task) {
> > 			if (p->flags & PF_KTHREAD)
> > 				continue;
> >
> > 			t = find_lock_task_mm(p);
> > 			if (t)
> > 				return t;
> > 		}
> >
> > 		return NULL;
> > 	}
> >
> > 	#define for_each_task_lock_mm(p)				\
> > 		for (p = &init_task; (p = next_task_with_mm(p)); task_unlock(p))
> >
> >
> > So that you can do
> >
> > 	for_each_task_lock_mm(p) {
> > 		do_something_with(p->mm);
> >
> > 		if (some_condition()) {
> > 			// UNFORTUNATELY you can't just do "break"
> > 			task_unlock(p);
> > 			break;
> > 		}
> > 	}
> >
> > do you think it makes sense?
>
> Sure, I'm inclined to use the above code from you.
>
> > In fact it can't be simpler, we can move task_unlock() into next_task_with_mm(),
> > it can check ->mm != NULL or p != init_task.
>
> s/can't/can ?

yes, sorry,

> But even with that I'm not sure I can parse your suggestion. Got some (pseudo) code
> perhaps?

I meant

	struct task_struct *next_task_lock_mm(struct task_struct *p)
	{
		struct task_struct *t;

		if (p) {
			task_unlock(p);
			p = p->group_leader;
		} else {
			p = &init_task;
		}

		while ((p = next_task(p)) != &init_task) {
			if (p->flags & PF_KTHREAD)
				continue;

			t = find_lock_task_mm(p);
			if (t)
				return t;
		}

		return NULL;
	}

	#define for_each_task_lock_mm(p)				\
		for (p = NULL; (p = next_task_lock_mm(p)); )
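
so the caller side becomes just (do_something_with() being a
placeholder)

	struct task_struct *p;

	for_each_task_lock_mm(p) {
		do_something_with(p->mm);	/* p is locked, ->mm stable */
	}

(an early "break" still needs the explicit task_unlock() first, same
as before).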

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2015-09-22  6:23 ` [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
@ 2015-10-06  3:35   ` Kamezawa Hiroyuki
  2016-02-12 19:04   ` Andy Lutomirski
  1 sibling, 0 replies; 32+ messages in thread
From: Kamezawa Hiroyuki @ 2015-10-06  3:35 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel, linux-mm, Ishimatsu,
	Yasuaki/石松 靖章,
	Tang Chen
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Waiman Long, Thomas Gleixner, Izumi,
	Taku/泉 拓

On 2015/09/22 15:23, Ingo Molnar wrote:
> So when memory hotplug removes a piece of physical memory from pagetable
> mappings, it also frees the underlying PGD entry.
> 
> This complicates PGD management, so don't do this. We can keep the
> PGD mapped and the PUD table all clear - it's only a single 4K page
> per 512 GB of memory hotplugged.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ishimatsu-san, Tang-san, please check.

Doesn't this patch affect the issues addressed by these commits?

 5255e0a79fcc0ff47b387af92bd9ef5729b1b859
 9661d5bcd058fe15b4138a00d96bd36516134543

-Kame

> ---
>   arch/x86/mm/init_64.c | 27 ---------------------------
>   1 file changed, 27 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 7129e7647a76..60b0cc3f2819 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -780,27 +780,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>   	spin_unlock(&init_mm.page_table_lock);
>   }
>   
> -/* Return true if pgd is changed, otherwise return false. */
> -static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> -{
> -	pud_t *pud;
> -	int i;
> -
> -	for (i = 0; i < PTRS_PER_PUD; i++) {
> -		pud = pud_start + i;
> -		if (pud_val(*pud))
> -			return false;
> -	}
> -
> -	/* free a pud table */
> -	free_pagetable(pgd_page(*pgd), 0);
> -	spin_lock(&init_mm.page_table_lock);
> -	pgd_clear(pgd);
> -	spin_unlock(&init_mm.page_table_lock);
> -
> -	return true;
> -}
> -
>   static void __meminit
>   remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>   		 bool direct)
> @@ -992,7 +971,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
>   	unsigned long addr;
>   	pgd_t *pgd;
>   	pud_t *pud;
> -	bool pgd_changed = false;
>   
>   	for (addr = start; addr < end; addr = next) {
>   		next = pgd_addr_end(addr, end);
> @@ -1003,13 +981,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
>   
>   		pud = (pud_t *)pgd_page_vaddr(*pgd);
>   		remove_pud_table(pud, addr, next, direct);
> -		if (free_pud_table(pud, pgd))
> -			pgd_changed = true;
>   	}
>   
> -	if (pgd_changed)
> -		sync_global_pgds(start, end - 1, 1);
> -
>   	flush_tlb_all();
>   }
>   
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2015-09-22  6:23 ` [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
  2015-10-06  3:35   ` Kamezawa Hiroyuki
@ 2016-02-12 19:04   ` Andy Lutomirski
  2016-03-10  6:45     ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2016-02-12 19:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> So when memory hotplug removes a piece of physical memory from pagetable
> mappings, it also frees the underlying PGD entry.
>
> This complicates PGD management, so don't do this. We can keep the
> PGD mapped and the PUD table all clear - it's only a single 4K page
> per 512 GB of memory hotplugged.

Resurrecting an ancient thread: I want this particular change to make
it (much) easier to make vmapped stacks work correctly.  Could it be
applied by itself?

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-02-12 19:04   ` Andy Lutomirski
@ 2016-03-10  6:45     ` Andy Lutomirski
  2016-03-10  9:56       ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2016-03-10  6:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Fri, Feb 12, 2016 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> So when memory hotplug removes a piece of physical memory from pagetable
>> mappings, it also frees the underlying PGD entry.
>>
>> This complicates PGD management, so don't do this. We can keep the
>> PGD mapped and the PUD table all clear - it's only a single 4K page
>> per 512 GB of memory hotplugged.
>
> Resurrecting an ancient thread: I want this particular change to make
> it (much) easier to make vmapped stacks work correctly.  Could it be
> applied by itself?
>

It's incomplete.  pageattr.c has another instance of the same thing.
I'll see if I can make it work, but I may end up doing something a
little different.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-03-10  6:45     ` Andy Lutomirski
@ 2016-03-10  9:56       ` Ingo Molnar
  2016-03-11  1:52         ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2016-03-10  9:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Waiman Long, Thomas Gleixner


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Fri, Feb 12, 2016 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >> So when memory hotplug removes a piece of physical memory from pagetable
> >> mappings, it also frees the underlying PGD entry.
> >>
> >> This complicates PGD management, so don't do this. We can keep the
> >> PGD mapped and the PUD table all clear - it's only a single 4K page
> >> per 512 GB of memory hotplugged.
> >
> > Resurrecting an ancient thread: I want this particular change to make
> > it (much) easier to make vmapped stacks work correctly.  Could it be
> > applied by itself?
> >
> 
> It's incomplete.  pageattr.c has another instance of the same thing.
> I'll see if I can make it work, but I may end up doing something a
> little different.

If so then mind picking up (and fixing ;-) tip:WIP.x86/mm in its entirety? It's 
well tested so shouldn't have too many easy to hit bugs. Feel free to rebase and 
restructure it, it's a WIP tree.

I keep getting distracted with other things but I'd hate if this got dropped on 
the floor.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-03-10  9:56       ` Ingo Molnar
@ 2016-03-11  1:52         ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2016-03-11  1:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Waiman Long, Thomas Gleixner

On Thu, Mar 10, 2016 at 1:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Fri, Feb 12, 2016 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> >> So when memory hotplug removes a piece of physical memory from pagetable
>> >> mappings, it also frees the underlying PGD entry.
>> >>
>> >> This complicates PGD management, so don't do this. We can keep the
>> >> PGD mapped and the PUD table all clear - it's only a single 4K page
>> >> per 512 GB of memory hotplugged.
>> >
>> > Resurrecting an ancient thread: I want this particular change to make
>> > it (much) easier to make vmapped stacks work correctly.  Could it be
>> > applied by itself?
>> >
>>
>> It's incomplete.  pageattr.c has another instance of the same thing.
>> I'll see if I can make it work, but I may end up doing something a
>> little different.
>
> If so then mind picking up (and fixing ;-) tip:WIP.x86/mm in its entirety? It's
> well tested so shouldn't have too many easy to hit bugs. Feel free to rebase and
> restructure it, it's a WIP tree.

I'll chew on this one patch a bit and see where the whole thing goes.
If I can rebase the rest on top, I'll use them.

BTW, how are current kernels possibly correct when this code runs?  We
zap a pgd entry from the init pgd.  I can't find any code that would
try to propagate that zapped entry to other pgds.  Then, if we hotplug
in some more memory or claim the slot for vmap, we'll install a new
pgd entry, and we might access *that* through a different pgd.  The
vmalloc fault fixup won't help there, because the MMU will chase a
stale pointer in the old pgd.

So we might actually need this patch sooner rather than later.

>
> I keep getting distracted with other things but I'd hate if this got dropped on
> the floor.
>
> Thanks,
>
>         Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2015-09-22  6:23 [PATCH 00/11] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-09-22  6:23 ` [PATCH 01/11] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
2015-09-22 17:41   ` Linus Torvalds
2015-09-22 18:03     ` Andy Lutomirski
2015-09-22  6:23 ` [PATCH 02/11] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
2015-09-22 17:48   ` Linus Torvalds
2015-09-23 11:44     ` Oleg Nesterov
2015-09-29  8:42       ` Ingo Molnar
2015-09-29 16:51         ` Oleg Nesterov
2015-09-22  6:23 ` [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
2015-10-06  3:35   ` Kamezawa Hiroyuki
2016-02-12 19:04   ` Andy Lutomirski
2016-03-10  6:45     ` Andy Lutomirski
2016-03-10  9:56       ` Ingo Molnar
2016-03-11  1:52         ` Andy Lutomirski
2015-09-22  6:23 ` [PATCH 04/11] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
2015-09-22  6:23 ` [PATCH 05/11] mm: Introduce arch_pgd_init_late() Ingo Molnar
2015-09-22 17:55   ` Linus Torvalds
2015-09-22 18:00     ` Andy Lutomirski
2015-09-22 18:26       ` Linus Torvalds
2015-09-22 18:37         ` Andy Lutomirski
2015-09-22 18:44           ` Linus Torvalds
2015-09-22 18:52             ` Andy Lutomirski
2015-09-22  6:23 ` [PATCH 06/11] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
2015-09-22 17:58   ` Linus Torvalds
2015-09-29  8:44     ` Ingo Molnar
2015-09-22  6:23 ` [PATCH 07/11] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
2015-09-22 17:59   ` Linus Torvalds
2015-09-22  6:23 ` [PATCH 08/11] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
2015-09-22  6:23 ` [PATCH 09/11] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
2015-09-22  6:23 ` [PATCH 10/11] x86/mm: Remove pgd_list leftovers Ingo Molnar
2015-09-22  6:23 ` [PATCH 11/11] x86/mm: Simplify pgd_alloc() Ingo Molnar
