All of lore.kernel.org
* [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()
@ 2015-06-11 14:07 Ingo Molnar
  2015-06-11 14:07 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
                   ` (13 more replies)
  0 siblings, 14 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Waiman Long reported 'pgd_lock' contention on high CPU count systems and 
proposed moving pgd_lock to a separate cacheline to eliminate false sharing 
and to reduce some of the lock bouncing overhead.

I think we can do much better: this series eliminates the pgd_list and makes 
pgd_alloc()/pgd_free() lockless.

Now the lockless initialization of the PGD has a key precondition, which the 
initial part of the series implements:

 - no PGD clearing is allowed, only additions. This makes sense, as a single PGD 
   entry covers 512 GB of RAM, so the 4K overhead per 0.5 TB of RAM mapped is 
   minuscule.

The patches after that convert existing pgd_list users to walk the task list.

PGD locking is kept intact: coherency guarantees between the CPA, vmalloc, 
hotplug, etc. code are unchanged.

The final patches eliminate the pgd_list and thus make pgd_alloc()/pgd_free() 
lockless.

The patches have been boot tested on 64-bit and 32-bit x86 systems.

Architectures not making use of the new facility are unaffected.

Thanks,

	Ingo

=====
Ingo Molnar (12):
  x86/mm/pat: Don't free PGD entries on memory unmap
  x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm/hotplug: Simplify sync_global_pgds()
  mm: Introduce arch_pgd_init_late()
  x86/mm: Enable and use the arch_pgd_init_late() method
  x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  x86/mm: Remove pgd_list use from vmalloc_sync_all()
  x86/mm/pat/32: Remove pgd_list use from the PAT code
  x86/mm: Make pgd_alloc()/pgd_free() lockless
  x86/mm: Remove pgd_list leftovers
  x86/mm: Simplify pgd_alloc()

 arch/Kconfig                      |   9 ++++
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/pgtable.h    |   3 --
 arch/x86/include/asm/pgtable_64.h |   3 +-
 arch/x86/mm/fault.c               |  24 +++++++----
 arch/x86/mm/init_64.c             |  73 +++++++++++----------------------
 arch/x86/mm/pageattr.c            |  34 ++++++++--------
 arch/x86/mm/pgtable.c             | 129 +++++++++++++++++++++++++++++-----------------------------
 arch/x86/xen/mmu.c                |  34 +++++++++++++---
 fs/exec.c                         |   3 ++
 include/linux/mm.h                |   6 +++
 kernel/fork.c                     |  16 ++++++++
 12 files changed, 183 insertions(+), 152 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when we unmap a kernel memory range, we also check whether a PUD
table has become completely clear, and if so, free it and clear the
PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD entry mapped and the PUD table all clear - it's only a single 4K
page per 512 GB of memory mapped.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 70d221fe2eb4..997fc97e9072 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
 	pgd_t *pgd_entry = root + pgd_index(addr);
 
 	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
 }
 
 static int alloc_pte_page(pmd_t *pmd)
-- 
2.1.4



* [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
  2015-06-11 14:07 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-12 22:22   ` Oleg Nesterov
                     ` (2 more replies)
  2015-06-11 14:07 ` [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
                   ` (11 subsequent siblings)
  13 siblings, 3 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - a list that closely tracks task
creation/destruction via fork()/clone().

But we want to remove this list so that it does not have to be
maintained from fork()/exit(), so convert the memory hotplug code
to iterate over all PGDs in the system via the task list.

Also improve the comments a bit, to make this function easier
to understand.

Only lightly tested, as I don't have a memory hotplug setup.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..1921acbd49fd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
  */
 void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 {
@@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 
 	for (address = start; address <= end; address += PGDIR_SIZE) {
 		const pgd_t *pgd_ref = pgd_offset_k(address);
-		struct page *page;
+		struct task_struct *g, *p;
 
 		/*
-		 * When it is called after memory hot remove, pgd_none()
-		 * returns true. In this case (removed == 1), we must clear
-		 * the PGD entries in the local PGD level page.
+		 * When this function is called after memory hot remove,
+		 * pgd_none() already returns true, but only the reference
+		 * kernel PGD has been cleared, not the process PGDs.
+		 *
+		 * So clear the affected entries in every process PGD as well:
 		 */
 		if (pgd_none(*pgd_ref) && !removed)
 			continue;
 
 		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
-			pgd_t *pgd;
+
+		for_each_process_thread(g, p) {
+			pgd_t *pgd = p->mm->pgd;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			if (!p->mm)
+				continue;
+
+			/* The pgt_lock is only used by Xen: */
+			pgt_lock = &p->mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
-				BUG_ON(pgd_page_vaddr(*pgd)
-				       != pgd_page_vaddr(*pgd_ref));
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
 			if (removed) {
 				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
-- 
2.1.4



* [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
  2015-06-11 14:07 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when memory hotplug removes a piece of physical memory from the
pagetable mappings, it also frees the underlying PUD table and clears
the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD entry mapped and the PUD table all clear - it's only a single 4K
page per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1921acbd49fd..2d5931d343cd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -770,27 +770,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -982,7 +961,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -993,13 +971,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.1.4



* [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (2 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-12 22:28   ` Oleg Nesterov
  2015-06-11 14:07 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Now that the memory hotplug code no longer removes PGD entries,
the only users of sync_global_pgds() call it after extending the
PGD.

So remove the 'removed' parameter and simplify the call sites.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h |  3 +--
 arch/x86/mm/fault.c               |  2 +-
 arch/x86/mm/init_64.c             | 19 +++++++------------
 3 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
 	native_set_pgd(pgd, native_make_pgd(0));
 }
 
-extern void sync_global_pgds(unsigned long start, unsigned long end,
-			     int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);
 
 /*
  * Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53bac3a7..50342825f221 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address)
 
 void vmalloc_sync_all(void)
 {
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
 }
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2d5931d343cd..beee532b76a7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
  * matching PGD entries in the local PGD level page as well.
  */
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
 
@@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 		 *
 		 * So clear the affected entries in every process PGD as well:
 		 */
-		if (pgd_none(*pgd_ref) && !removed)
+		if (pgd_none(*pgd_ref))
 			continue;
 
 		spin_lock(&pgd_lock);
@@ -197,13 +197,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
 				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
-			if (removed) {
-				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
-					pgd_clear(pgd);
-			} else {
-				if (pgd_none(*pgd))
-					set_pgd(pgd, *pgd_ref);
-			}
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
 		}
@@ -636,7 +631,7 @@ kernel_physical_mapping_init(unsigned long start,
 	}
 
 	if (pgd_changed)
-		sync_global_pgds(addr, end - 1, 0);
+		sync_global_pgds(addr, end - 1);
 
 	__flush_tlb_all();
 
@@ -1276,7 +1271,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 	else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
-		sync_global_pgds(start, end - 1, 0);
+		sync_global_pgds(start, end - 1);
 	return err;
 }
 
-- 
2.1.4



* [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (3 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 18:23   ` Andrew Morton
  2015-06-12 21:12   ` Oleg Nesterov
  2015-06-11 14:07 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().

The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.

Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/Kconfig       |  9 +++++++++
 fs/exec.c          |  3 +++
 include/linux/mm.h |  6 ++++++
 kernel/fork.c      | 16 ++++++++++++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..a8e866cd4247 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -491,6 +491,15 @@ config PGTABLE_LEVELS
 	int
 	default 2
 
+config ARCH_HAS_PGD_INIT_LATE
+	bool
+	help
+	  Architectures that want a late PGD initialization can define
+	  the arch_pgd_init_late() callback and it will be called
+	  by the generic new task (fork()) code after a new task has
+	  been made visible on the task list, but before it has been
+	  first scheduled.
+
 config ARCH_HAS_ELF_RANDOMIZE
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..c1d213c64fda 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
+
 	tsk->mm = mm;
+	arch_pgd_init_late(mm, mm->pgd);
+
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..35d887d2b038 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { }
+#endif
+
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..1f83ceca6c6c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * If we have a new PGD then initialize it:
+	 *
+	 * This method is called after a task has been made visible
+	 * on the task list already.
+	 *
+	 * Architectures that manage per task kernel pagetables
+	 * might use this callback to initialize them after they
+	 * are already visible to new updates.
+	 *
+	 * NOTE: any user-space parts of the PGD are already initialized
+	 *       and must not be clobbered.
+	 */
+	if (p->mm != current->mm)
+		arch_pgd_init_late(p->mm, p->mm->pgd);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
-- 
2.1.4



* [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (4 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 20:04   ` Boris Ostrovsky
  2015-06-12 22:50   ` Oleg Nesterov
  2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Prepare for lockless PGD init: enable the arch_pgd_init_late() callback
and add a 'careful' implementation of PGD init to it: only copy over
non-zero entries.

Since PGD entries only ever get added, this method catches any updates
to swapper_pg_dir[] that might have occurred between early PGD init
and late PGD init.

Note that this only matters for code that uses the task list instead
of the pgd_list to find all PGDs in the system.

Subsequent patches will convert pgd_list users to task-list iterations.

[ This adds extra overhead in that we do the PGD initialization for a
  second time - a later patch will simplify this, once we don't have
  old pgd_list users. ]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig      |  1 +
 arch/x86/mm/pgtable.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7e39f9b22705..15c19ce149f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_HAS_PGD_INIT_LATE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..e0bf90470d70 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	return NULL;
 }
 
+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
 + * PGD. The reference PGD (swapper_pg_dir) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
+{
+	/*
+	 * This is called after a new MM has been made visible
+	 * in fork() or exec().
+	 *
+	 * This barrier makes sure the MM is visible to new RCU
+	 * walkers before we initialize it, so that we don't miss
+	 * updates:
+	 */
+	smp_wmb();
+
+	/*
+	 * If the pgd points to a shared pagetable level (either the
+	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
+	 * references from swapper_pg_dir:
+	 */
+	if (CONFIG_PGTABLE_LEVELS == 2 ||
+	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+	    CONFIG_PGTABLE_LEVELS == 4) {
+
+		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =            pgd + KERNEL_PGD_BOUNDARY;
+		int i;
+
+		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+			/*
+			 * This is lock-less, so it can race with PGD updates
+			 * coming from vmalloc() or CPA methods, but it's safe,
+			 * because:
+			 *
+			 * 1) this PGD is not in use yet, we have still not
+			 *    scheduled this task.
+			 * 2) we only ever extend PGD entries
+			 *
+			 * So if we observe a non-zero PGD entry we can copy it,
+			 * it won't change from under us. Parallel updates (new
+			 * allocations) will modify our (already visible) PGD:
+			 */
+			if (pgd_val(*pgd_src))
+				WRITE_ONCE(*pgd_dst, *pgd_src);
+		}
+	}
+}
+
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-- 
2.1.4



* [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (5 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 20:37   ` Linus Torvalds
  2015-06-11 14:07 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen, this has only been build tested.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..87a8354435f8 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,18 +853,29 @@ static void xen_pgd_pin(struct mm_struct *mm)
  */
 void xen_mm_pin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
+	rcu_read_lock();
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process_thread(g, p) {
+		struct page *page;
+		pgd_t *pgd;
+
+		if (!p->mm)
+			continue;
+
+		pgd = p->mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_pin(&init_mm, pgd);
 			SetPageSavePinned(page);
 		}
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -967,19 +978,30 @@ static void xen_pgd_unpin(struct mm_struct *mm)
  */
 void xen_mm_unpin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
+	rcu_read_lock();
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process_thread(g, p) {
+		struct page *page;
+		pgd_t *pgd;
+
+		if (!p->mm)
+			continue;
+
+		pgd = p->mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (PageSavePinned(page)) {
 			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_unpin(&init_mm, pgd);
 			ClearPageSavePinned(page);
 		}
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
-- 
2.1.4



* [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (6 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 15:04   ` Andy Lutomirski
  2015-06-11 14:07 ` [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs.

This uses pgd_list currently, but we don't need the global list,
as we can walk the task list under RCU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c   | 21 ++++++++++++++-------
 arch/x86/mm/init_64.c |  3 ++-
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50342825f221..2d587e548b59 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,24 +235,31 @@ void vmalloc_sync_all(void)
 	for (address = VMALLOC_START & PMD_MASK;
 	     address >= TASK_SIZE && address < FIXADDR_TOP;
 	     address += PMD_SIZE) {
-		struct page *page;
 
+		struct task_struct *g, *p;
+
+		rcu_read_lock();
 		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+
+		for_each_process_thread(g, p) {
 			spinlock_t *pgt_lock;
-			pmd_t *ret;
+			pmd_t *pmd_ret;
 
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			if (!p->mm)
+				continue;
 
+			/* The pgt_lock is only used on Xen: */
+			pgt_lock = &p->mm->page_table_lock;
 			spin_lock(pgt_lock);
-			ret = vmalloc_sync_one(page_address(page), address);
+			pmd_ret = vmalloc_sync_one(p->mm->pgd, address);
 			spin_unlock(pgt_lock);
 
-			if (!ret)
+			if (!pmd_ret)
 				break;
 		}
+
 		spin_unlock(&pgd_lock);
+		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index beee532b76a7..730560c4873e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -184,11 +184,12 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_lock(&pgd_lock);
 
 		for_each_process_thread(g, p) {
-			pgd_t *pgd = p->mm->pgd;
+			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
 			if (!p->mm)
 				continue;
+			pgd = p->mm->pgd;
 
 			/* The pgt_lock is only used by Xen: */
 			pgt_lock = &p->mm->page_table_lock;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (7 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.

This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.

(This code already holds the pgd_lock.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 997fc97e9072..93c134fdb398 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -438,18 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	set_pte_atomic(kpte, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
-		struct page *page;
+		struct task_struct *g, *p;
 
-		list_for_each_entry(page, &pgd_list, lru) {
+		rcu_read_lock();
+
+		for_each_process_thread(g, p) {
+			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			if (!p->mm)
+				continue;
+
+			pgt_lock = &p->mm->page_table_lock;
+			spin_lock(pgt_lock);
+
+			pgd = p->mm->pgd + pgd_index(address);
 			pud = pud_offset(pgd, address);
 			pmd = pmd_offset(pud, address);
 			set_pte_atomic((pte_t *)pmd, pte);
+
+			spin_unlock(pgt_lock);
 		}
+
+		rcu_read_unlock();
 	}
 #endif
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (8 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 11/12] x86/mm: Remove pgd_list leftovers Ingo Molnar
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform specific code setting up kernel pagetables.

The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.

The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.

So we can remove the pgd_list addition/removal from this code.

The new PGD is private while constructed, so it needs no extra
locking.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++------------------------
 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e0bf90470d70..d855010a111f 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
 	}
-
-	/* list required to sync kernel mapping updates */
-	if (!SHARED_KERNEL_PMD) {
-		pgd_set_mm(pgd, mm);
-		pgd_list_add(pgd);
-	}
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	spin_lock(&pgd_lock);
-	pgd_list_del(pgd);
-	spin_unlock(&pgd_lock);
 }
 
 /*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * Make sure that pre-populating the pmds is atomic with
-	 * respect to anything walking the pgd_list, so that they
-	 * never see a partially populated pgd.
+	 * No locking is needed here, as the PGD is still private,
+	 * so no code walking the task list and looking at mm->pgd
+	 * will be able to see it before it's fully constructed:
 	 */
-	spin_lock(&pgd_lock);
-
 	pgd_ctor(mm, pgd);
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
-	spin_unlock(&pgd_lock);
-
 	return pgd;
 
 out_free_pmds:
@@ -451,7 +431,6 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
 	_pgd_free(pgd);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 11/12] x86/mm: Remove pgd_list leftovers
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (9 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 12/12] x86/mm: Simplify pgd_alloc() Ingo Molnar
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Nothing uses the pgd_list anymore - remove the list itself and its helpers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |  3 ---
 arch/x86/mm/fault.c            |  1 -
 arch/x86/mm/pgtable.c          | 26 --------------------------
 3 files changed, 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fe57e7a98839..7e93438d8b53 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 
 extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2d587e548b59..bebdd97f888b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);
 
 #ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index d855010a111f..0666f7796806 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_del(&page->lru);
-}
-
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
-	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
-	virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
-	return (struct mm_struct *)page->index;
-}
-
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 {
 	/* If the pgd points to a shared pagetable level (either the
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 12/12] x86/mm: Simplify pgd_alloc()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (10 preceding siblings ...)
  2015-06-11 14:07 ` [PATCH 11/12] x86/mm: Remove pgd_list leftovers Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 14:13   ` Ingo Molnar
  2015-06-11 16:46 ` H. Peter Anvin
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.

This is not necessary: it is enough to clear it, as the PGD will
then be properly updated by arch_pgd_init_late().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0666f7796806..b1bd35f452ef 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
-	/* If the pgd points to a shared pagetable level (either the
-	   ptes in non-PAE, or shared PMD in PAE), then just copy the
-	   references from swapper_pg_dir. */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
-	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
-		clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
-				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
-				KERNEL_PGD_PTRS);
-	}
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * No locking is needed here, as the PGD is still private,
-	 * so no code walking the task list and looking at mm->pgd
-	 * will be able to see it before it's fully constructed:
+	 * Zero out the kernel portion here, we'll set them up in
+	 * arch_pgd_init_late(), when the pgd is globally
+	 * visible already per the task list, so that it cannot
+	 * miss updates.
+	 *
+	 * We need to zero it here, to make sure arch_pgd_init_late()
+	 * can initialize them without locking.
 	 */
-	pgd_ctor(mm, pgd);
+	memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
 	return pgd;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
@ 2015-06-11 14:13   ` Ingo Molnar
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Oleg Nesterov, Thomas Gleixner, Waiman Long


[ I fat-fingered the linux-mm Cc:, so every reply will bounce on that,
  sorry about that  :-/ Fixed it in this mail's Cc: list. ]

* Ingo Molnar <mingo@kernel.org> wrote:

> Waiman Long reported 'pgd_lock' contention on high CPU count systems and 
> proposed moving pgd_lock on a separate cacheline to eliminate false sharing and 
> to reduce some of the lock bouncing overhead.

So 'pgd_lock' is a global lock, used for every new task creation:

arch/x86/mm/fault.c:DEFINE_SPINLOCK(pgd_lock);

which with a sufficiently high CPU count starts to hurt.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-06-11 14:07 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
@ 2015-06-11 15:04   ` Andy Lutomirski
  2015-06-11 15:49     ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-06-11 15:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
> The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
> the global reference kernel PGD to task PGDs.

Does it?  AFAICS the only caller is register_die_notifier, and it's
not really clear to me why that exists.

At some point I'd love to remove lazy kernel PGD sync from the kernel
entirely (or at least from x86) and just do it when we switch mms.
Now that you're removing all code that deletes kernel PGD entries, I
think all we'd need to do is to add a per-PGD or per-mm count of the
number of kernel entries populated and to fix it up when we switch to
an mm with fewer entries populated than init_mm.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-06-11 15:04   ` Andy Lutomirski
@ 2015-06-11 15:49     ` Ingo Molnar
  2015-06-11 16:10       ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-11 15:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mml, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
> > the global reference kernel PGD to task PGDs.
> 
> Does it?  AFAICS the only caller is register_die_notifier, and it's
> not really clear to me why that exists.

Doh, indeed, got confused in that changelog - we are filling it in 
opportunistically via vmalloc_fault().

> At some point I'd love to remove lazy kernel PGD sync from the kernel entirely 
> (or at least from x86) and just do it when we switch mms. Now that you're 
> removing all code that deletes kernel PGD entries, I think all we'd need to do 
> is to add a per-PGD or per-mm count of the number of kernel entries populated 
> and to fix it up when we switch to an mm with fewer entries populated than 
> init_mm.

That would add a (cheap but nonzero) runtime check to every context switch. It's a 
relatively slow path, but in comparison vmalloc() is an even slower slowpath, so why 
not do it there and just do synchronous updates and remove the vmalloc faults 
altogether?

Also, on 64-bit it should not matter much: there the only change is the once in a 
blue moon case where we allocate a new pgd for a 512 GB block of address space 
that a single pgd entry covers.

I'd hate to add a check to every context switch, no matter how cheap, just for a 
case that essentially never triggers...

So how about this solution instead:

 - we add a generation counter to sync_global_pgds() so that it can detect when
   the number of pgds populated in init_mm changes.

 - we change vmalloc() to call sync_global_pgds(): this will be very cheap in the 
   overwhelming majority of cases.

 - we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-06-11 15:49     ` Ingo Molnar
@ 2015-06-11 16:10       ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-06-11 16:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On Thu, Jun 11, 2015 at 8:49 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
>> > the global reference kernel PGD to task PGDs.
>>
>> Does it?  AFAICS the only caller is register_die_notifier, and it's
>> not really clear to me why that exists.
>
> Doh, indeed, got confused in that changelog - we are filling it in
> opportunistically via vmalloc_fault().
>
>> At some point I'd love to remove lazy kernel PGD sync from the kernel entirely
>> (or at least from x86) and just do it when we switch mms. Now that you're
>> removing all code that deletes kernel PGD entries, I think all we'd need to do
>> is to add a per-PGD or per-mm count of the number of kernel entries populated
>> and to fix it up when we switch to an mm with fewer entries populated than
>> init_mm.
>
> That would add a (cheap but nonzero) runtime check to every context switch. It's a
> relative slow path, but in comparison vmalloc() is an even slower slowpath, so why
> not do it there and just do synchronous updates and remove the vmalloc faults
> altogether?
>
> Also, on 64-bit it should not matter much: there the only change is the once in a
> blue moon case where we allocate a new pgd for a 512 GB block of address space
> that a single pgd entry covers.
>
> I'd hate to add a check to every context switch, no matter how cheap, just for a
> case that essentially never triggers...
>
> So how about this solution instead:
>
>  - we add a generation counter to sync_global_pgds() so that it can detect when
>    the number of pgds populated in init_mm changes.
>
>  - we change vmalloc() to call sync_global_pgds(): this will be very cheap in the
>    overwhelming majority of cases.
>
>  - we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-)

Should be good.  The only real down side is that the update becomes a
walk over all tasks or over all mms, whereas if we checked on context
switches, then the update is just a single PGD entry copy.

That being said, the updates should be *really* rare.  I think it's
completely impossible for it to happen more than 255 times per boot.
So I like your solution.

Then I can stop wondering what happens when we manage to take a
vmalloc fault early in entry_SYSCALL_64 or somewhere in the page_fault
prologue.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free()
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
                   ` (12 preceding siblings ...)
  2015-06-11 14:13   ` Ingo Molnar
@ 2015-06-11 16:46 ` H. Peter Anvin
  13 siblings, 0 replies; 51+ messages in thread
From: H. Peter Anvin @ 2015-06-11 16:46 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, Linus Torvalds,
	Oleg Nesterov, Thomas Gleixner, Waiman Long

On 06/11/2015 07:07 AM, Ingo Molnar wrote:
> Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
> moving pgd_lock on a separate cacheline to eliminate false sharing and to reduce
> some of the lock bouncing overhead.
>
> I think we can do much better: this series eliminates the pgd_list and makes
> pgd_alloc()/pgd_free() lockless.

This ties into a slightly different issue, which is how to deal 
*properly* with "special" PGDs like the 1:1 trampoline and the UEFI page 
tables.  These, too, should be able to incorporate, possibly more than 
once, page tables further down.

	-hpa


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-11 14:07 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
@ 2015-06-11 18:23   ` Andrew Morton
  2015-06-12  8:16     ` Ingo Molnar
  2015-06-12 21:12   ` Oleg Nesterov
  1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2015-06-11 18:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar <mingo@kernel.org> wrote:

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	syscall_tracepoint_update(p);
>  	write_unlock_irq(&tasklist_lock);
>  
> +	/*
> +	 * If we have a new PGD then initialize it:
> +	 *
> +	 * This method is called after a task has been made visible
> +	 * on the task list already.
> +	 *
> +	 * Architectures that manage per task kernel pagetables
> +	 * might use this callback to initialize them after they
> +	 * are already visible to new updates.
> +	 *
> +	 * NOTE: any user-space parts of the PGD are already initialized
> +	 *       and must not be clobbered.
> +	 */
> +	if (p->mm != current->mm)
> +		arch_pgd_init_late(p->mm, p->mm->pgd);

Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)?


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-11 14:07 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
@ 2015-06-11 20:04   ` Boris Ostrovsky
  2015-06-12  8:19     ` Ingo Molnar
  2015-06-12 22:50   ` Oleg Nesterov
  1 sibling, 1 reply; 51+ messages in thread
From: Boris Ostrovsky @ 2015-06-11 20:04 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long,
	Konrad Rzeszutek Wilk, David Vrabel

On 06/11/2015 10:07 AM, Ingo Molnar wrote:

> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index fb0a9dd1d6e4..e0bf90470d70 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>   	return NULL;
>   }
>
> +/*
> + * Initialize the kernel portion of the PGD.
> + *
> + * This is done separately, because pgd_alloc() happens when
> + * the task is not on the task list yet - and PGD updates
> + * happen by walking the task list.
> + *
> + * No locking is needed here, as we just copy over the reference
> + * PGD. The reference PGD (pgtable_init) is only ever expanded
> + * at the highest, PGD level. Thus any other task extending it
> + * will first update the reference PGD, then modify the task PGDs.
> + */
> +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> +{
> +	/*
> +	 * This is called after a new MM has been made visible
> +	 * in fork() or exec().
> +	 *
> +	 * This barrier makes sure the MM is visible to new RCU
> +	 * walkers before we initialize it, so that we don't miss
> +	 * updates:
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * If the pgd points to a shared pagetable level (either the
> +	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
> +	 * references from swapper_pg_dir:
> +	 */
> +	if (CONFIG_PGTABLE_LEVELS == 2 ||
> +	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
> +	    CONFIG_PGTABLE_LEVELS == 4) {
> +
> +		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
> +		pgd_t *pgd_dst =            pgd + KERNEL_PGD_BOUNDARY;
> +		int i;
> +
> +		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
> +			/*
> +			 * This is lock-less, so it can race with PGD updates
> +			 * coming from vmalloc() or CPA methods, but it's safe,
> +			 * because:
> +			 *
> +			 * 1) this PGD is not in use yet, we have still not
> +			 *    scheduled this task.
> +			 * 2) we only ever extend PGD entries
> +			 *
> +			 * So if we observe a non-zero PGD entry we can copy it,
> +			 * it won't change from under us. Parallel updates (new
> +			 * allocations) will modify our (already visible) PGD:
> +			 */
> +			if (pgd_val(*pgd_src))
> +				WRITE_ONCE(*pgd_dst, *pgd_src);


This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a 
Xen PV guest.

I don't know whether anything would need to be done wrt WRITE_ONCE. 
Perhaps put it into native_set_pgd()?

-boris



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
@ 2015-06-11 20:37   ` Linus Torvalds
  2015-06-11 20:41     ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2015-06-11 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long

On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.

This is not at all equivalent, and it looks stupid.

Don't use "for_each_process_thread(g, p)". You only care about each
mm, and threads all share the same mm, so just do

    for_each_process(p)

instead of iterating over all threads too.

No?

               Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 20:37   ` Linus Torvalds
@ 2015-06-11 20:41     ` Linus Torvalds
  2015-06-12  7:23       ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2015-06-11 20:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long

On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Don't use "for_each_process_thread(g, p)". You only care about each
> mm, and threads all share the same mm, so just do
>
>     for_each_process(p)
>
> instead of iterating over all threads too.

Hmm. I may be wrong. It strikes me that one of the group leaders might
have exited but the subthreads live on. We'd see p->mm being NULL,
even though the mm was originally in use.

Ugh. So maybe the code really does need to iterate over all threads.

                  Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 20:41     ` Linus Torvalds
@ 2015-06-12  7:23       ` Ingo Molnar
       [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-12  7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Don't use "for_each_process_thread(g, p)". You only care about each mm, and 
> > threads all share the same mm, so just do
> >
> >     for_each_process(p)
> >
> > instead of iterating over all threads too.
> 
> Hmm. I may be wrong. It strikes me that one of the group leaders might have 
> exited but the subthreads live on. We'd see p->mm being NULL, even though the mm 
> was originally in use.
> 
> Ugh. So maybe the code really does need to iterate over all threads.

Yeah, for_each_process() indeed does not guarantee that we iterate over all mm's.

We might make it so: but that would mean restricting certain clone_flags variants 
- not sure that's possible with our current ABI usage?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
       [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
@ 2015-06-12  8:04           ` Ingo Molnar
  2015-06-12 20:38             ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-12  8:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Thomas Gleixner, Denys Vlasenko, Borislav Petkov,
	Andrew Morton, Oleg Nesterov, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Jun 12, 2015 00:23, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> > We might make it so: but that would mean restricting certain clone_flags 
> > variants - not sure that's possible with our current ABI usage?
> 
> We already do that. You can't share signal info unless you share the mm. And a 
> shared signal state is what defines a thread group.
> 
> So I think the only issue is that ->mm can become NULL when the thread group 
> leader dies - a non-NULL mm should always be shared among all threads.

Indeed, we do that in exit_mm().

So we could add tsk->mm_leader or so, which does not get cleared and which the 
scheduler does not look at, but I'm not sure it's entirely safe that way: we don't 
have a refcount, and when the last thread exits it becomes bogus for a small 
window until the zombie leader is unlinked from the task list.

To close that race we'd have __mmdrop() or so clear out tsk->mm_leader - but the 
task doing the mmdrop() might be a lazy thread totally unrelated to the original 
thread group so we don't know which tsk->mm_leader to clear out.

To solve that we'd have to track the leader owning an MM in mm_struct - which gets 
interesting for the exec() case where the thread group gets a new leader, so we'd 
have to re-link the mm's leader pointer there.

So unless I missed some simpler solution, there are a good number of steps where this 
could go wrong, in small-looking race windows - how about we just live with 
iterating through all tasks instead of just all processes, once per 512 GB of 
memory mapped?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-11 18:23   ` Andrew Morton
@ 2015-06-12  8:16     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-12  8:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar <mingo@kernel.org> wrote:
> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  	syscall_tracepoint_update(p);
> >  	write_unlock_irq(&tasklist_lock);
> >  
> > +	/*
> > +	 * If we have a new PGD then initialize it:
> > +	 *
> > +	 * This method is called after a task has been made visible
> > +	 * on the task list already.
> > +	 *
> > +	 * Architectures that manage per task kernel pagetables
> > +	 * might use this callback to initialize them after they
> > +	 * are already visible to new updates.
> > +	 *
> > +	 * NOTE: any user-space parts of the PGD are already initialized
> > +	 *       and must not be clobbered.
> > +	 */
> > +	if (p->mm != current->mm)
> > +		arch_pgd_init_late(p->mm, p->mm->pgd);
> 
> Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)?

Indeed - fixed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-11 20:04   ` Boris Ostrovsky
@ 2015-06-12  8:19     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-12  8:19 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Oleg Nesterov, Thomas Gleixner,
	Waiman Long, Konrad Rzeszutek Wilk, David Vrabel


* Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:

> On 06/11/2015 10:07 AM, Ingo Molnar wrote:
> 
> >diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> >index fb0a9dd1d6e4..e0bf90470d70 100644
> >--- a/arch/x86/mm/pgtable.c
> >+++ b/arch/x86/mm/pgtable.c
> >@@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> >  	return NULL;
> >  }
> >
> >+/*
> >+ * Initialize the kernel portion of the PGD.
> >+ *
> >+ * This is done separately, because pgd_alloc() happens when
> >+ * the task is not on the task list yet - and PGD updates
> >+ * happen by walking the task list.
> >+ *
> >+ * No locking is needed here, as we just copy over the reference
> >+ * PGD. The reference PGD (pgtable_init) is only ever expanded
> >+ * at the highest, PGD level. Thus any other task extending it
> >+ * will first update the reference PGD, then modify the task PGDs.
> >+ */
> >+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> >+{
> >+	/*
> >+	 * This is called after a new MM has been made visible
> >+	 * in fork() or exec().
> >+	 *
> >+	 * This barrier makes sure the MM is visible to new RCU
> >+	 * walkers before we initialize it, so that we don't miss
> >+	 * updates:
> >+	 */
> >+	smp_wmb();
> >+
> >+	/*
> >+	 * If the pgd points to a shared pagetable level (either the
> >+	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
> >+	 * references from swapper_pg_dir:
> >+	 */
> >+	if (CONFIG_PGTABLE_LEVELS == 2 ||
> >+	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
> >+	    CONFIG_PGTABLE_LEVELS == 4) {
> >+
> >+		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
> >+		pgd_t *pgd_dst =            pgd + KERNEL_PGD_BOUNDARY;
> >+		int i;
> >+
> >+		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
> >+			/*
> >+			 * This is lock-less, so it can race with PGD updates
> >+			 * coming from vmalloc() or CPA methods, but it's safe,
> >+			 * because:
> >+			 *
> >+			 * 1) this PGD is not in use yet, we have still not
> >+			 *    scheduled this task.
> >+			 * 2) we only ever extend PGD entries
> >+			 *
> >+			 * So if we observe a non-zero PGD entry we can copy it,
> >+			 * it won't change from under us. Parallel updates (new
> >+			 * allocations) will modify our (already visible) PGD:
> >+			 */
> >+			if (pgd_val(*pgd_src))
> >+				WRITE_ONCE(*pgd_dst, *pgd_src);
> 
> 
> This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a Xen
> PV guest.

Thanks, fixed.

> I don't know whether anything would need to be done wrt WRITE_ONCE. Perhaps put 
> it into native_set_pgd()?

So the WRITE_ONCE() was just write-tearing paranoia around the raw pgd value copy I 
did, which is an unusual pattern - but I'll use set_pgd(), as it clearly works and 
has the paravirt callback as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12  8:04           ` Ingo Molnar
@ 2015-06-12 20:38             ` Oleg Nesterov
  2015-06-12 20:53               ` Oleg Nesterov
  2015-06-13  7:26               ` Ingo Molnar
  0 siblings, 2 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 20:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So I think the only issue is that ->mm can become NULL when the thread group
> > leader dies - a non-NULL mm should always be shared among all threads.
>
> Indeed, we do that in exit_mm().

Yes,

> So we could add tsk->mm_leader or so,

No, no, please do not. Just do something like

	for_each_process(p) {

		for_each_thread(p, t) {
			if (t->mm) {
				do_something(t->mm);
				break;
			}
		}
	}

But either way I don't understand what protects this ->mm. Perhaps this needs
find_lock_task_mm().

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:38             ` Oleg Nesterov
@ 2015-06-12 20:53               ` Oleg Nesterov
  2015-06-12 22:30                 ` Boris Ostrovsky
  2015-06-13  7:26               ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 20:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12, Oleg Nesterov wrote:
>
> On 06/12, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > So I think the only issue is that ->mm can become NULL when the thread group
> > > leader dies - a non-NULL mm should always be shared among all threads.
> >
> > Indeed, we do that in exit_mm().
>
> Yes,
>
> > So we could add tsk->mm_leader or so,
>
> No, no, please do not. Just do something like
>
> 	for_each_process(p) {
>
> 		for_each_thread(p, t) {
> 			if (t->mm) {
> 				do_something(t->mm);
> 				break;
> 			}
> 		}
> 	}
>
> But either way I don't understand what protects this ->mm. Perhaps this needs
> find_lock_task_mm().

And, I don't understand this code, probably this doesn't matter, but.

unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
and miss the new child. Is it OK?

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-11 14:07 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
  2015-06-11 18:23   ` Andrew Morton
@ 2015-06-12 21:12   ` Oleg Nesterov
  2015-06-13  6:54     ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 21:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/11, Ingo Molnar wrote:
>
> @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	syscall_tracepoint_update(p);
>  	write_unlock_irq(&tasklist_lock);
>  
> +	/*
> +	 * If we have a new PGD then initialize it:
> +	 *
> +	 * This method is called after a task has been made visible
> +	 * on the task list already.
> +	 *
> +	 * Architectures that manage per task kernel pagetables
> +	 * might use this callback to initialize them after they
> +	 * are already visible to new updates.
> +	 *
> +	 * NOTE: any user-space parts of the PGD are already initialized
> +	 *       and must not be clobbered.
> +	 */
> +	if (p->mm != current->mm)
> +		arch_pgd_init_late(p->mm, p->mm->pgd);
> +

Cosmetic, but imo

	if (!(clone_flags & CLONE_VM))
		arch_pgd_init_late(...);

will look better and more consistent.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
@ 2015-06-12 22:22   ` Oleg Nesterov
  2015-06-12 22:48   ` Waiman Long
  2015-06-12 23:15   ` Oleg Nesterov
  2 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 22:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/11, Ingo Molnar wrote:
>
>  void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>  {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>  
>  	for (address = start; address <= end; address += PGDIR_SIZE) {
>  		const pgd_t *pgd_ref = pgd_offset_k(address);
> -		struct page *page;
> +		struct task_struct *g, *p;
>  
>  		/*
> -		 * When it is called after memory hot remove, pgd_none()
> -		 * returns true. In this case (removed == 1), we must clear
> -		 * the PGD entries in the local PGD level page.
> +		 * When this function is called after memory hot remove,
> +		 * pgd_none() already returns true, but only the reference
> +		 * kernel PGD has been cleared, not the process PGDs.
> +		 *
> +		 * So clear the affected entries in every process PGD as well:
>  		 */
>  		if (pgd_none(*pgd_ref) && !removed)
>  			continue;
>  
>  		spin_lock(&pgd_lock);
> -		list_for_each_entry(page, &pgd_list, lru) {
> -			pgd_t *pgd;
> +
> +		for_each_process_thread(g, p) {

Well, this looks obviously unsafe without rcu_read_lock() at least.

The usage of ->mm doesn't look safe too but this is fixeable, see
my previous reply to 7/12.

And probably I am totally confused, but it seems that 06/12 should
come before this patch? Otherwise, why we can't race with fork() and
miss the new process?

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()
  2015-06-11 14:07 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
@ 2015-06-12 22:28   ` Oleg Nesterov
  2015-06-13  7:51     ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 22:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

Yes I guess I am totally confused ;)

On 06/11, Ingo Molnar wrote:
>
> @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>  		 *
>  		 * So clear the affected entries in every process PGD as well:
>  		 */
> -		if (pgd_none(*pgd_ref) && !removed)
> +		if (pgd_none(*pgd_ref))

But doesn't this change invalidate the comment above?

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:53               ` Oleg Nesterov
@ 2015-06-12 22:30                 ` Boris Ostrovsky
  2015-06-12 23:36                   ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Boris Ostrovsky @ 2015-06-12 22:30 UTC (permalink / raw)
  To: Oleg Nesterov, Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
> On 06/12, Oleg Nesterov wrote:
>>
>> On 06/12, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> So I think the only issue is that ->mm can become NULL when the thread group
>>>> leader dies - a non-NULL mm should always be shared among all threads.
>>>
>>> Indeed, we do that in exit_mm().
>>
>> Yes,
>>
>>> So we could add tsk->mm_leader or so,
>>
>> No, no, please do not. Just do something like
>>
>> 	for_each_process(p) {
>>
>> 		for_each_thread(p, t) {
>> 			if (t->mm) {
>> 				do_something(t->mm);
>> 				break;
>> 			}
>> 		}
>> 	}
>>
>> But either way I don't understand what protects this ->mm. Perhaps this needs
>> find_lock_task_mm().
>
> And, I don't understand this code, probably this doesn't matter, but.
>
> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
> and miss the new child. Is it OK?


Currently xen_mm_pin_all() is only called in the suspend path, out of 
stop_machine(), so presumably at that time fork is not possible.


-boris

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
  2015-06-12 22:22   ` Oleg Nesterov
@ 2015-06-12 22:48   ` Waiman Long
  2015-06-13  7:46     ` Ingo Molnar
  2015-06-12 23:15   ` Oleg Nesterov
  2 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2015-06-12 22:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Oleg Nesterov, Thomas Gleixner

On 06/11/2015 10:07 AM, Ingo Molnar wrote:
> The memory hotplug code uses sync_global_pgds() to synchronize updates
> to the global (&init_mm) kernel PGD and the task PGDs. It does this
> by iterating over the pgd_list - a list that closely tracks task
> creation/destruction via fork/clone.
>
> But we want to remove this list, so that it does not have to be
> maintained from fork()/exit(), so convert the memory hotplug code
> to use the task list to iterate over all pgds in the system.
>
> Also improve the comments a bit, to make this function easier
> to understand.
>
> Only lightly tested, as I don't have a memory hotplug setup.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>   arch/x86/mm/init_64.c | 30 +++++++++++++++++-------------
>   1 file changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..1921acbd49fd 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
>   __setup("noexec32=", nonx32_setup);
>
>   /*
> - * When memory was added/removed make sure all the processes MM have
> - * suitable PGD entries in the local PGD level page.
> + * When memory was added/removed make sure all the process MMs have
> + * matching PGD entries in the local PGD level page as well.
>    */
>   void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>   {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
>   	for (address = start; address <= end; address += PGDIR_SIZE) {
>   		const pgd_t *pgd_ref = pgd_offset_k(address);
> -		struct page *page;
> +		struct task_struct *g, *p;
>
>   		/*
> -		 * When it is called after memory hot remove, pgd_none()
> -		 * returns true. In this case (removed == 1), we must clear
> -		 * the PGD entries in the local PGD level page.
> +		 * When this function is called after memory hot remove,
> +		 * pgd_none() already returns true, but only the reference
> +		 * kernel PGD has been cleared, not the process PGDs.
> +		 *
> +		 * So clear the affected entries in every process PGD as well:
>   		 */
>   		if (pgd_none(*pgd_ref) && !removed)
>   			continue;
>
>   		spin_lock(&pgd_lock);
> -		list_for_each_entry(page, &pgd_list, lru) {
> -			pgd_t *pgd;
> +
> +		for_each_process_thread(g, p) {
> +			pgd_t *pgd = p->mm->pgd;
>   			spinlock_t *pgt_lock;
>
> -			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> -			/* the pgt_lock only for Xen */
> -			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> +			if (!p->mm)
> +				continue;

pgd is initialized to p->mm->pgd before the "p->mm" NULL check is done. 
Shouldn't the initialization be moved after that check?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-11 14:07 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
  2015-06-11 20:04   ` Boris Ostrovsky
@ 2015-06-12 22:50   ` Oleg Nesterov
  2015-06-12 22:57     ` Oleg Nesterov
  2015-06-13  6:47     ` Ingo Molnar
  1 sibling, 2 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 22:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/11, Ingo Molnar wrote:
>
> +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> +{
> +	/*
> +	 * This is called after a new MM has been made visible
> +	 * in fork() or exec().
> +	 *
> +	 * This barrier makes sure the MM is visible to new RCU
> +	 * walkers before we initialize it, so that we don't miss
> +	 * updates:
> +	 */
> +	smp_wmb();

I can't understand the comment and the barrier...

Afaics, we need to ensure that:

> +			if (pgd_val(*pgd_src))
> +				WRITE_ONCE(*pgd_dst, *pgd_src);

either we notice the recent update of this PGD, or (say) the subsequent
sync_global_pgds() can miss the child.

How the write barrier can help?

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-12 22:50   ` Oleg Nesterov
@ 2015-06-12 22:57     ` Oleg Nesterov
  2015-06-13  6:49       ` Ingo Molnar
  2015-06-13  6:47     ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 22:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/13, Oleg Nesterov wrote:
>
> Afaics, we need to ensure that:
>
> > +			if (pgd_val(*pgd_src))
> > +				WRITE_ONCE(*pgd_dst, *pgd_src);
>
> either we notice the recent update of this PGD, or (say) the subsequent
> sync_global_pgds() can miss the child.

and perhaps !pgd_none(*pgd_src) will look better.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
  2015-06-12 22:22   ` Oleg Nesterov
  2015-06-12 22:48   ` Waiman Long
@ 2015-06-12 23:15   ` Oleg Nesterov
  2015-06-13  7:48     ` Ingo Molnar
  2 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 23:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/11, Ingo Molnar wrote:
>
>  void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>  {
> @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
>  	for (address = start; address <= end; address += PGDIR_SIZE) {
>  		const pgd_t *pgd_ref = pgd_offset_k(address);
> -		struct page *page;
> +		struct task_struct *g, *p;
>
>  		/*
> -		 * When it is called after memory hot remove, pgd_none()
> -		 * returns true. In this case (removed == 1), we must clear
> -		 * the PGD entries in the local PGD level page.
> +		 * When this function is called after memory hot remove,
> +		 * pgd_none() already returns true, but only the reference
> +		 * kernel PGD has been cleared, not the process PGDs.
> +		 *
> +		 * So clear the affected entries in every process PGD as well:
>  		 */
>  		if (pgd_none(*pgd_ref) && !removed)
>  			continue;
>
>  		spin_lock(&pgd_lock);
> -		list_for_each_entry(page, &pgd_list, lru) {
> -			pgd_t *pgd;
> +
> +		for_each_process_thread(g, p) {
> +			pgd_t *pgd = p->mm->pgd;

and we use the same pgd later: set_pgd(pgd, *pgd_ref).
looks like this should be pgd_offset(p->mm, address) ?

And obviously you need to check ->mm != NULL first?

And perhaps it makes sense to swap "for (address)" and for_each_process()
loops...

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 22:30                 ` Boris Ostrovsky
@ 2015-06-12 23:36                   ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-12 23:36 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Thomas Gleixner,
	Denys Vlasenko, Borislav Petkov, Andrew Morton, Andy Lutomirski,
	linux-mml, Linux Kernel Mailing List, Brian Gerst,
	H. Peter Anvin, Peter Zijlstra

On 06/12, Boris Ostrovsky wrote:
>
> On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
>>>
>>> 	for_each_process(p) {
>>>
>>> 		for_each_thread(p, t) {
>>> 			if (t->mm) {
>>> 				do_something(t->mm);
>>> 				break;
>>> 			}
>>> 		}
>>> 	}
>>>
>>> But either way I don't understand what protects this ->mm. Perhaps this needs
>>> find_lock_task_mm().
>>
>> And, I don't understand this code, probably this doesn't matter, but.
>>
>> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
>> and miss the new child. Is it OK?
>
>
> Currently xen_mm_pin_all() is only called in the suspend path, out of
> stop_machine(), so presumably at that time fork is not possible.

OK, thanks, this also means that this code shouldn't worry about ->mm,
it should be stable.

But for_each_process() in sync_global_pgds() should, afaics.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-12 22:50   ` Oleg Nesterov
  2015-06-12 22:57     ` Oleg Nesterov
@ 2015-06-13  6:47     ` Ingo Molnar
  2015-06-13  6:52       ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  6:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> > +{
> > +	/*
> > +	 * This is called after a new MM has been made visible
> > +	 * in fork() or exec().
> > +	 *
> > +	 * This barrier makes sure the MM is visible to new RCU
> > +	 * walkers before we initialize it, so that we don't miss
> > +	 * updates:
> > +	 */
> > +	smp_wmb();
> 
> I can't understand the comment and the barrier...
> 
> Afaics, we need to ensure that:
> 
> > +			if (pgd_val(*pgd_src))
> > +				WRITE_ONCE(*pgd_dst, *pgd_src);
> 
> either we notice the recent update of this PGD, or (say) the subsequent
> sync_global_pgds() can miss the child.
> 
> How the write barrier can help?

So the real thing this pairs with is the earlier:

	tsk->mm = mm;

plus the linking of the new task in the task list.

_that_ write must become visible to others before we do the (conditional) copy 
ourselves.

Granted, it happens quite a bit earlier, and the task linking's use of locking is 
a natural barrier - but since this is lockless I didn't want to leave a silent 
assumption in.

Perhaps remove the barrier and just leave a comment in that describes the 
assumption on task-linking being a full barrier?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-12 22:57     ` Oleg Nesterov
@ 2015-06-13  6:49       ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  6:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/13, Oleg Nesterov wrote:
> >
> > Afaics, we need to ensure that:
> >
> > > +			if (pgd_val(*pgd_src))
> > > +				WRITE_ONCE(*pgd_dst, *pgd_src);
> >
> > either we notice the recent update of this PGD, or (say) the subsequent
> > sync_global_pgds() can miss the child.
> 
> and perhaps !pgd_none(*pgd_src) will look better.

Indeed - fixed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-13  6:47     ` Ingo Molnar
@ 2015-06-13  6:52       ` Ingo Molnar
  2015-06-13 17:45         ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  6:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Ingo Molnar <mingo@kernel.org> wrote:

> * Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > On 06/11, Ingo Molnar wrote:
> > >
> > > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
> > > +{
> > > +	/*
> > > +	 * This is called after a new MM has been made visible
> > > +	 * in fork() or exec().
> > > +	 *
> > > +	 * This barrier makes sure the MM is visible to new RCU
> > > +	 * walkers before we initialize it, so that we don't miss
> > > +	 * updates:
> > > +	 */
> > > +	smp_wmb();
> > 
> > I can't understand the comment and the barrier...
> > 
> > Afaics, we need to ensure that:
> > 
> > > +			if (pgd_val(*pgd_src))
> > > +				WRITE_ONCE(*pgd_dst, *pgd_src);
> > 
> > either we notice the recent update of this PGD, or (say) the subsequent
> > sync_global_pgds() can miss the child.
> > 
> > How the write barrier can help?
> 
> So the real thing this pairs with is the earlier:
> 
> 	tsk->mm = mm;
> 
> plus the linking of the new task in the task list.
> 
> _that_ write must become visible to others before we do the (conditional) copy 
> ourselves.
> 
> Granted, it happens quite a bit earlier, and the task linking's use of locking 
> is a natural barrier - but since this is lockless I didn't want to leave a 
> silent assumption in.
> 
> Perhaps remove the barrier and just leave a comment in that describes the 
> assumption on task-linking being a full barrier?

Ah, there's another detail I forgot. This might handle the fork case, but in 
exec() we have:

        tsk->mm = mm;
        arch_pgd_init_late(mm);

and since the task is already linked, here we need the barrier.

So how about I improve the comment to:

        /*
         * This function is called after a new MM has been made visible
         * in fork() or exec() via:
         *
         *   tsk->mm = mm;
         *
         * This barrier makes sure the MM is visible to new RCU
         * walkers before we initialize the pagetables below, so that
         * we don't miss updates:
         */
        smp_wmb();

and leave the barrier there?

Thanks,

	Ingo


* Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-12 21:12   ` Oleg Nesterov
@ 2015-06-13  6:54     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  6:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  	syscall_tracepoint_update(p);
> >  	write_unlock_irq(&tasklist_lock);
> >  
> > +	/*
> > +	 * If we have a new PGD then initialize it:
> > +	 *
> > +	 * This method is called after a task has been made visible
> > +	 * on the task list already.
> > +	 *
> > +	 * Architectures that manage per task kernel pagetables
> > +	 * might use this callback to initialize them after they
> > +	 * are already visible to new updates.
> > +	 *
> > +	 * NOTE: any user-space parts of the PGD are already initialized
> > +	 *       and must not be clobbered.
> > +	 */
> > +	if (p->mm != current->mm)
> > +		arch_pgd_init_late(p->mm, p->mm->pgd);
> > +
> 
> Cosmetic, but imo
> 
> 	if (!(clone_flags & CLONE_VM))
> 		arch_pgd_init_late(...);
> 
> will look better and more consistent.

Indeed - fixed.

Thanks,

	Ingo


* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:38             ` Oleg Nesterov
  2015-06-12 20:53               ` Oleg Nesterov
@ 2015-06-13  7:26               ` Ingo Molnar
  2015-06-13 18:00                 ` Oleg Nesterov
  1 sibling, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  7:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra


* Oleg Nesterov <oleg@redhat.com> wrote:

> > So we could add tsk->mm_leader or so,
> 
> No, no, please do not. Just do something like
> 
> 	for_each_process(p) {
> 
> 		for_each_thread(p, t) {

So far that's what the for_each_process_thread() iterations I added do, right?

> 			if (t->mm) {
> 				do_something(t->mm);
> 				break;
> 			}
> 		}
> 	}
> 
> But either way I don't understand what protects this ->mm. Perhaps this needs 
> find_lock_task_mm().

That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as a 
task can only go away via RCU expiry, and all the iterations I added are 
(supposed to be) RCU-safe.

Thanks,

	Ingo


* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-12 22:48   ` Waiman Long
@ 2015-06-13  7:46     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  7:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Oleg Nesterov, Thomas Gleixner


* Waiman Long <waiman.long@hp.com> wrote:

> >@@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> >  	for (address = start; address <= end; address += PGDIR_SIZE) {
> >  		const pgd_t *pgd_ref = pgd_offset_k(address);
> >-		struct page *page;
> >+		struct task_struct *g, *p;
> >
> >  		/*
> >-		 * When it is called after memory hot remove, pgd_none()
> >-		 * returns true. In this case (removed == 1), we must clear
> >-		 * the PGD entries in the local PGD level page.
> >+		 * When this function is called after memory hot remove,
> >+		 * pgd_none() already returns true, but only the reference
> >+		 * kernel PGD has been cleared, not the process PGDs.
> >+		 *
> >+		 * So clear the affected entries in every process PGD as well:
> >  		 */
> >  		if (pgd_none(*pgd_ref) && !removed)
> >  			continue;
> >
> >  		spin_lock(&pgd_lock);
> >-		list_for_each_entry(page, &pgd_list, lru) {
> >-			pgd_t *pgd;
> >+
> >+		for_each_process_thread(g, p) {
> >+			pgd_t *pgd = p->mm->pgd;
> >  			spinlock_t *pgt_lock;
> >
> >-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> >-			/* the pgt_lock only for Xen */
> >-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> >+			if (!p->mm)
> >+				continue;
> 
> pgd was initialized to p->mm->pgd before the "p->mm" check is done.
> Shouldn't the initialization be moved after that check?

Yes, already found this bug in testing and fixed it - will send out a new series 
with all the feedback so far addressed.

Thanks,

	Ingo


* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-12 23:15   ` Oleg Nesterov
@ 2015-06-13  7:48     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  7:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/11, Ingo Molnar wrote:
> >
> >  void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >  {
> > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> >  	for (address = start; address <= end; address += PGDIR_SIZE) {
> >  		const pgd_t *pgd_ref = pgd_offset_k(address);
> > -		struct page *page;
> > +		struct task_struct *g, *p;
> >
> >  		/*
> > -		 * When it is called after memory hot remove, pgd_none()
> > -		 * returns true. In this case (removed == 1), we must clear
> > -		 * the PGD entries in the local PGD level page.
> > +		 * When this function is called after memory hot remove,
> > +		 * pgd_none() already returns true, but only the reference
> > +		 * kernel PGD has been cleared, not the process PGDs.
> > +		 *
> > +		 * So clear the affected entries in every process PGD as well:
> >  		 */
> >  		if (pgd_none(*pgd_ref) && !removed)
> >  			continue;
> >
> >  		spin_lock(&pgd_lock);
> > -		list_for_each_entry(page, &pgd_list, lru) {
> > -			pgd_t *pgd;
> > +
> > +		for_each_process_thread(g, p) {
> > +			pgd_t *pgd = p->mm->pgd;
> 
> and we use the same pgd later: set_pgd(pgd, *pgd_ref).
> looks like this should be pgd_offset(p->mm, address) ?
> 
> And obviously you need to check ->mm != NULL first?

Yes.

> And perhaps it makes sense to swap "for (address)" and for_each_process() 
> loops...

Yes, but I'd flip around the logic in a later, separate patch.

Note that PGDIR_SIZE is 512 GB here, so the outer loop will most likely execute at 
most twice in practice.

Thanks,

	Ingo


* Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()
  2015-06-12 22:28   ` Oleg Nesterov
@ 2015-06-13  7:51     ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-06-13  7:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> Yes I guess I am totally confused ;)
> 
> On 06/11, Ingo Molnar wrote:
> >
> > @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >  		 *
> >  		 * So clear the affected entries in every process PGD as well:
> >  		 */
> > -		if (pgd_none(*pgd_ref) && !removed)
> > +		if (pgd_none(*pgd_ref))
> 
> But doesn't this change invalidate the comment above?

Indeed - I fixed the comment to now say:

                /* Only sync (potentially) newly added PGD entries: */
                if (pgd_none(*pgd_ref))
                        continue;

Thanks,

	Ingo


* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-13  6:52       ` Ingo Molnar
@ 2015-06-13 17:45         ` Oleg Nesterov
  2015-06-14  8:13           ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-13 17:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/13, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > * Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > >
> > > Afaics, we need to ensure that:
> > >
> > > > +			if (pgd_val(*pgd_src))
> > > > +				WRITE_ONCE(*pgd_dst, *pgd_src);
> > >
> > > either we notice the recent update of this PGD, or (say) the subsequent
> > > sync_global_pgds() can miss the child.
> > >
> > > How can the write barrier help?
> >
> > So the real thing this pairs with is the earlier:
> >
> > 	tsk->mm = mm;
> >
> > plus the linking of the new task in the task list.
> >
> > _that_ write must become visible to others before we do the (conditional) copy
> > ourselves.

Hmm, that write must be visible before we start to _read_ *pgd_src,
afaics.

> > Granted, it happens quite a bit earlier, and the task linking's use of locking
> > is a natural barrier - but since this is lockless I didn't want to leave a
> > silent assumption in.

I agree,

> Ah, there's another detail I forgot. This might handle the fork case, but in
> exec() we have:
>
>         tsk->mm = mm;
>         arch_pgd_init_late(mm);

Yes, this too.

But wmb() can't help. At least we need the full mb() to serialize the
STORE above (or list addition in copy_process) with the LOAD which
reads *pgd_src.

Plus we need another mb() in sync_global_pgds(), say, before the main
for_each_process() loop.


It would be nice to make this simpler and clearer somehow...

Oleg.



* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  7:26               ` Ingo Molnar
@ 2015-06-13 18:00                 ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-13 18:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/13, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > So we could add tsk->mm_leader or so,
> >
> > No, no, please do not. Just do something like
> >
> > 	for_each_process(p) {
> >
> > 		for_each_thread(p, t) {
>
> So far that's what the for_each_process_thread() iterations I added do, right?

Not really,

> > 			if (t->mm) {
> > 				do_something(t->mm);
> > 				break;
                                ^^^^^

Note this "break". We stop the inner loop right after we find a thread "t"
with ->mm != NULL. In the likely case t == p (group leader) so the inner
loop stops on the 1st iteration.


> > But either way I don't understand what protects this ->mm. Perhaps this needs
> > find_lock_task_mm().
>
> That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

Well, in this particular case we are safe. As Boris explained, this is called
from stop_machine(). But sync_global_pgds() is not safe.

> find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as a
> task can only go away via RCU expiry, and all the iterations I added are
> (supposed to be) RCU-safe.

Sure, you can do lock/unlock by hand. But find_lock_task_mm() can simplify
the code because it also checks the subthreads when group_leader->mm == NULL.
You can simply do:

	rcu_read_lock();
	for_each_process(p) {
		t = find_lock_task_mm(p);
		if (!t)
			continue;

		do_something(t->mm);
		task_unlock(t);
	}
	rcu_read_unlock();

Oleg.



* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-13 17:45         ` Oleg Nesterov
@ 2015-06-14  8:13           ` Ingo Molnar
  2015-06-14 20:54             ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2015-06-14  8:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/13, Ingo Molnar wrote:
> >
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > >
> > > > Afaics, we need to ensure that:
> > > >
> > > > > +			if (pgd_val(*pgd_src))
> > > > > +				WRITE_ONCE(*pgd_dst, *pgd_src);
> > > >
> > > > either we notice the recent update of this PGD, or (say) the subsequent
> > > > sync_global_pgds() can miss the child.
> > > >
> > > > How can the write barrier help?
> > >
> > > So the real thing this pairs with is the earlier:
> > >
> > > 	tsk->mm = mm;
> > >
> > > plus the linking of the new task in the task list.
> > >
> > > _that_ write must become visible to others before we do the (conditional) copy
> > > ourselves.
> 
> Hmm, that write must be visible before we start to _read_ *pgd_src,
> afaics.
> 
> > > Granted, it happens quite a bit earlier, and the task linking's use of locking
> > > is a natural barrier - but since this is lockless I didn't want to leave a
> > > silent assumption in.
> 
> I agree,
> 
> > Ah, there's another detail I forgot. This might handle the fork case, but in
> > exec() we have:
> >
> >         tsk->mm = mm;
> >         arch_pgd_init_late(mm);
> 
> Yes, this too.
> 
> But wmb() can't help. At least we need the full mb() to serialize the
> STORE above (or list addition in copy_process) with the LOAD which
> reads *pgd_src.

True.

> Plus we need another mb() in sync_global_pgds(), say, before the main 
> for_each_process() loop.

True.

So since we have a spin_lock() there already, I tried to find the right primitive 
to turn it into a full memory barrier but gave up. Is there a simple primitive for 
that?

Also, since this is x86-specific code, could we rely on the fact that 
spinlock-acquire is a full memory barrier?

> 
> it would be nice to make this more simple/clear somehow...

I'm afraid lockless is rarely simple, but I'm open to suggestions ...

Thanks,

	Ingo


* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-14  8:13           ` Ingo Molnar
@ 2015-06-14 20:54             ` Oleg Nesterov
  2015-06-14 22:10               ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-14 20:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/14, Ingo Molnar wrote:
>
> So since we have a spin_lock() there already,

Yeeeees, I thought about task_lock() or pgd_lock too.

> Also, since this is x86 specific code we could rely on the fact that
> spinlock-acquire is a full memory barrier?

we do not really need the full barrier if we rely on spinlock_t;
we can rely on acquire+release semantics.

Let's forget about exec_mmap(). If we add, say,

	// or unlock_wait() + barriers
	task_lock(current->group_leader);
	task_unlock(current->group_leader);

at the start of arch_pgd_init_late(), we will fix the problems with
fork() even if the pgd_none() check below leaks into the critical section.

We rely on the fact that find_lock_task_mm() does lock/unlock too
and always starts with the group leader.

If sync_global_pgds() takes this lock first, we must see the change
in *PGD after task_unlock(). Actually right after task_lock().

Otherwise, sync_global_pgds() should see the result of the list addition
if it takes this (the same) ->group_leader->alloc_lock after us.

But this is not nice, and exec_mmap() calls arch_pgd_init_late() under
task_lock().


So, unless you are going to remove pgd_lock altogether, perhaps we can
rely on it the same way:

	mb();
	spin_unlock_wait(&pgd_lock);
	rmb();


This avoids the barriers (and comments) on the other side, but I can't say
I really like this...


So I won't argue with 2 mb's on both sides.

Oleg.



* Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-14 20:54             ` Oleg Nesterov
@ 2015-06-14 22:10               ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2015-06-14 22:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mml, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/14, Oleg Nesterov wrote:
>
> So, unless you are going to remove pgd_lock altogether perhaps we can
> rely on it the same way
>
> 	mb();
> 	spin_unlock_wait(&pgd_lock);
> 	rmb();
>
>
> Avoids the barriers (and comments) on another side, but I can't say
> I really like this...
>
>
> So I won't argue with 2 mb's on both sides.

Or we can add

	// A new child created before can miss the PGD updates,
	// but we must see that child on the process list

	read_lock(tasklist_lock);
	read_unlock(tasklist_lock);

	// We can miss a new child forked after read_unlock(),
	// but then its parent must see all PGD updates right
	// after it does write_unlock(tasklist);

	for_each_process(p) {

before main for_each_process() loop in sync_global_pgds().

As for exec_mmap(), we can rely on task_lock(); sync_global_pgds()
takes it too. The corner case is when exec changes the leader, so
exec_mmap()/sync_global_pgds() can take different locks. But in this
case we can rely on de_thread() (which takes tasklist_lock for write)
for the same reason: either sync_global_pgds() will see the new
leader, or the new leader must see the updates.

Oleg.



end of thread, other threads:[~2015-06-14 22:12 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-06-11 14:07 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
2015-06-11 14:07 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
2015-06-12 22:22   ` Oleg Nesterov
2015-06-12 22:48   ` Waiman Long
2015-06-13  7:46     ` Ingo Molnar
2015-06-12 23:15   ` Oleg Nesterov
2015-06-13  7:48     ` Ingo Molnar
2015-06-11 14:07 ` [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
2015-06-11 14:07 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
2015-06-12 22:28   ` Oleg Nesterov
2015-06-13  7:51     ` Ingo Molnar
2015-06-11 14:07 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
2015-06-11 18:23   ` Andrew Morton
2015-06-12  8:16     ` Ingo Molnar
2015-06-12 21:12   ` Oleg Nesterov
2015-06-13  6:54     ` Ingo Molnar
2015-06-11 14:07 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
2015-06-11 20:04   ` Boris Ostrovsky
2015-06-12  8:19     ` Ingo Molnar
2015-06-12 22:50   ` Oleg Nesterov
2015-06-12 22:57     ` Oleg Nesterov
2015-06-13  6:49       ` Ingo Molnar
2015-06-13  6:47     ` Ingo Molnar
2015-06-13  6:52       ` Ingo Molnar
2015-06-13 17:45         ` Oleg Nesterov
2015-06-14  8:13           ` Ingo Molnar
2015-06-14 20:54             ` Oleg Nesterov
2015-06-14 22:10               ` Oleg Nesterov
2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
2015-06-11 20:37   ` Linus Torvalds
2015-06-11 20:41     ` Linus Torvalds
2015-06-12  7:23       ` Ingo Molnar
     [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
2015-06-12  8:04           ` Ingo Molnar
2015-06-12 20:38             ` Oleg Nesterov
2015-06-12 20:53               ` Oleg Nesterov
2015-06-12 22:30                 ` Boris Ostrovsky
2015-06-12 23:36                   ` Oleg Nesterov
2015-06-13  7:26               ` Ingo Molnar
2015-06-13 18:00                 ` Oleg Nesterov
2015-06-11 14:07 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
2015-06-11 15:04   ` Andy Lutomirski
2015-06-11 15:49     ` Ingo Molnar
2015-06-11 16:10       ` Andy Lutomirski
2015-06-11 14:07 ` [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
2015-06-11 14:07 ` [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
2015-06-11 14:07 ` [PATCH 11/12] x86/mm: Remove pgd_list leftovers Ingo Molnar
2015-06-11 14:07 ` [PATCH 12/12] x86/mm: Simplify pgd_alloc() Ingo Molnar
2015-06-11 14:13 ` [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-06-11 14:13   ` Ingo Molnar
2015-06-11 16:46 ` H. Peter Anvin
