All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/12, v2] x86/mm: Implement lockless pgd_alloc()/pgd_free()
@ 2015-06-13  9:49 ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The purpose of this series is:

  Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
  moving pgd_lock on a separate cacheline to eliminate false sharing and to reduce
  some of the lock bouncing overhead.

  I think we can do much better: this series eliminates the pgd_list and makes
  pgd_alloc()/pgd_free() lockless.

This is the -v2 submission that addresses all feedback received so far, it
fixes bugs and cleans up details. There's a v1->v2 interdiff attached
further below: most of the changes relate to locking the task before looking
at tsk->mm.

The series is also available in -tip:master for easy testing:

  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

It's not applied to any visible topic tree yet. (Its sha1 is bc40a788fc63
if you want to test it without pending non-x86 bits in -tip.)

Thanks,

    Ingo

====================>
Ingo Molnar (12):
  x86/mm/pat: Don't free PGD entries on memory unmap
  x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm/hotplug: Simplify sync_global_pgds()
  mm: Introduce arch_pgd_init_late()
  x86/mm: Enable and use the arch_pgd_init_late() method
  x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  x86/mm: Remove pgd_list use from vmalloc_sync_all()
  x86/mm/pat/32: Remove pgd_list use from the PAT code
  x86/mm: Make pgd_alloc()/pgd_free() lockless
  x86/mm: Remove pgd_list leftovers
  x86/mm: Simplify pgd_alloc()

 arch/Kconfig                      |   9 ++++
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/pgtable.h    |   3 --
 arch/x86/include/asm/pgtable_64.h |   3 +-
 arch/x86/mm/fault.c               |  31 +++++++++-----
 arch/x86/mm/init_64.c             |  80 ++++++++++++-----------------------
 arch/x86/mm/pageattr.c            |  41 +++++++++---------
 arch/x86/mm/pgtable.c             | 131 ++++++++++++++++++++++++++++++----------------------------
 arch/x86/xen/mmu.c                |  51 +++++++++++++++++------
 fs/exec.c                         |   3 ++
 include/linux/mm.h                |   6 +++
 kernel/fork.c                     |  16 +++++++
 12 files changed, 209 insertions(+), 166 deletions(-)

-v1 => -v2 interdiff:
====

 arch/x86/mm/fault.c    | 17 +++++++++++------
 arch/x86/mm/init_64.c  | 23 ++++++++++++-----------
 arch/x86/mm/pageattr.c | 27 ++++++++++++++-------------
 arch/x86/mm/pgtable.c  | 22 ++++++++++++----------
 arch/x86/xen/mmu.c     | 47 +++++++++++++++++++++++++----------------------
 fs/exec.c              |  2 +-
 include/linux/mm.h     |  4 ++--
 kernel/fork.c          |  4 ++--
 8 files changed, 79 insertions(+), 67 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bebdd97f888b..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -237,28 +237,33 @@ void vmalloc_sync_all(void)
 
 		struct task_struct *g, *p;
 
-		rcu_read_lock();
-		spin_lock(&pgd_lock);
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
 			pmd_t *pmd_ret;
 
-			if (!p->mm)
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
 				continue;
+			}
 
 			/* The pgt_lock is only used on Xen: */
-			pgt_lock = &p->mm->page_table_lock;
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
-			pmd_ret = vmalloc_sync_one(p->mm->pgd, address);
+			pmd_ret = vmalloc_sync_one(mm->pgd, address);
 			spin_unlock(pgt_lock);
 
+			task_unlock(p);
+
 			if (!pmd_ret)
 				break;
 		}
 
 		spin_unlock(&pgd_lock);
-		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 730560c4873e..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -171,28 +171,28 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		const pgd_t *pgd_ref = pgd_offset_k(address);
 		struct task_struct *g, *p;
 
-		/*
-		 * When this function is called after memory hot remove,
-		 * pgd_none() already returns true, but only the reference
-		 * kernel PGD has been cleared, not the process PGDs.
-		 *
-		 * So clear the affected entries in every process PGD as well:
-		 */
+		/* Only sync (potentially) newly added PGD entries: */
 		if (pgd_none(*pgd_ref))
 			continue;
 
-		spin_lock(&pgd_lock);
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
-			if (!p->mm)
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
 				continue;
-			pgd = p->mm->pgd;
+			}
+
+			pgd = mm->pgd;
 
 			/* The pgt_lock is only used by Xen: */
-			pgt_lock = &p->mm->page_table_lock;
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +202,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
+			task_unlock(p);
 		}
 		spin_unlock(&pgd_lock);
 	}
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 93c134fdb398..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -440,29 +440,30 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	if (!SHARED_KERNEL_PMD) {
 		struct task_struct *g, *p;
 
-		rcu_read_lock();
+		/* We are holding pgd_lock, which implies rcu_read_lock(): */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			if (!p->mm)
-				continue;
+			task_lock(p);
+			mm = p->mm;
+			if (mm) {
+				pgt_lock = &mm->page_table_lock;
+				spin_lock(pgt_lock);
 
-			pgt_lock = &p->mm->page_table_lock;
-			spin_lock(pgt_lock);
+				pgd = mm->pgd + pgd_index(address);
+				pud = pud_offset(pgd, address);
+				pmd = pmd_offset(pud, address);
+				set_pte_atomic((pte_t *)pmd, pte);
 
-			pgd = p->mm->pgd + pgd_index(address);
-			pud = pud_offset(pgd, address);
-			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
-
-			spin_unlock(pgt_lock);
+				spin_unlock(pgt_lock);
+			}
+			task_unlock(p);
 		}
-
-		rcu_read_unlock();
 	}
 #endif
 }
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b1bd35f452ef..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -348,15 +348,17 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
  * at the highest, PGD level. Thus any other task extending it
  * will first update the reference PGD, then modify the task PGDs.
  */
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
+void arch_pgd_init_late(struct mm_struct *mm)
 {
 	/*
-	 * This is called after a new MM has been made visible
-	 * in fork() or exec().
+	 * This function is called after a new MM has been made visible
+	 * in fork() or exec() via:
+	 *
+	 *   tsk->mm = mm;
 	 *
 	 * This barrier makes sure the MM is visible to new RCU
-	 * walkers before we initialize it, so that we don't miss
-	 * updates:
+	 * walkers before we initialize the pagetables below, so that
+	 * we don't miss updates:
 	 */
 	smp_wmb();
 
@@ -365,12 +367,12 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
 	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
 	 * references from swapper_pg_dir:
 	 */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
+	if ( CONFIG_PGTABLE_LEVELS == 2 ||
 	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
+	     CONFIG_PGTABLE_LEVELS == 4) {
 
 		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
-		pgd_t *pgd_dst =            pgd + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =        mm->pgd + KERNEL_PGD_BOUNDARY;
 		int i;
 
 		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
@@ -387,8 +389,8 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
 			 * it won't change from under us. Parallel updates (new
 			 * allocations) will modify our (already visible) PGD:
 			 */
-			if (pgd_val(*pgd_src))
-				WRITE_ONCE(*pgd_dst, *pgd_src);
+			if (!pgd_none(*pgd_src))
+				set_pgd(pgd_dst, *pgd_src);
 		}
 	}
 }
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 87a8354435f8..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -855,27 +855,28 @@ void xen_mm_pin_all(void)
 {
 	struct task_struct *g, *p;
 
-	rcu_read_lock();
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
 		struct page *page;
 		pgd_t *pgd;
 
-		if (!p->mm)
-			continue;
-
-		pgd = p->mm->pgd;
-		page = virt_to_page(pgd);
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
 
-		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, pgd);
-			SetPageSavePinned(page);
+			if (!PagePinned(page)) {
+				__xen_pgd_pin(&init_mm, pgd);
+				SetPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
-	rcu_read_unlock();
 }
 
 /*
@@ -980,24 +981,26 @@ void xen_mm_unpin_all(void)
 {
 	struct task_struct *g, *p;
 
-	rcu_read_lock();
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
 		struct page *page;
 		pgd_t *pgd;
 
-		if (!p->mm)
-			continue;
-
-		pgd = p->mm->pgd;
-		page = virt_to_page(pgd);
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
 
-		if (PageSavePinned(page)) {
-			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, pgd);
-			ClearPageSavePinned(page);
+			if (PageSavePinned(page)) {
+				BUG_ON(!PagePinned(page));
+				__xen_pgd_unpin(&init_mm, pgd);
+				ClearPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
diff --git a/fs/exec.c b/fs/exec.c
index c1d213c64fda..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -862,7 +862,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 
 	tsk->mm = mm;
-	arch_pgd_init_late(mm, mm->pgd);
+	arch_pgd_init_late(mm);
 
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35d887d2b038..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,9 +1135,9 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
 #ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd);
+void arch_pgd_init_late(struct mm_struct *mm);
 #else
-static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { }
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
 #endif
 
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
diff --git a/kernel/fork.c b/kernel/fork.c
index 1f83ceca6c6c..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1605,8 +1605,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * NOTE: any user-space parts of the PGD are already initialized
 	 *       and must not be clobbered.
 	 */
-	if (p->mm != current->mm)
-		arch_pgd_init_late(p->mm, p->mm->pgd);
+	if (!(clone_flags & CLONE_VM))
+		arch_pgd_init_late(p->mm);
 
 	proc_fork_connector(p);
 	cgroup_post_fork(p);

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 00/12, v2] x86/mm: Implement lockless pgd_alloc()/pgd_free()
@ 2015-06-13  9:49 ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The purpose of this series is:

  Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed
  moving pgd_lock on a separate cacheline to eliminate false sharing and to reduce
  some of the lock bouncing overhead.

  I think we can do much better: this series eliminates the pgd_list and makes
  pgd_alloc()/pgd_free() lockless.

This is the -v2 submission that addresses all feedback received so far, it
fixes bugs and cleans up details. There's a v1->v2 interdiff attached
further below: most of the changes relate to locking the task before looking
at tsk->mm.

The series is also available in -tip:master for easy testing:

  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

It's not applied to any visible topic tree yet. (Its sha1 is bc40a788fc63
if you want to test it without pending non-x86 bits in -tip.)

Thanks,

    Ingo

====================>
Ingo Molnar (12):
  x86/mm/pat: Don't free PGD entries on memory unmap
  x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm/hotplug: Simplify sync_global_pgds()
  mm: Introduce arch_pgd_init_late()
  x86/mm: Enable and use the arch_pgd_init_late() method
  x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  x86/mm: Remove pgd_list use from vmalloc_sync_all()
  x86/mm/pat/32: Remove pgd_list use from the PAT code
  x86/mm: Make pgd_alloc()/pgd_free() lockless
  x86/mm: Remove pgd_list leftovers
  x86/mm: Simplify pgd_alloc()

 arch/Kconfig                      |   9 ++++
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/pgtable.h    |   3 --
 arch/x86/include/asm/pgtable_64.h |   3 +-
 arch/x86/mm/fault.c               |  31 +++++++++-----
 arch/x86/mm/init_64.c             |  80 ++++++++++++-----------------------
 arch/x86/mm/pageattr.c            |  41 +++++++++---------
 arch/x86/mm/pgtable.c             | 131 ++++++++++++++++++++++++++++++----------------------------
 arch/x86/xen/mmu.c                |  51 +++++++++++++++++------
 fs/exec.c                         |   3 ++
 include/linux/mm.h                |   6 +++
 kernel/fork.c                     |  16 +++++++
 12 files changed, 209 insertions(+), 166 deletions(-)

-v1 => -v2 interdiff:
====

 arch/x86/mm/fault.c    | 17 +++++++++++------
 arch/x86/mm/init_64.c  | 23 ++++++++++++-----------
 arch/x86/mm/pageattr.c | 27 ++++++++++++++-------------
 arch/x86/mm/pgtable.c  | 22 ++++++++++++----------
 arch/x86/xen/mmu.c     | 47 +++++++++++++++++++++++++----------------------
 fs/exec.c              |  2 +-
 include/linux/mm.h     |  4 ++--
 kernel/fork.c          |  4 ++--
 8 files changed, 79 insertions(+), 67 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bebdd97f888b..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -237,28 +237,33 @@ void vmalloc_sync_all(void)
 
 		struct task_struct *g, *p;
 
-		rcu_read_lock();
-		spin_lock(&pgd_lock);
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
 			pmd_t *pmd_ret;
 
-			if (!p->mm)
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
 				continue;
+			}
 
 			/* The pgt_lock is only used on Xen: */
-			pgt_lock = &p->mm->page_table_lock;
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
-			pmd_ret = vmalloc_sync_one(p->mm->pgd, address);
+			pmd_ret = vmalloc_sync_one(mm->pgd, address);
 			spin_unlock(pgt_lock);
 
+			task_unlock(p);
+
 			if (!pmd_ret)
 				break;
 		}
 
 		spin_unlock(&pgd_lock);
-		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 730560c4873e..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -171,28 +171,28 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		const pgd_t *pgd_ref = pgd_offset_k(address);
 		struct task_struct *g, *p;
 
-		/*
-		 * When this function is called after memory hot remove,
-		 * pgd_none() already returns true, but only the reference
-		 * kernel PGD has been cleared, not the process PGDs.
-		 *
-		 * So clear the affected entries in every process PGD as well:
-		 */
+		/* Only sync (potentially) newly added PGD entries: */
 		if (pgd_none(*pgd_ref))
 			continue;
 
-		spin_lock(&pgd_lock);
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
-			if (!p->mm)
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
 				continue;
-			pgd = p->mm->pgd;
+			}
+
+			pgd = mm->pgd;
 
 			/* The pgt_lock is only used by Xen: */
-			pgt_lock = &p->mm->page_table_lock;
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +202,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
+			task_unlock(p);
 		}
 		spin_unlock(&pgd_lock);
 	}
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 93c134fdb398..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -440,29 +440,30 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	if (!SHARED_KERNEL_PMD) {
 		struct task_struct *g, *p;
 
-		rcu_read_lock();
+		/* We are holding pgd_lock, which implies rcu_read_lock(): */
 
 		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			if (!p->mm)
-				continue;
+			task_lock(p);
+			mm = p->mm;
+			if (mm) {
+				pgt_lock = &mm->page_table_lock;
+				spin_lock(pgt_lock);
 
-			pgt_lock = &p->mm->page_table_lock;
-			spin_lock(pgt_lock);
+				pgd = mm->pgd + pgd_index(address);
+				pud = pud_offset(pgd, address);
+				pmd = pmd_offset(pud, address);
+				set_pte_atomic((pte_t *)pmd, pte);
 
-			pgd = p->mm->pgd + pgd_index(address);
-			pud = pud_offset(pgd, address);
-			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
-
-			spin_unlock(pgt_lock);
+				spin_unlock(pgt_lock);
+			}
+			task_unlock(p);
 		}
-
-		rcu_read_unlock();
 	}
 #endif
 }
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b1bd35f452ef..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -348,15 +348,17 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
  * at the highest, PGD level. Thus any other task extending it
  * will first update the reference PGD, then modify the task PGDs.
  */
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
+void arch_pgd_init_late(struct mm_struct *mm)
 {
 	/*
-	 * This is called after a new MM has been made visible
-	 * in fork() or exec().
+	 * This function is called after a new MM has been made visible
+	 * in fork() or exec() via:
+	 *
+	 *   tsk->mm = mm;
 	 *
 	 * This barrier makes sure the MM is visible to new RCU
-	 * walkers before we initialize it, so that we don't miss
-	 * updates:
+	 * walkers before we initialize the pagetables below, so that
+	 * we don't miss updates:
 	 */
 	smp_wmb();
 
@@ -365,12 +367,12 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
 	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
 	 * references from swapper_pg_dir:
 	 */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
+	if ( CONFIG_PGTABLE_LEVELS == 2 ||
 	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
+	     CONFIG_PGTABLE_LEVELS == 4) {
 
 		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
-		pgd_t *pgd_dst =            pgd + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =        mm->pgd + KERNEL_PGD_BOUNDARY;
 		int i;
 
 		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
@@ -387,8 +389,8 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd)
 			 * it won't change from under us. Parallel updates (new
 			 * allocations) will modify our (already visible) PGD:
 			 */
-			if (pgd_val(*pgd_src))
-				WRITE_ONCE(*pgd_dst, *pgd_src);
+			if (!pgd_none(*pgd_src))
+				set_pgd(pgd_dst, *pgd_src);
 		}
 	}
 }
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 87a8354435f8..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -855,27 +855,28 @@ void xen_mm_pin_all(void)
 {
 	struct task_struct *g, *p;
 
-	rcu_read_lock();
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
 		struct page *page;
 		pgd_t *pgd;
 
-		if (!p->mm)
-			continue;
-
-		pgd = p->mm->pgd;
-		page = virt_to_page(pgd);
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
 
-		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, pgd);
-			SetPageSavePinned(page);
+			if (!PagePinned(page)) {
+				__xen_pgd_pin(&init_mm, pgd);
+				SetPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
-	rcu_read_unlock();
 }
 
 /*
@@ -980,24 +981,26 @@ void xen_mm_unpin_all(void)
 {
 	struct task_struct *g, *p;
 
-	rcu_read_lock();
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
 	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
 		struct page *page;
 		pgd_t *pgd;
 
-		if (!p->mm)
-			continue;
-
-		pgd = p->mm->pgd;
-		page = virt_to_page(pgd);
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
 
-		if (PageSavePinned(page)) {
-			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, pgd);
-			ClearPageSavePinned(page);
+			if (PageSavePinned(page)) {
+				BUG_ON(!PagePinned(page));
+				__xen_pgd_unpin(&init_mm, pgd);
+				ClearPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
diff --git a/fs/exec.c b/fs/exec.c
index c1d213c64fda..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -862,7 +862,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 
 	tsk->mm = mm;
-	arch_pgd_init_late(mm, mm->pgd);
+	arch_pgd_init_late(mm);
 
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35d887d2b038..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,9 +1135,9 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
 #ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
-void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd);
+void arch_pgd_init_late(struct mm_struct *mm);
 #else
-static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { }
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
 #endif
 
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
diff --git a/kernel/fork.c b/kernel/fork.c
index 1f83ceca6c6c..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1605,8 +1605,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * NOTE: any user-space parts of the PGD are already initialized
 	 *       and must not be clobbered.
 	 */
-	if (p->mm != current->mm)
-		arch_pgd_init_late(p->mm, p->mm->pgd);
+	if (!(clone_flags & CLONE_VM))
+		arch_pgd_init_late(p->mm);
 
 	proc_fork_connector(p);
 	cgroup_post_fork(p);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when we unmap kernel memory range then we also check whether a PUD
is completely clear, and free it and clear the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory mapped.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 70d221fe2eb4..997fc97e9072 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
 	pgd_t *pgd_entry = root + pgd_index(addr);
 
 	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
 }
 
 static int alloc_pte_page(pmd_t *pmd)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when we unmap kernel memory range then we also check whether a PUD
is completely clear, and free it and clear the PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory mapped.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 70d221fe2eb4..997fc97e9072 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
 	pgd_t *pgd_entry = root + pgd_index(addr);
 
 	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
 }
 
 static int alloc_pte_page(pmd_t *pmd)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - which list closely tracks task
creation/destruction via fork/clone.

But we want to remove this list, so that it does not have to be
maintained from fork()/exit(), so convert the memory hotplug code
to use the task list to iterate over all pgds in the system.

Also improve the comments a bit, to make this function easier
to understand.

Only lightly tested, as I don't have a memory hotplug setup.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..527d5d4d020c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
  */
 void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 {
@@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 
 	for (address = start; address <= end; address += PGDIR_SIZE) {
 		const pgd_t *pgd_ref = pgd_offset_k(address);
-		struct page *page;
+		struct task_struct *g, *p;
 
 		/*
-		 * When it is called after memory hot remove, pgd_none()
-		 * returns true. In this case (removed == 1), we must clear
-		 * the PGD entries in the local PGD level page.
+		 * When this function is called after memory hot remove,
+		 * pgd_none() already returns true, but only the reference
+		 * kernel PGD has been cleared, not the process PGDs.
+		 *
+		 * So clear the affected entries in every process PGD as well:
 		 */
 		if (pgd_none(*pgd_ref) && !removed)
 			continue;
 
-		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
+				continue;
+			}
+
+			pgd = mm->pgd;
+
+			/* The pgt_lock is only used by Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
-				BUG_ON(pgd_page_vaddr(*pgd)
-				       != pgd_page_vaddr(*pgd_ref));
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
 			if (removed) {
 				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +213,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			}
 
 			spin_unlock(pgt_lock);
+			task_unlock(p);
 		}
 		spin_unlock(&pgd_lock);
 	}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The memory hotplug code uses sync_global_pgds() to synchronize updates
to the global (&init_mm) kernel PGD and the task PGDs. It does this
by iterating over the pgd_list - which list closely tracks task
creation/destruction via fork/clone.

But we want to remove this list, so that it does not have to be
maintained from fork()/exit(), so convert the memory hotplug code
to use the task list to iterate over all pgds in the system.

Also improve the comments a bit, to make this function easier
to understand.

Only lightly tested, as I don't have a memory hotplug setup.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..527d5d4d020c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
+ * When memory was added/removed make sure all the process MMs have
+ * matching PGD entries in the local PGD level page as well.
  */
 void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 {
@@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 
 	for (address = start; address <= end; address += PGDIR_SIZE) {
 		const pgd_t *pgd_ref = pgd_offset_k(address);
-		struct page *page;
+		struct task_struct *g, *p;
 
 		/*
-		 * When it is called after memory hot remove, pgd_none()
-		 * returns true. In this case (removed == 1), we must clear
-		 * the PGD entries in the local PGD level page.
+		 * When this function is called after memory hot remove,
+		 * pgd_none() already returns true, but only the reference
+		 * kernel PGD has been cleared, not the process PGDs.
+		 *
+		 * So clear the affected entries in every process PGD as well:
 		 */
 		if (pgd_none(*pgd_ref) && !removed)
 			continue;
 
-		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			pgd_t *pgd;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
+				continue;
+			}
+
+			pgd = mm->pgd;
+
+			/* The pgt_lock is only used by Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
 
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
-				BUG_ON(pgd_page_vaddr(*pgd)
-				       != pgd_page_vaddr(*pgd_ref));
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
 			if (removed) {
 				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
@@ -202,6 +213,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			}
 
 			spin_unlock(pgt_lock);
+			task_unlock(p);
 		}
 		spin_unlock(&pgd_lock);
 	}
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 527d5d4d020c..7a988dbad240 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -778,27 +778,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -990,7 +969,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -1001,13 +979,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 527d5d4d020c..7a988dbad240 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -778,27 +778,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -990,7 +969,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -1001,13 +979,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Now that the memory hotplug code does not remove PGD entries anymore,
the only users of sync_global_pgds() use it after extending the
PGD.

So remove the 'removed' parameter and simplify the call sites.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h |  3 +--
 arch/x86/mm/fault.c               |  2 +-
 arch/x86/mm/init_64.c             | 27 ++++++++-------------------
 3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
 	native_set_pgd(pgd, native_make_pgd(0));
 }
 
-extern void sync_global_pgds(unsigned long start, unsigned long end,
-			     int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);
 
 /*
  * Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53bac3a7..50342825f221 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address)
 
 void vmalloc_sync_all(void)
 {
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
 }
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7a988dbad240..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
  * matching PGD entries in the local PGD level page as well.
  */
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
 
@@ -171,14 +171,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 		const pgd_t *pgd_ref = pgd_offset_k(address);
 		struct task_struct *g, *p;
 
-		/*
-		 * When this function is called after memory hot remove,
-		 * pgd_none() already returns true, but only the reference
-		 * kernel PGD has been cleared, not the process PGDs.
-		 *
-		 * So clear the affected entries in every process PGD as well:
-		 */
-		if (pgd_none(*pgd_ref) && !removed)
+		/* Only sync (potentially) newly added PGD entries: */
+		if (pgd_none(*pgd_ref))
 			continue;
 
 		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
@@ -204,13 +198,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
 				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
-			if (removed) {
-				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
-					pgd_clear(pgd);
-			} else {
-				if (pgd_none(*pgd))
-					set_pgd(pgd, *pgd_ref);
-			}
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
 			task_unlock(p);
@@ -644,7 +633,7 @@ kernel_physical_mapping_init(unsigned long start,
 	}
 
 	if (pgd_changed)
-		sync_global_pgds(addr, end - 1, 0);
+		sync_global_pgds(addr, end - 1);
 
 	__flush_tlb_all();
 
@@ -1284,7 +1273,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 	else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
-		sync_global_pgds(start, end - 1, 0);
+		sync_global_pgds(start, end - 1);
 	return err;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds()
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Now that the memory hotplug code does not remove PGD entries anymore,
the only users of sync_global_pgds() use it after extending the
PGD.

So remove the 'removed' parameter and simplify the call sites.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h |  3 +--
 arch/x86/mm/fault.c               |  2 +-
 arch/x86/mm/init_64.c             | 27 ++++++++-------------------
 3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee781114d34..f405fc3bb719 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
 	native_set_pgd(pgd, native_make_pgd(0));
 }
 
-extern void sync_global_pgds(unsigned long start, unsigned long end,
-			     int removed);
+extern void sync_global_pgds(unsigned long start, unsigned long end);
 
 /*
  * Conversion functions: convert a page and protection to a page entry,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53bac3a7..50342825f221 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address)
 
 void vmalloc_sync_all(void)
 {
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
+	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
 }
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7a988dbad240..dcb2f45caf0e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str)
 __setup("noexec32=", nonx32_setup);
 
 /*
- * When memory was added/removed make sure all the process MMs have
+ * When memory was added make sure all the process MMs have
  * matching PGD entries in the local PGD level page as well.
  */
-void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
 
@@ -171,14 +171,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 		const pgd_t *pgd_ref = pgd_offset_k(address);
 		struct task_struct *g, *p;
 
-		/*
-		 * When this function is called after memory hot remove,
-		 * pgd_none() already returns true, but only the reference
-		 * kernel PGD has been cleared, not the process PGDs.
-		 *
-		 * So clear the affected entries in every process PGD as well:
-		 */
-		if (pgd_none(*pgd_ref) && !removed)
+		/* Only sync (potentially) newly added PGD entries: */
+		if (pgd_none(*pgd_ref))
 			continue;
 
 		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
@@ -204,13 +198,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
 			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
 				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 
-			if (removed) {
-				if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
-					pgd_clear(pgd);
-			} else {
-				if (pgd_none(*pgd))
-					set_pgd(pgd, *pgd_ref);
-			}
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
 
 			spin_unlock(pgt_lock);
 			task_unlock(p);
@@ -644,7 +633,7 @@ kernel_physical_mapping_init(unsigned long start,
 	}
 
 	if (pgd_changed)
-		sync_global_pgds(addr, end - 1, 0);
+		sync_global_pgds(addr, end - 1);
 
 	__flush_tlb_all();
 
@@ -1284,7 +1273,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 	else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
-		sync_global_pgds(start, end - 1, 0);
+		sync_global_pgds(start, end - 1);
 	return err;
 }
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 05/12] mm: Introduce arch_pgd_init_late()
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().

The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.

Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/Kconfig       |  9 +++++++++
 fs/exec.c          |  3 +++
 include/linux/mm.h |  6 ++++++
 kernel/fork.c      | 16 ++++++++++++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..a8e866cd4247 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -491,6 +491,15 @@ config PGTABLE_LEVELS
 	int
 	default 2
 
+config ARCH_HAS_PGD_INIT_LATE
+	bool
+	help
+	  Architectures that want a late PGD initialization can define
+	  the arch_pgd_init_late() callback and it will be called
+	  by the generic new task (fork()) code after a new task has
+	  been made visible on the task list, but before it has been
+	  first scheduled.
+
 config ARCH_HAS_ELF_RANDOMIZE
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
+
 	tsk->mm = mm;
+	arch_pgd_init_late(mm);
+
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
+#endif
+
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * If we have a new PGD then initialize it:
+	 *
+	 * This method is called after a task has been made visible
+	 * on the task list already.
+	 *
+	 * Architectures that manage per task kernel pagetables
+	 * might use this callback to initialize them after they
+	 * are already visible to new updates.
+	 *
+	 * NOTE: any user-space parts of the PGD are already initialized
+	 *       and must not be clobbered.
+	 */
+	if (!(clone_flags & CLONE_VM))
+		arch_pgd_init_late(p->mm);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 05/12] mm: Introduce arch_pgd_init_late()
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Add a late PGD init callback to places that allocate a new MM
with a new PGD: copy_process() and exec().

The purpose of this callback is to allow architectures to implement
lockless initialization of task PGDs, to remove the scalability
limit of pgd_list/pgd_lock.

Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE
Kconfig flag. There's zero overhead on architectures that are not using it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/Kconfig       |  9 +++++++++
 fs/exec.c          |  3 +++
 include/linux/mm.h |  6 ++++++
 kernel/fork.c      | 16 ++++++++++++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..a8e866cd4247 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -491,6 +491,15 @@ config PGTABLE_LEVELS
 	int
 	default 2
 
+config ARCH_HAS_PGD_INIT_LATE
+	bool
+	help
+	  Architectures that want a late PGD initialization can define
+	  the arch_pgd_init_late() callback and it will be called
+	  by the generic new task (fork()) code after a new task has
+	  been made visible on the task list, but before it has been
+	  first scheduled.
+
 config ARCH_HAS_ELF_RANDOMIZE
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..4ce1383d5bba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
+
 	tsk->mm = mm;
+	arch_pgd_init_late(mm);
+
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..a3edc839e431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address,
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
+#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE
+void arch_pgd_init_late(struct mm_struct *mm);
+#else
+static inline void arch_pgd_init_late(struct mm_struct *mm) { }
+#endif
+
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..cfa84971fb52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * If we have a new PGD then initialize it:
+	 *
+	 * This method is called after a task has been made visible
+	 * on the task list already.
+	 *
+	 * Architectures that manage per task kernel pagetables
+	 * might use this callback to initialize them after they
+	 * are already visible to new updates.
+	 *
+	 * NOTE: any user-space parts of the PGD are already initialized
+	 *       and must not be clobbered.
+	 */
+	if (!(clone_flags & CLONE_VM))
+		arch_pgd_init_late(p->mm);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Prepare for lockless PGD init: enable the arch_pgd_init_late() callback
and add a 'careful' implementation of PGD init to it: only copy over
non-zero entries.

Since PGD entries only ever get added, this method catches any updates
to swapper_pg_dir[] that might have occurred between early PGD init
and late PGD init.

Note that this only matters for code that does not use the pgd_list but
the task list to find all PGDs in the system.

Subsequent patches will convert pgd_list users to task-list iterations.

[ This adds extra overhead in that we do the PGD initialization for a
  second time - a later patch will simplify this, once we don't have
  old pgd_list users. ]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig      |  1 +
 arch/x86/mm/pgtable.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7e39f9b22705..15c19ce149f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_HAS_PGD_INIT_LATE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..7a561b7cc01c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,65 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	return NULL;
 }
 
+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
+ * PGD. The reference PGD (pgtable_init) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm)
+{
+	/*
+	 * This function is called after a new MM has been made visible
+	 * in fork() or exec() via:
+	 *
+	 *   tsk->mm = mm;
+	 *
+	 * This barrier makes sure the MM is visible to new RCU
+	 * walkers before we initialize the pagetables below, so that
+	 * we don't miss updates:
+	 */
+	smp_wmb();
+
+	/*
+	 * If the pgd points to a shared pagetable level (either the
+	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
+	 * references from swapper_pg_dir:
+	 */
+	if ( CONFIG_PGTABLE_LEVELS == 2 ||
+	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+	     CONFIG_PGTABLE_LEVELS == 4) {
+
+		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =        mm->pgd + KERNEL_PGD_BOUNDARY;
+		int i;
+
+		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+			/*
+			 * This is lock-less, so it can race with PGD updates
+			 * coming from vmalloc() or CPA methods, but it's safe,
+			 * because:
+			 *
+			 * 1) this PGD is not in use yet, we have still not
+			 *    scheduled this task.
+			 * 2) we only ever extend PGD entries
+			 *
+			 * So if we observe a non-zero PGD entry we can copy it,
+			 * it won't change from under us. Parallel updates (new
+			 * allocations) will modify our (already visible) PGD:
+			 */
+			if (!pgd_none(*pgd_src))
+				set_pgd(pgd_dst, *pgd_src);
+		}
+	}
+}
+
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Prepare for lockless PGD init: enable the arch_pgd_init_late() callback
and add a 'careful' implementation of PGD init to it: only copy over
non-zero entries.

Since PGD entries only ever get added, this method catches any updates
to swapper_pg_dir[] that might have occurred between early PGD init
and late PGD init.

Note that this only matters for code that does not use the pgd_list but
the task list to find all PGDs in the system.

Subsequent patches will convert pgd_list users to task-list iterations.

[ This adds extra overhead in that we do the PGD initialization for a
  second time - a later patch will simplify this, once we don't have
  old pgd_list users. ]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig      |  1 +
 arch/x86/mm/pgtable.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7e39f9b22705..15c19ce149f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_HAS_PGD_INIT_LATE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..7a561b7cc01c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -391,6 +391,65 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	return NULL;
 }
 
+/*
+ * Initialize the kernel portion of the PGD.
+ *
+ * This is done separately, because pgd_alloc() happens when
+ * the task is not on the task list yet - and PGD updates
+ * happen by walking the task list.
+ *
+ * No locking is needed here, as we just copy over the reference
+ * PGD. The reference PGD (pgtable_init) is only ever expanded
+ * at the highest, PGD level. Thus any other task extending it
+ * will first update the reference PGD, then modify the task PGDs.
+ */
+void arch_pgd_init_late(struct mm_struct *mm)
+{
+	/*
+	 * This function is called after a new MM has been made visible
+	 * in fork() or exec() via:
+	 *
+	 *   tsk->mm = mm;
+	 *
+	 * This barrier makes sure the MM is visible to new RCU
+	 * walkers before we initialize the pagetables below, so that
+	 * we don't miss updates:
+	 */
+	smp_wmb();
+
+	/*
+	 * If the pgd points to a shared pagetable level (either the
+	 * ptes in non-PAE, or shared PMD in PAE), then just copy the
+	 * references from swapper_pg_dir:
+	 */
+	if ( CONFIG_PGTABLE_LEVELS == 2 ||
+	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
+	     CONFIG_PGTABLE_LEVELS == 4) {
+
+		pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY;
+		pgd_t *pgd_dst =        mm->pgd + KERNEL_PGD_BOUNDARY;
+		int i;
+
+		for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) {
+			/*
+			 * This is lock-less, so it can race with PGD updates
+			 * coming from vmalloc() or CPA methods, but it's safe,
+			 * because:
+			 *
+			 * 1) this PGD is not in use yet, we have still not
+			 *    scheduled this task.
+			 * 2) we only ever extend PGD entries
+			 *
+			 * So if we observe a non-zero PGD entry we can copy it,
+			 * it won't change from under us. Parallel updates (new
+			 * allocations) will modify our (already visible) PGD:
+			 */
+			if (!pgd_none(*pgd_src))
+				set_pgd(pgd_dst, *pgd_src);
+		}
+	}
+}
+
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen this is was only build tested.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
  */
 void xen_mm_pin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
-	list_for_each_entry(page, &pgd_list, lru) {
-		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
-			SetPageSavePinned(page);
+	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
+
+			if (!PagePinned(page)) {
+				__xen_pgd_pin(&init_mm, pgd);
+				SetPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
@@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
  */
 void xen_mm_unpin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
-	list_for_each_entry(page, &pgd_list, lru) {
-		if (PageSavePinned(page)) {
-			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
-			ClearPageSavePinned(page);
+	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
+
+			if (PageSavePinned(page)) {
+				BUG_ON(!PagePinned(page));
+				__xen_pgd_unpin(&init_mm, pgd);
+				ClearPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen this is was only build tested.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..70a3df5b0b54 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
  */
 void xen_mm_pin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
-	list_for_each_entry(page, &pgd_list, lru) {
-		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
-			SetPageSavePinned(page);
+	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
+
+			if (!PagePinned(page)) {
+				__xen_pgd_pin(&init_mm, pgd);
+				SetPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
@@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
  */
 void xen_mm_unpin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
-	spin_lock(&pgd_lock);
+	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
 
-	list_for_each_entry(page, &pgd_list, lru) {
-		if (PageSavePinned(page)) {
-			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
-			ClearPageSavePinned(page);
+	for_each_process_thread(g, p) {
+		struct mm_struct *mm;
+		struct page *page;
+		pgd_t *pgd;
+
+		task_lock(p);
+		mm = p->mm;
+		if (mm) {
+			pgd = mm->pgd;
+			page = virt_to_page(pgd);
+
+			if (PageSavePinned(page)) {
+				BUG_ON(!PagePinned(page));
+				__xen_pgd_unpin(&init_mm, pgd);
+				ClearPageSavePinned(page);
+			}
 		}
+		task_unlock(p);
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs in certain rare cases,
like register_die_notifier().

This use seems to be somewhat questionable, as most other vmalloc
page table fixups are vmalloc_fault() driven, but nevertheless
it's there and it's using the pgd_list.

But we don't need the global list, as we can walk the task list
under RCU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50342825f221..366b8232f4b3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,23 +235,35 @@ void vmalloc_sync_all(void)
 	for (address = VMALLOC_START & PMD_MASK;
 	     address >= TASK_SIZE && address < FIXADDR_TOP;
 	     address += PMD_SIZE) {
-		struct page *page;
 
-		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+		struct task_struct *g, *p;
+
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
-			pmd_t *ret;
+			pmd_t *pmd_ret;
 
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
+				continue;
+			}
 
+			/* The pgt_lock is only used on Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
-			ret = vmalloc_sync_one(page_address(page), address);
+			pmd_ret = vmalloc_sync_one(mm->pgd, address);
 			spin_unlock(pgt_lock);
 
-			if (!ret)
+			task_unlock(p);
+
+			if (!pmd_ret)
 				break;
 		}
+
 		spin_unlock(&pgd_lock);
 	}
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all()
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The vmalloc() code uses vmalloc_sync_all() to synchronize changes to
the global reference kernel PGD to task PGDs in certain rare cases,
like register_die_notifier().

This use seems to be somewhat questionable, as most other vmalloc
page table fixups are vmalloc_fault() driven, but nevertheless
it's there and it's using the pgd_list.

But we don't need the global list, as we can walk the task list
under RCU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50342825f221..366b8232f4b3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -235,23 +235,35 @@ void vmalloc_sync_all(void)
 	for (address = VMALLOC_START & PMD_MASK;
 	     address >= TASK_SIZE && address < FIXADDR_TOP;
 	     address += PMD_SIZE) {
-		struct page *page;
 
-		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
+		struct task_struct *g, *p;
+
+		spin_lock(&pgd_lock); /* Implies rcu_read_lock(): */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
 			spinlock_t *pgt_lock;
-			pmd_t *ret;
+			pmd_t *pmd_ret;
 
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			task_lock(p);
+			mm = p->mm;
+			if (!mm) {
+				task_unlock(p);
+				continue;
+			}
 
+			/* The pgt_lock is only used on Xen: */
+			pgt_lock = &mm->page_table_lock;
 			spin_lock(pgt_lock);
-			ret = vmalloc_sync_one(page_address(page), address);
+			pmd_ret = vmalloc_sync_one(mm->pgd, address);
 			spin_unlock(pgt_lock);
 
-			if (!ret)
+			task_unlock(p);
+
+			if (!pmd_ret)
 				break;
 		}
+
 		spin_unlock(&pgd_lock);
 	}
 }
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.

This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.

(This code already holds the pgd_lock.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 997fc97e9072..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -438,17 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	set_pte_atomic(kpte, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
-		struct page *page;
+		struct task_struct *g, *p;
 
-		list_for_each_entry(page, &pgd_list, lru) {
+		/* We are holding pgd_lock, which implies rcu_read_lock(): */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
+			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			pud = pud_offset(pgd, address);
-			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
+			task_lock(p);
+			mm = p->mm;
+			if (mm) {
+				pgt_lock = &mm->page_table_lock;
+				spin_lock(pgt_lock);
+
+				pgd = mm->pgd + pgd_index(address);
+				pud = pud_offset(pgd, address);
+				pmd = pmd_offset(pud, address);
+				set_pte_atomic((pte_t *)pmd, pte);
+
+				spin_unlock(pgt_lock);
+			}
+			task_unlock(p);
 		}
 	}
 #endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds.

This uses pgd_list currently, but we don't need the global
list as we can walk the task list under RCU.

(This code already holds the pgd_lock.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pageattr.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 997fc97e9072..4ff6a1808f1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -438,17 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 	set_pte_atomic(kpte, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
-		struct page *page;
+		struct task_struct *g, *p;
 
-		list_for_each_entry(page, &pgd_list, lru) {
+		/* We are holding pgd_lock, which implies rcu_read_lock(): */
+
+		for_each_process_thread(g, p) {
+			struct mm_struct *mm;
+			spinlock_t *pgt_lock;
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			pud = pud_offset(pgd, address);
-			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
+			task_lock(p);
+			mm = p->mm;
+			if (mm) {
+				pgt_lock = &mm->page_table_lock;
+				spin_lock(pgt_lock);
+
+				pgd = mm->pgd + pgd_index(address);
+				pud = pud_offset(pgd, address);
+				pmd = pmd_offset(pud, address);
+				set_pte_atomic((pte_t *)pmd, pte);
+
+				spin_unlock(pgt_lock);
+			}
+			task_unlock(p);
 		}
 	}
 #endif
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform specific code setting up kernel pagetables.

The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.

The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.

So we can remove the pgd_list addition/removal from this code.

The new PGD is private while constructed, so it needs no extra
locking.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++------------------------
 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7a561b7cc01c..0ab56d13f24d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
 	}
-
-	/* list required to sync kernel mapping updates */
-	if (!SHARED_KERNEL_PMD) {
-		pgd_set_mm(pgd, mm);
-		pgd_list_add(pgd);
-	}
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	spin_lock(&pgd_lock);
-	pgd_list_del(pgd);
-	spin_unlock(&pgd_lock);
 }
 
 /*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * Make sure that pre-populating the pmds is atomic with
-	 * respect to anything walking the pgd_list, so that they
-	 * never see a partially populated pgd.
+	 * No locking is needed here, as the PGD is still private,
+	 * so no code walking the task list and looking at mm->pgd
+	 * will be able to see it before it's fully constructed:
 	 */
-	spin_lock(&pgd_lock);
-
 	pgd_ctor(mm, pgd);
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
-	spin_unlock(&pgd_lock);
-
 	return pgd;
 
 out_free_pmds:
@@ -453,7 +433,6 @@ void arch_pgd_init_late(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
 	_pgd_free(pgd);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate
the PGD, with platform specific code setting up kernel pagetables.

The x86 code uses a global pgd_list with an associated lock to update
all PGDs of all tasks in the system synchronously.

The lock is still kept to synchronize updates to all PGDs in the system,
but all users of the list have been migrated to use the task list.

So we can remove the pgd_list addition/removal from this code.

The new PGD is private while constructed, so it needs no extra
locking.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++------------------------
 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7a561b7cc01c..0ab56d13f24d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
 	}
-
-	/* list required to sync kernel mapping updates */
-	if (!SHARED_KERNEL_PMD) {
-		pgd_set_mm(pgd, mm);
-		pgd_list_add(pgd);
-	}
-}
-
-static void pgd_dtor(pgd_t *pgd)
-{
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	spin_lock(&pgd_lock);
-	pgd_list_del(pgd);
-	spin_unlock(&pgd_lock);
 }
 
 /*
@@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * Make sure that pre-populating the pmds is atomic with
-	 * respect to anything walking the pgd_list, so that they
-	 * never see a partially populated pgd.
+	 * No locking is needed here, as the PGD is still private,
+	 * so no code walking the task list and looking at mm->pgd
+	 * will be able to see it before it's fully constructed:
 	 */
-	spin_lock(&pgd_lock);
-
 	pgd_ctor(mm, pgd);
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
-	spin_unlock(&pgd_lock);
-
 	return pgd;
 
 out_free_pmds:
@@ -453,7 +433,6 @@ void arch_pgd_init_late(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
 	_pgd_free(pgd);
 }
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 11/12] x86/mm: Remove pgd_list leftovers
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Nothing uses the pgd_list anymore - remove the list itself and its helpers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |  3 ---
 arch/x86/mm/fault.c            |  1 -
 arch/x86/mm/pgtable.c          | 26 --------------------------
 3 files changed, 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fe57e7a98839..7e93438d8b53 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 
 extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 366b8232f4b3..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);
 
 #ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0ab56d13f24d..7cca42c01677 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_del(&page->lru);
-}
-
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
-	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
-	virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
-	return (struct mm_struct *)page->index;
-}
-
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 {
 	/* If the pgd points to a shared pagetable level (either the
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 11/12] x86/mm: Remove pgd_list leftovers
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Nothing uses the pgd_list anymore - remove the list itself and its helpers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |  3 ---
 arch/x86/mm/fault.c            |  1 -
 arch/x86/mm/pgtable.c          | 26 --------------------------
 3 files changed, 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fe57e7a98839..7e93438d8b53 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 
 extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
-extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 366b8232f4b3..14b39c3511fd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-LIST_HEAD(pgd_list);
 
 #ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0ab56d13f24d..7cca42c01677 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_add(&page->lru, &pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-
-	list_del(&page->lru);
-}
-
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-
-static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
-{
-	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
-	virt_to_page(pgd)->index = (pgoff_t)mm;
-}
-
-struct mm_struct *pgd_page_get_mm(struct page *page)
-{
-	return (struct mm_struct *)page->index;
-}
-
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 {
 	/* If the pgd points to a shared pagetable level (either the
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 12/12] x86/mm: Simplify pgd_alloc()
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13  9:49   ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.

This is not necessary, it's enough if we clear it: the PGD will
then be properly updated by arch_pgd_init_late().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7cca42c01677..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
-	/* If the pgd points to a shared pagetable level (either the
-	   ptes in non-PAE, or shared PMD in PAE), then just copy the
-	   references from swapper_pg_dir. */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
-	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
-		clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
-				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
-				KERNEL_PGD_PTRS);
-	}
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * No locking is needed here, as the PGD is still private,
-	 * so no code walking the task list and looking at mm->pgd
-	 * will be able to see it before it's fully constructed:
+	 * Zero out the kernel portion here, we'll set them up in
+	 * arch_pgd_init_late(), when the pgd is globally
+	 * visible already per the task list, so that it cannot
+	 * miss updates.
+	 *
+	 * We need to zero it here, to make sure arch_pgd_init_late()
+	 * can initialize them without locking.
 	 */
-	pgd_ctor(mm, pgd);
+	memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
 	return pgd;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 12/12] x86/mm: Simplify pgd_alloc()
@ 2015-06-13  9:49   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  9:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

Right now pgd_alloc() uses pgd_ctor(), which copies over the
current swapper_pg_dir[] to a new task's PGD.

This is not necessary, it's enough if we clear it: the PGD will
then be properly updated by arch_pgd_init_late().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7cca42c01677..d7d341e57e33 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #define UNSHARED_PTRS_PER_PGD				\
 	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
 
-static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
-{
-	/* If the pgd points to a shared pagetable level (either the
-	   ptes in non-PAE, or shared PMD in PAE), then just copy the
-	   references from swapper_pg_dir. */
-	if (CONFIG_PGTABLE_LEVELS == 2 ||
-	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
-		clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
-				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
-				KERNEL_PGD_PTRS);
-	}
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		goto out_free_pmds;
 
 	/*
-	 * No locking is needed here, as the PGD is still private,
-	 * so no code walking the task list and looking at mm->pgd
-	 * will be able to see it before it's fully constructed:
+	 * Zero out the kernel portion here, we'll set them up in
+	 * arch_pgd_init_late(), when the pgd is globally
+	 * visible already per the task list, so that it cannot
+	 * miss updates.
+	 *
+	 * We need to zero it here, to make sure arch_pgd_init_late()
+	 * can initialize them without locking.
 	 */
-	pgd_ctor(mm, pgd);
+	memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t));
+
 	pgd_prepopulate_pmd(mm, pgd, pmds);
 
 	return pgd;
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* why do we need vmalloc_sync_all?
  2015-06-13  9:49 ` Ingo Molnar
@ 2015-06-13 18:58   ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-13 18:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

I didn't read v2 yet, but I'd like to ask a question.

Why do we need vmalloc_sync_all()?

It has a single caller, register_die_notifier() which calls it without
any explanation. IMO, this needs a comment at least.

I am not sure I understand the changelog in 101f12af correctly, but
at first glance vmalloc_sync_all() is no longer needed at least on x86,
do_page_fault() no longer does notify_die(DIE_PAGE_FAULT). And btw
DIE_PAGE_FAULT has no users. DIE_MNI too...

Perhaps we can simply kill it on x86?

As for other architectures I am not sure. arch/tile implements
vmalloc_sync_all() and uses notify_die() in do_page_fault().

And in any case register_die_notifier()->vmalloc_sync() looks strange.
If (say) arch/tile needs this to fix the problem with modules, perhaps
it should do vmalloc_sync_all() in do_init_module() paths?

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* why do we need vmalloc_sync_all?
@ 2015-06-13 18:58   ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-13 18:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

I didn't read v2 yet, but I'd like to ask a question.

Why do we need vmalloc_sync_all()?

It has a single caller, register_die_notifier() which calls it without
any explanation. IMO, this needs a comment at least.

I am not sure I understand the changelog in 101f12af correctly, but
at first glance vmalloc_sync_all() is no longer needed at least on x86,
do_page_fault() no longer does notify_die(DIE_PAGE_FAULT). And btw
DIE_PAGE_FAULT has no users. DIE_MNI too...

Perhaps we can simply kill it on x86?

As for other architectures I am not sure. arch/tile implements
vmalloc_sync_all() and uses notify_die() in do_page_fault().

And in any case register_die_notifier()->vmalloc_sync() looks strange.
If (say) arch/tile needs this to fix the problem with modules, perhaps
it should do vmalloc_sync_all() in do_init_module() paths?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-13  9:49   ` Ingo Molnar
@ 2015-06-13 19:24     ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-13 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/13, Ingo Molnar wrote:
>
> @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
>  	for (address = start; address <= end; address += PGDIR_SIZE) {
>  		const pgd_t *pgd_ref = pgd_offset_k(address);
> -		struct page *page;
> +		struct task_struct *g, *p;
>
>  		/*
> -		 * When it is called after memory hot remove, pgd_none()
> -		 * returns true. In this case (removed == 1), we must clear
> -		 * the PGD entries in the local PGD level page.
> +		 * When this function is called after memory hot remove,
> +		 * pgd_none() already returns true, but only the reference
> +		 * kernel PGD has been cleared, not the process PGDs.
> +		 *
> +		 * So clear the affected entries in every process PGD as well:
>  		 */
>  		if (pgd_none(*pgd_ref) && !removed)
>  			continue;
>
> -		spin_lock(&pgd_lock);
> -		list_for_each_entry(page, &pgd_list, lru) {
> +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
                                         ^^^^^^^^^^^^^^^^^^^^^^^

Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand
how it actually works ;) But, say, rcu_check_callbacks() can be called from
irq and since spin_lock() doesn't increment current->rcu_read_lock_nesting
this can lead to rcu_preempt_qs()?

> +		for_each_process_thread(g, p) {
> +			struct mm_struct *mm;
>  			pgd_t *pgd;
>  			spinlock_t *pgt_lock;
>
> -			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> -			/* the pgt_lock only for Xen */
> -			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> +			task_lock(p);
> +			mm = p->mm;
> +			if (!mm) {
> +				task_unlock(p);
> +				continue;
> +			}

Again, you can simplify this code and avoid for_each_process_thread() if
you use for_each_process() + find_lock_task_mm().

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-13 19:24     ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-13 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/13, Ingo Molnar wrote:
>
> @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
>
>  	for (address = start; address <= end; address += PGDIR_SIZE) {
>  		const pgd_t *pgd_ref = pgd_offset_k(address);
> -		struct page *page;
> +		struct task_struct *g, *p;
>
>  		/*
> -		 * When it is called after memory hot remove, pgd_none()
> -		 * returns true. In this case (removed == 1), we must clear
> -		 * the PGD entries in the local PGD level page.
> +		 * When this function is called after memory hot remove,
> +		 * pgd_none() already returns true, but only the reference
> +		 * kernel PGD has been cleared, not the process PGDs.
> +		 *
> +		 * So clear the affected entries in every process PGD as well:
>  		 */
>  		if (pgd_none(*pgd_ref) && !removed)
>  			continue;
>
> -		spin_lock(&pgd_lock);
> -		list_for_each_entry(page, &pgd_list, lru) {
> +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
                                         ^^^^^^^^^^^^^^^^^^^^^^^

Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand
how it actually works ;) But, say, rcu_check_callbacks() can be called from
irq and since spin_lock() doesn't increment current->rcu_read_lock_nesting
this can lead to rcu_preempt_qs()?

> +		for_each_process_thread(g, p) {
> +			struct mm_struct *mm;
>  			pgd_t *pgd;
>  			spinlock_t *pgt_lock;
>
> -			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> -			/* the pgt_lock only for Xen */
> -			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> +			task_lock(p);
> +			mm = p->mm;
> +			if (!mm) {
> +				task_unlock(p);
> +				continue;
> +			}

Again, you can simplify this code and avoid for_each_process_thread() if
you use for_each_process() + find_lock_task_mm().

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-13 19:24     ` Oleg Nesterov
@ 2015-06-14  7:36       ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  7:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/13, Ingo Molnar wrote:
> >
> > @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> >  	for (address = start; address <= end; address += PGDIR_SIZE) {
> >  		const pgd_t *pgd_ref = pgd_offset_k(address);
> > -		struct page *page;
> > +		struct task_struct *g, *p;
> >
> >  		/*
> > -		 * When it is called after memory hot remove, pgd_none()
> > -		 * returns true. In this case (removed == 1), we must clear
> > -		 * the PGD entries in the local PGD level page.
> > +		 * When this function is called after memory hot remove,
> > +		 * pgd_none() already returns true, but only the reference
> > +		 * kernel PGD has been cleared, not the process PGDs.
> > +		 *
> > +		 * So clear the affected entries in every process PGD as well:
> >  		 */
> >  		if (pgd_none(*pgd_ref) && !removed)
> >  			continue;
> >
> > -		spin_lock(&pgd_lock);
> > -		list_for_each_entry(page, &pgd_list, lru) {
> > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>                                          ^^^^^^^^^^^^^^^^^^^^^^^
> 
> Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it 
> actually works ;) But, say, rcu_check_callbacks() can be called from irq and 
> since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead 
> to rcu_preempt_qs()?

No, RCU grace periods are still defined by 'heavy' context boundaries such as 
context switches, entering idle or user-space mode.

PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the 
RCU read critical section - that is why it uses a separate nesting counter 
(current->rcu_read_lock_nesting), not the preempt count.

But if a piece of kernel code is non-preemptible, such as a spinlocked region or 
an irqs-off region, then those are still natural RCU read lock regions, regardless 
of the RCU model, and need no additional RCU locking.

rcu_check_callbacks() can be called from irq context, but only to observe whether 
the current CPU is in quiescent state. If it interrupts a spinlocked region it 
won't register a quiesent state.

> > +		for_each_process_thread(g, p) {
> > +			struct mm_struct *mm;
> >  			pgd_t *pgd;
> >  			spinlock_t *pgt_lock;
> >
> > -			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> > -			/* the pgt_lock only for Xen */
> > -			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> > +			task_lock(p);
> > +			mm = p->mm;
> > +			if (!mm) {
> > +				task_unlock(p);
> > +				continue;
> > +			}
> 
> Again, you can simplify this code and avoid for_each_process_thread() if you use 
> for_each_process() + find_lock_task_mm().

True!

So I looked at this when you first mentioned it but mis-read find_lock_task_mm(), 
which as you insist is exactly what this iteration needs to become faster and 
simpler. Thanks for the reminder - I have fixed it, will be part of -v3.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-14  7:36       ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  7:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/13, Ingo Molnar wrote:
> >
> > @@ -169,29 +169,40 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> >
> >  	for (address = start; address <= end; address += PGDIR_SIZE) {
> >  		const pgd_t *pgd_ref = pgd_offset_k(address);
> > -		struct page *page;
> > +		struct task_struct *g, *p;
> >
> >  		/*
> > -		 * When it is called after memory hot remove, pgd_none()
> > -		 * returns true. In this case (removed == 1), we must clear
> > -		 * the PGD entries in the local PGD level page.
> > +		 * When this function is called after memory hot remove,
> > +		 * pgd_none() already returns true, but only the reference
> > +		 * kernel PGD has been cleared, not the process PGDs.
> > +		 *
> > +		 * So clear the affected entries in every process PGD as well:
> >  		 */
> >  		if (pgd_none(*pgd_ref) && !removed)
> >  			continue;
> >
> > -		spin_lock(&pgd_lock);
> > -		list_for_each_entry(page, &pgd_list, lru) {
> > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>                                          ^^^^^^^^^^^^^^^^^^^^^^^
> 
> Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it 
> actually works ;) But, say, rcu_check_callbacks() can be called from irq and 
> since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead 
> to rcu_preempt_qs()?

No, RCU grace periods are still defined by 'heavy' context boundaries such as 
context switches, entering idle or user-space mode.

PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the 
RCU read critical section - that is why it uses a separate nesting counter 
(current->rcu_read_lock_nesting), not the preempt count.

But if a piece of kernel code is non-preemptible, such as a spinlocked region or 
an irqs-off region, then those are still natural RCU read lock regions, regardless 
of the RCU model, and need no additional RCU locking.

rcu_check_callbacks() can be called from irq context, but only to observe whether 
the current CPU is in quiescent state. If it interrupts a spinlocked region it 
won't register a quiesent state.

> > +		for_each_process_thread(g, p) {
> > +			struct mm_struct *mm;
> >  			pgd_t *pgd;
> >  			spinlock_t *pgt_lock;
> >
> > -			pgd = (pgd_t *)page_address(page) + pgd_index(address);
> > -			/* the pgt_lock only for Xen */
> > -			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> > +			task_lock(p);
> > +			mm = p->mm;
> > +			if (!mm) {
> > +				task_unlock(p);
> > +				continue;
> > +			}
> 
> Again, you can simplify this code and avoid for_each_process_thread() if you use 
> for_each_process() + find_lock_task_mm().

True!

So I looked at this when you first mentioned it but mis-read find_lock_task_mm(), 
which as you insist is exactly what this iteration needs to become faster and 
simpler. Thanks for the reminder - I have fixed it, will be part of -v3.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-13 18:58   ` Oleg Nesterov
@ 2015-06-14  7:59     ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  7:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> I didn't read v2 yet, but I'd like to ask a question.
> 
> Why do we need vmalloc_sync_all()?
> 
> It has a single caller, register_die_notifier() which calls it without
> any explanation. IMO, this needs a comment at least.

Yes, it's used to work around crashes in modular callbacks: if the callbacks 
happens to be called from within the page fault path, before the vmalloc page 
fault handler runs, then we have a catch-22 problem.

It's rare but not entirely impossible.

> I am not sure I understand the changelog in 101f12af correctly, but at first 
> glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault() 
> no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users. 
> DIE_MNI too...
> 
> Perhaps we can simply kill it on x86?

So in theory we could still have it run from DIE_OOPS, and that could turn a 
survivable kernel crash into a non-survivable one.

Note that all of this will go away if we also do the vmalloc fault handling 
simplification that I discussed with Andy:

 - this series already makes the set of kernel PGDs strictly monotonically 
   increasing during the lifetime of the x86 kernel

 - if in a subsequent patch we can synchronize new PGDs right after the vmalloc
   code creates it, before the area is used - so we can remove vmalloc_fault()
   altogether [or rather, turn it into a debug warning initially].
   vmalloc_fault() is a clever but somewhat fragile complication.

 - after that we can simply remove vmalloc_sync_all() from x86, because all active 
   vmalloc areas will be fully instantiated, all the time, on x86.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-14  7:59     ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  7:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Oleg Nesterov <oleg@redhat.com> wrote:

> I didn't read v2 yet, but I'd like to ask a question.
> 
> Why do we need vmalloc_sync_all()?
> 
> It has a single caller, register_die_notifier() which calls it without
> any explanation. IMO, this needs a comment at least.

Yes, it's used to work around crashes in modular callbacks: if the callbacks 
happens to be called from within the page fault path, before the vmalloc page 
fault handler runs, then we have a catch-22 problem.

It's rare but not entirely impossible.

> I am not sure I understand the changelog in 101f12af correctly, but at first 
> glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault() 
> no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users. 
> DIE_MNI too...
> 
> Perhaps we can simply kill it on x86?

So in theory we could still have it run from DIE_OOPS, and that could turn a 
survivable kernel crash into a non-survivable one.

Note that all of this will go away if we also do the vmalloc fault handling 
simplification that I discussed with Andy:

 - this series already makes the set of kernel PGDs strictly monotonically 
   increasing during the lifetime of the x86 kernel

 - if in a subsequent patch we can synchronize new PGDs right after the vmalloc
   code creates it, before the area is used - so we can remove vmalloc_fault()
   altogether [or rather, turn it into a debug warning initially].
   vmalloc_fault() is a clever but somewhat fragile complication.

 - after that we can simply remove vmalloc_sync_all() from x86, because all active 
   vmalloc areas will be fully instantiated, all the time, on x86.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  9:49   ` Ingo Molnar
@ 2015-06-14  8:26     ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  8:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long


* Ingo Molnar <mingo@kernel.org> wrote:

> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>   */
>  void xen_mm_unpin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (PageSavePinned(page)) {
> -			BUG_ON(!PagePinned(page));
> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> -			ClearPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (PageSavePinned(page)) {
> +				BUG_ON(!PagePinned(page));
> +				__xen_pgd_unpin(&init_mm, pgd);
> +				ClearPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> +	rcu_read_unlock();
>  }

I also removed the leftover stray rcu_read_unlock() from -v3.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-14  8:26     ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-14  8:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long


* Ingo Molnar <mingo@kernel.org> wrote:

> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>   */
>  void xen_mm_unpin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (PageSavePinned(page)) {
> -			BUG_ON(!PagePinned(page));
> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> -			ClearPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (PageSavePinned(page)) {
> +				BUG_ON(!PagePinned(page));
> +				__xen_pgd_unpin(&init_mm, pgd);
> +				ClearPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> +	rcu_read_unlock();
>  }

I also removed the leftover stray rcu_read_unlock() from -v3.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-14  7:36       ` Ingo Molnar
@ 2015-06-14 19:24         ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney

On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > to rcu_preempt_qs()?
>
> No, RCU grace periods are still defined by 'heavy' context boundaries such as
> context switches, entering idle or user-space mode.
>
> PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> RCU read critical section - that is why it uses a separate nesting counter
> (current->rcu_read_lock_nesting), not the preempt count.

Yes.

> But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> an irqs-off region, then those are still natural RCU read lock regions, regardless
> of the RCU model, and need no additional RCU locking.

I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
finish the gp, but if there are no other rcu-read-lock holders then it
seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
this CPU no longer needs rcu_preempt_note_context_switch().

OK, I can be easily wrong, I do not really understand the implementation
of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
with the _current_ implementation. Still this doesn't look right even if
happens to work, and Documentation/RCU/checklist.txt says:

11.	Note that synchronize_rcu() -only- guarantees to wait until
	all currently executing rcu_read_lock()-protected RCU read-side
	critical sections complete.  It does -not- necessarily guarantee
	that all currently running interrupts, NMIs, preempt_disable()
	code, or idle loops will complete.  Therefore, if your
	read-side critical sections are protected by something other
	than rcu_read_lock(), do -not- use synchronize_rcu().

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-14 19:24         ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney

On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > to rcu_preempt_qs()?
>
> No, RCU grace periods are still defined by 'heavy' context boundaries such as
> context switches, entering idle or user-space mode.
>
> PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> RCU read critical section - that is why it uses a separate nesting counter
> (current->rcu_read_lock_nesting), not the preempt count.

Yes.

> But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> an irqs-off region, then those are still natural RCU read lock regions, regardless
> of the RCU model, and need no additional RCU locking.

I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
finish the gp, but if there are no other rcu-read-lock holders then it
seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
this CPU no longer needs rcu_preempt_note_context_switch().

OK, I can be easily wrong, I do not really understand the implementation
of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
with the _current_ implementation. Still this doesn't look right even if
happens to work, and Documentation/RCU/checklist.txt says:

11.	Note that synchronize_rcu() -only- guarantees to wait until
	all currently executing rcu_read_lock()-protected RCU read-side
	critical sections complete.  It does -not- necessarily guarantee
	that all currently running interrupts, NMIs, preempt_disable()
	code, or idle loops will complete.  Therefore, if your
	read-side critical sections are protected by something other
	than rcu_read_lock(), do -not- use synchronize_rcu().

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-14 19:24         ` Oleg Nesterov
@ 2015-06-14 19:38           ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 19:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney

On 06/14, Oleg Nesterov wrote:
>
> On 06/14, Ingo Molnar wrote:
> >
> > * Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > to rcu_preempt_qs()?
> >
> > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > context switches, entering idle or user-space mode.
> >
> > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > RCU read critical section - that is why it uses a separate nesting counter
> > (current->rcu_read_lock_nesting), not the preempt count.
>
> Yes.
>
> > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > of the RCU model, and need no additional RCU locking.
>
> I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> finish the gp, but if there are no other rcu-read-lock holders then it
> seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> this CPU no longer needs rcu_preempt_note_context_switch().
>
> OK, I can be easily wrong, I do not really understand the implementation
> of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> with the _current_ implementation. Still this doesn't look right even if
> happens to work, and Documentation/RCU/checklist.txt says:
>
> 11.	Note that synchronize_rcu() -only- guarantees to wait until
> 	all currently executing rcu_read_lock()-protected RCU read-side
> 	critical sections complete.  It does -not- necessarily guarantee
> 	that all currently running interrupts, NMIs, preempt_disable()
> 	code, or idle loops will complete.  Therefore, if your
> 	read-side critical sections are protected by something other
> 	than rcu_read_lock(), do -not- use synchronize_rcu().


I've even checked this ;) I applied the stupid patch below and then

	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
	[1] 565

	$ taskset 1 perl -e 'syscall 157, 777'

	$
	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'

	$ dmesg -c
	SPIN start
	SYNC start
	SYNC done!
	SPIN done!

Oleg.

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 }
 #endif
 
+#include <linux/delay.h>
+
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case 666:
+		preempt_disable();
+		pr_crit("SPIN start\n");
+		while (arg2--)
+			mdelay(1);
+		pr_crit("SPIN done!\n");
+		preempt_enable();
+		break;
+	case 777:
+		pr_crit("SYNC start\n");
+		synchronize_rcu();
+		pr_crit("SYNC done!\n");
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-14 19:38           ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 19:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long,
	Paul E. McKenney

On 06/14, Oleg Nesterov wrote:
>
> On 06/14, Ingo Molnar wrote:
> >
> > * Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > to rcu_preempt_qs()?
> >
> > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > context switches, entering idle or user-space mode.
> >
> > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > RCU read critical section - that is why it uses a separate nesting counter
> > (current->rcu_read_lock_nesting), not the preempt count.
>
> Yes.
>
> > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > of the RCU model, and need no additional RCU locking.
>
> I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> finish the gp, but if there are no other rcu-read-lock holders then it
> seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> this CPU no longer needs rcu_preempt_note_context_switch().
>
> OK, I can be easily wrong, I do not really understand the implementation
> of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> with the _current_ implementation. Still this doesn't look right even if
> happens to work, and Documentation/RCU/checklist.txt says:
>
> 11.	Note that synchronize_rcu() -only- guarantees to wait until
> 	all currently executing rcu_read_lock()-protected RCU read-side
> 	critical sections complete.  It does -not- necessarily guarantee
> 	that all currently running interrupts, NMIs, preempt_disable()
> 	code, or idle loops will complete.  Therefore, if your
> 	read-side critical sections are protected by something other
> 	than rcu_read_lock(), do -not- use synchronize_rcu().


I've even checked this ;) I applied the stupid patch below and then

	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
	[1] 565

	$ taskset 1 perl -e 'syscall 157, 777'

	$
	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'

	$ dmesg -c
	SPIN start
	SYNC start
	SYNC done!
	SPIN done!

Oleg.

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 }
 #endif
 
+#include <linux/delay.h>
+
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case 666:
+		preempt_disable();
+		pr_crit("SPIN start\n");
+		while (arg2--)
+			mdelay(1);
+		pr_crit("SPIN done!\n");
+		preempt_enable();
+		break;
+	case 777:
+		pr_crit("SYNC start\n");
+		synchronize_rcu();
+		pr_crit("SYNC done!\n");
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-14  7:59     ` Ingo Molnar
@ 2015-06-14 20:06       ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 20:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > I didn't read v2 yet, but I'd like to ask a question.
> >
> > Why do we need vmalloc_sync_all()?
> >
> > It has a single caller, register_die_notifier() which calls it without
> > any explanation. IMO, this needs a comment at least.
>
> Yes, it's used to work around crashes in modular callbacks: if the callbacks
> happens to be called from within the page fault path, before the vmalloc page
> fault handler runs, then we have a catch-22 problem.
>
> It's rare but not entirely impossible.

But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
handler could try to resolve this problem and trigger another fault.

> > I am not sure I understand the changelog in 101f12af correctly, but at first
> > glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault()
> > no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users.
> > DIE_MNI too...
> >
> > Perhaps we can simply kill it on x86?
>
> So in theory we could still have it run from DIE_OOPS, and that could turn a
> survivable kernel crash into a non-survivable one.

I don't understand... But OK, my understanding of this magic is very limited,
please forget.

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-14 20:06       ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-14 20:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On 06/14, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > I didn't read v2 yet, but I'd like to ask a question.
> >
> > Why do we need vmalloc_sync_all()?
> >
> > It has a single caller, register_die_notifier() which calls it without
> > any explanation. IMO, this needs a comment at least.
>
> Yes, it's used to work around crashes in modular callbacks: if the callbacks
> happens to be called from within the page fault path, before the vmalloc page
> fault handler runs, then we have a catch-22 problem.
>
> It's rare but not entirely impossible.

But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
handler could try to resolve this problem and trigger another fault.

> > I am not sure I understand the changelog in 101f12af correctly, but at first
> > glance vmalloc_sync_all() is no longer needed at least on x86, do_page_fault()
> > no longer does notify_die(DIE_PAGE_FAULT). And btw DIE_PAGE_FAULT has no users.
> > DIE_MNI too...
> >
> > Perhaps we can simply kill it on x86?
>
> So in theory we could still have it run from DIE_OOPS, and that could turn a
> survivable kernel crash into a non-survivable one.

I don't understand... But OK, my understanding of this magic is very limited,
please forget.

Thanks,

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-14 19:38           ` Oleg Nesterov
@ 2015-06-15  0:40             ` Paul E. McKenney
  -1 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-06-15  0:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> On 06/14, Oleg Nesterov wrote:
> >
> > On 06/14, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > to rcu_preempt_qs()?
> > >
> > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > context switches, entering idle or user-space mode.
> > >
> > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > RCU read critical section - that is why it uses a separate nesting counter
> > > (current->rcu_read_lock_nesting), not the preempt count.
> >
> > Yes.
> >
> > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > of the RCU model, and need no additional RCU locking.
> >
> > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > finish the gp, but if there are no other rcu-read-lock holders then it
> > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > this CPU no longer needs rcu_preempt_note_context_switch().
> >
> > OK, I can be easily wrong, I do not really understand the implementation
> > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > with the _current_ implementation. Still this doesn't look right even if
> > happens to work, and Documentation/RCU/checklist.txt says:
> >
> > 11.	Note that synchronize_rcu() -only- guarantees to wait until
> > 	all currently executing rcu_read_lock()-protected RCU read-side
> > 	critical sections complete.  It does -not- necessarily guarantee
> > 	that all currently running interrupts, NMIs, preempt_disable()
> > 	code, or idle loops will complete.  Therefore, if your
> > 	read-side critical sections are protected by something other
> > 	than rcu_read_lock(), do -not- use synchronize_rcu().
> 
> 
> I've even checked this ;) I applied the stupid patch below and then
> 
> 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> 	[1] 565
> 
> 	$ taskset 1 perl -e 'syscall 157, 777'
> 
> 	$
> 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
> 
> 	$ dmesg -c
> 	SPIN start
> 	SYNC start
> 	SYNC done!
> 	SPIN done!

Please accept my apologies for my late entry to this thread.
Youngest kid graduated from university this weekend, so my
attention has been elsewhere.

If you were to disable interrupts instead of preemption, I would expect
that the preemptible-RCU grace period would be blocked -- though I am
not particularly comfortable with people relying on disabled interrupts
blocking a preemptible-RCU grace period.

Here is what can happen if you try to block a preemptible-RCU grace
period by disabling preemption, assuming that there are at least two
online CPUs in the system:

1.	CPU 0 does spin_lock(), which disables preemption.

2.	CPU 1 starts a grace period.

3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
	and the RCU_SOFTIRQ handler notes that there is a new grace
	period and sets state so that a subsequent quiescent state on
	this CPU will be noted.

4.	CPU 0 takes another scheduling-clock interrupt, which checks
	current->rcu_read_lock_nesting, and notes that there is no
	preemptible-RCU read-side critical section in progress.  It
	again raises softirq, and the RCU_SOFTIRQ handler reports
	the quiescent state to core RCU.

5.	Once each of the other CPUs report a quiescent state, the
	grace period can end, despite CPU 0 having preemption
	disabled the whole time.

So Oleg's test is correct, disabling preemption is not sufficient
to block a preemptible-RCU grace period.

The usual suggestion would be to add rcu_read_lock() just after the
lock is acquired and rcu_read_unlock() just before each release of that
same lock.  Putting the entire RCU read-side critical section under
the lock prevents RCU from having to invoke rcu_read_unlock_special()
due to preemption.  (It might still invoke it if the RCU read-side
critical section was overly long, but that is much cheaper than the
preemption-handling case.)

							Thanx, Paul

> Oleg.
> 
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
>  }
>  #endif
> 
> +#include <linux/delay.h>
> +
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> 
>  	error = 0;
>  	switch (option) {
> +	case 666:
> +		preempt_disable();
> +		pr_crit("SPIN start\n");
> +		while (arg2--)
> +			mdelay(1);
> +		pr_crit("SPIN done!\n");
> +		preempt_enable();
> +		break;
> +	case 777:
> +		pr_crit("SYNC start\n");
> +		synchronize_rcu();
> +		pr_crit("SYNC done!\n");
> +		break;
>  	case PR_SET_PDEATHSIG:
>  		if (!valid_signal(arg2)) {
>  			error = -EINVAL;
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-15  0:40             ` Paul E. McKenney
  0 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-06-15  0:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> On 06/14, Oleg Nesterov wrote:
> >
> > On 06/14, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > to rcu_preempt_qs()?
> > >
> > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > context switches, entering idle or user-space mode.
> > >
> > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > RCU read critical section - that is why it uses a separate nesting counter
> > > (current->rcu_read_lock_nesting), not the preempt count.
> >
> > Yes.
> >
> > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > of the RCU model, and need no additional RCU locking.
> >
> > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > finish the gp, but if there are no other rcu-read-lock holders then it
> > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > this CPU no longer needs rcu_preempt_note_context_switch().
> >
> > OK, I can be easily wrong, I do not really understand the implementation
> > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > with the _current_ implementation. Still this doesn't look right even if
> > happens to work, and Documentation/RCU/checklist.txt says:
> >
> > 11.	Note that synchronize_rcu() -only- guarantees to wait until
> > 	all currently executing rcu_read_lock()-protected RCU read-side
> > 	critical sections complete.  It does -not- necessarily guarantee
> > 	that all currently running interrupts, NMIs, preempt_disable()
> > 	code, or idle loops will complete.  Therefore, if your
> > 	read-side critical sections are protected by something other
> > 	than rcu_read_lock(), do -not- use synchronize_rcu().
> 
> 
> I've even checked this ;) I applied the stupid patch below and then
> 
> 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> 	[1] 565
> 
> 	$ taskset 1 perl -e 'syscall 157, 777'
> 
> 	$
> 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
> 
> 	$ dmesg -c
> 	SPIN start
> 	SYNC start
> 	SYNC done!
> 	SPIN done!

Please accept my apologies for my late entry to this thread.
Youngest kid graduated from university this weekend, so my
attention has been elsewhere.

If you were to disable interrupts instead of preemption, I would expect
that the preemptible-RCU grace period would be blocked -- though I am
not particularly comfortable with people relying on disabled interrupts
blocking a preemptible-RCU grace period.

Here is what can happen if you try to block a preemptible-RCU grace
period by disabling preemption, assuming that there are at least two
online CPUs in the system:

1.	CPU 0 does spin_lock(), which disables preemption.

2.	CPU 1 starts a grace period.

3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
	and the RCU_SOFTIRQ handler notes that there is a new grace
	period and sets state so that a subsequent quiescent state on
	this CPU will be noted.

4.	CPU 0 takes another scheduling-clock interrupt, which checks
	current->rcu_read_lock_nesting, and notes that there is no
	preemptible-RCU read-side critical section in progress.  It
	again raises softirq, and the RCU_SOFTIRQ handler reports
	the quiescent state to core RCU.

5.	Once each of the other CPUs report a quiescent state, the
	grace period can end, despite CPU 0 having preemption
	disabled the whole time.

So Oleg's test is correct, disabling preemption is not sufficient
to block a preemptible-RCU grace period.

The usual suggestion would be to add rcu_read_lock() just after the
lock is acquired and rcu_read_unlock() just before each release of that
same lock.  Putting the entire RCU read-side critical section under
the lock prevents RCU from having to invoke rcu_read_unlock_special()
due to preemption.  (It might still invoke it if the RCU read-side
critical section was overly long, but that is much cheaper than the
preemption-handling case.)

							Thanx, Paul

> Oleg.
> 
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
>  }
>  #endif
> 
> +#include <linux/delay.h>
> +
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> 
>  	error = 0;
>  	switch (option) {
> +	case 666:
> +		preempt_disable();
> +		pr_crit("SPIN start\n");
> +		while (arg2--)
> +			mdelay(1);
> +		pr_crit("SPIN done!\n");
> +		preempt_enable();
> +		break;
> +	case 777:
> +		pr_crit("SYNC start\n");
> +		synchronize_rcu();
> +		pr_crit("SYNC done!\n");
> +		break;
>  	case PR_SET_PDEATHSIG:
>  		if (!valid_signal(arg2)) {
>  			error = -EINVAL;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-14 20:06       ` Oleg Nesterov
@ 2015-06-15  2:47         ` Andi Kleen
  -1 siblings, 0 replies; 83+ messages in thread
From: Andi Kleen @ 2015-06-15  2:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

Oleg Nesterov <oleg@redhat.com> writes:
>
> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
> handler could try to resolve this problem and trigger another fault.

The same problem can happen from NMI handlers or machine check
handlers. It's not necessarily tied to page faults only.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-15  2:47         ` Andi Kleen
  0 siblings, 0 replies; 83+ messages in thread
From: Andi Kleen @ 2015-06-15  2:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

Oleg Nesterov <oleg@redhat.com> writes:
>
> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
> handler could try to resolve this problem and trigger another fault.

The same problem can happen from NMI handlers or machine check
handlers. It's not necessarily tied to page faults only.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-15  2:47         ` Andi Kleen
@ 2015-06-15  2:57           ` Andy Lutomirski
  -1 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-06-15  2:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Oleg Nesterov, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
>> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
>> handler could try to resolve this problem and trigger another fault.
>
> The same problem can happen from NMI handlers or machine check
> handlers. It's not necessarily tied to page faults only.

AIUI, the point of the one and only vmalloc_sync_all call is to
prevent infinitely recursive faults when we call a notify_die
callback.  The only thing that it could realistically protect is
module text or static non-per-cpu module data, since that's the only
thing that's reliably already in the init pgd.  I'm with Oleg: I don't
see how that can happen, since do_page_fault fixes up vmalloc faults
before it calls notify_die.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-15  2:57           ` Andy Lutomirski
  0 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-06-15  2:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Oleg Nesterov, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long

On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> But again, the kernel no longer does this? do_page_fault() does vmalloc_fault()
>> without notify_die(). If it fails, I do not see how/why a modular DIE_OOPS
>> handler could try to resolve this problem and trigger another fault.
>
> The same problem can happen from NMI handlers or machine check
> handlers. It's not necessarily tied to page faults only.

AIUI, the point of the one and only vmalloc_sync_all call is to
prevent infinitely recursive faults when we call a notify_die
callback.  The only thing that it could realistically protect is
module text or static non-per-cpu module data, since that's the only
thing that's reliably already in the init pgd.  I'm with Oleg: I don't
see how that can happen, since do_page_fault fixes up vmalloc faults
before it calls notify_die.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  9:49   ` Ingo Molnar
@ 2015-06-15  9:05     ` Ian Campbell
  -1 siblings, 0 replies; 83+ messages in thread
From: Ian Campbell @ 2015-06-15  9:05 UTC (permalink / raw)
  To: Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel
  Cc: linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> suspend/restore. It's a stop-all method that needs to iterate through
> all allocated pgds in the system to fix them up for Xen's use.
> 
> This code uses pgd_list, probably because it was an easy interface.
> 
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.
> 
> (As I don't use Xen this is was only build tested.)

In which case it seems extra important to copy the appropriate
maintainers, which I've done here.

Ian.

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index dd151b2045b0..70a3df5b0b54 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>   */
>  void xen_mm_pin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (!PagePinned(page)) {
> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
> -			SetPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (!PagePinned(page)) {
> +				__xen_pgd_pin(&init_mm, pgd);
> +				SetPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>   */
>  void xen_mm_unpin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (PageSavePinned(page)) {
> -			BUG_ON(!PagePinned(page));
> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> -			ClearPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (PageSavePinned(page)) {
> +				BUG_ON(!PagePinned(page));
> +				__xen_pgd_unpin(&init_mm, pgd);
> +				ClearPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> +	rcu_read_unlock();
>  }
>  
>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-15  9:05     ` Ian Campbell
  0 siblings, 0 replies; 83+ messages in thread
From: Ian Campbell @ 2015-06-15  9:05 UTC (permalink / raw)
  To: Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel
  Cc: linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> suspend/restore. It's a stop-all method that needs to iterate through
> all allocated pgds in the system to fix them up for Xen's use.
> 
> This code uses pgd_list, probably because it was an easy interface.
> 
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.
> 
> (As I don't use Xen this is was only build tested.)

In which case it seems extra important to copy the appropriate
maintainers, which I've done here.

Ian.

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index dd151b2045b0..70a3df5b0b54 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>   */
>  void xen_mm_pin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (!PagePinned(page)) {
> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
> -			SetPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (!PagePinned(page)) {
> +				__xen_pgd_pin(&init_mm, pgd);
> +				SetPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>   */
>  void xen_mm_unpin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (PageSavePinned(page)) {
> -			BUG_ON(!PagePinned(page));
> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> -			ClearPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (PageSavePinned(page)) {
> +				BUG_ON(!PagePinned(page));
> +				__xen_pgd_unpin(&init_mm, pgd);
> +				ClearPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> +	rcu_read_unlock();
>  }
>  
>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  9:49   ` Ingo Molnar
  (?)
  (?)
@ 2015-06-15  9:05   ` Ian Campbell
  -1 siblings, 0 replies; 83+ messages in thread
From: Ian Campbell @ 2015-06-15  9:05 UTC (permalink / raw)
  To: Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel
  Cc: Waiman Long, Denys Vlasenko, Peter Zijlstra, Brian Gerst,
	Oleg Nesterov, linux-kernel, Andy Lutomirski, linux-mm,
	Borislav Petkov, H. Peter Anvin, xen-devel, Andrew Morton,
	Linus Torvalds, Thomas Gleixner

On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
> suspend/restore. It's a stop-all method that needs to iterate through
> all allocated pgds in the system to fix them up for Xen's use.
> 
> This code uses pgd_list, probably because it was an easy interface.
> 
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.
> 
> (As I don't use Xen this is was only build tested.)

In which case it seems extra important to copy the appropriate
maintainers, which I've done here.

Ian.

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index dd151b2045b0..70a3df5b0b54 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>   */
>  void xen_mm_pin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (!PagePinned(page)) {
> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
> -			SetPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (!PagePinned(page)) {
> +				__xen_pgd_pin(&init_mm, pgd);
> +				SetPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>   */
>  void xen_mm_unpin_all(void)
>  {
> -	struct page *page;
> +	struct task_struct *g, *p;
>  
> -	spin_lock(&pgd_lock);
> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>  
> -	list_for_each_entry(page, &pgd_list, lru) {
> -		if (PageSavePinned(page)) {
> -			BUG_ON(!PagePinned(page));
> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
> -			ClearPageSavePinned(page);
> +	for_each_process_thread(g, p) {
> +		struct mm_struct *mm;
> +		struct page *page;
> +		pgd_t *pgd;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		if (mm) {
> +			pgd = mm->pgd;
> +			page = virt_to_page(pgd);
> +
> +			if (PageSavePinned(page)) {
> +				BUG_ON(!PagePinned(page));
> +				__xen_pgd_unpin(&init_mm, pgd);
> +				ClearPageSavePinned(page);
> +			}
>  		}
> +		task_unlock(p);
>  	}
>  
>  	spin_unlock(&pgd_lock);
> +	rcu_read_unlock();
>  }
>  
>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15  9:05     ` Ian Campbell
@ 2015-06-15 10:30       ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-15 10:30 UTC (permalink / raw)
  To: Ian Campbell, Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky
  Cc: linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On 15/06/15 10:05, Ian Campbell wrote:
> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>> suspend/restore. It's a stop-all method that needs to iterate through
>> all allocated pgds in the system to fix them up for Xen's use.
>>
>> This code uses pgd_list, probably because it was an easy interface.
>>
>> But we want to remove the pgd_list, so convert the code over to walk
>> all tasks in the system. This is an equivalent method.

It is not equivalent because pgd_alloc() now populates entries in pgds
that are not visible to xen_mm_pin_all() (note how the original code
adds the pgd to the pgd_list in pgd_ctor() before calling
pgd_prepopulate_pmd()).  These newly allocated page tables won't be
correctly converted on suspend/resume and the new process will die after
resume.

David

>>
>> (As I don't use Xen this is was only build tested.)
> 
> In which case it seems extra important to copy the appropriate
> maintainers, which I've done here.
> 
> Ian.
> 
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Denys Vlasenko <dvlasenk@redhat.com>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Waiman Long <Waiman.Long@hp.com>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 38 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index dd151b2045b0..70a3df5b0b54 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>>   */
>>  void xen_mm_pin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (!PagePinned(page)) {
>> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
>> -			SetPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (!PagePinned(page)) {
>> +				__xen_pgd_pin(&init_mm, pgd);
>> +				SetPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>>   */
>>  void xen_mm_unpin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (PageSavePinned(page)) {
>> -			BUG_ON(!PagePinned(page));
>> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
>> -			ClearPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (PageSavePinned(page)) {
>> +				BUG_ON(!PagePinned(page));
>> +				__xen_pgd_unpin(&init_mm, pgd);
>> +				ClearPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> +	rcu_read_unlock();
>>  }
>>  
>>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
> 
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-15 10:30       ` David Vrabel
  0 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-15 10:30 UTC (permalink / raw)
  To: Ian Campbell, Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky
  Cc: linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On 15/06/15 10:05, Ian Campbell wrote:
> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>> suspend/restore. It's a stop-all method that needs to iterate through
>> all allocated pgds in the system to fix them up for Xen's use.
>>
>> This code uses pgd_list, probably because it was an easy interface.
>>
>> But we want to remove the pgd_list, so convert the code over to walk
>> all tasks in the system. This is an equivalent method.

It is not equivalent because pgd_alloc() now populates entries in pgds
that are not visible to xen_mm_pin_all() (note how the original code
adds the pgd to the pgd_list in pgd_ctor() before calling
pgd_prepopulate_pmd()).  These newly allocated page tables won't be
correctly converted on suspend/resume and the new process will die after
resume.

David

>>
>> (As I don't use Xen this is was only build tested.)
> 
> In which case it seems extra important to copy the appropriate
> maintainers, which I've done here.
> 
> Ian.
> 
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Denys Vlasenko <dvlasenk@redhat.com>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Waiman Long <Waiman.Long@hp.com>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 38 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index dd151b2045b0..70a3df5b0b54 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>>   */
>>  void xen_mm_pin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (!PagePinned(page)) {
>> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
>> -			SetPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (!PagePinned(page)) {
>> +				__xen_pgd_pin(&init_mm, pgd);
>> +				SetPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>>   */
>>  void xen_mm_unpin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (PageSavePinned(page)) {
>> -			BUG_ON(!PagePinned(page));
>> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
>> -			ClearPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (PageSavePinned(page)) {
>> +				BUG_ON(!PagePinned(page));
>> +				__xen_pgd_unpin(&init_mm, pgd);
>> +				ClearPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> +	rcu_read_unlock();
>>  }
>>  
>>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15  9:05     ` Ian Campbell
  (?)
  (?)
@ 2015-06-15 10:30     ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-15 10:30 UTC (permalink / raw)
  To: Ian Campbell, Ingo Molnar, Konrad Rzeszutek Wilk, Boris Ostrovsky
  Cc: Waiman Long, Denys Vlasenko, Peter Zijlstra, Brian Gerst,
	Oleg Nesterov, linux-kernel, Andy Lutomirski, linux-mm,
	Borislav Petkov, H. Peter Anvin, xen-devel, Andrew Morton,
	Linus Torvalds, Thomas Gleixner

On 15/06/15 10:05, Ian Campbell wrote:
> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>> suspend/restore. It's a stop-all method that needs to iterate through
>> all allocated pgds in the system to fix them up for Xen's use.
>>
>> This code uses pgd_list, probably because it was an easy interface.
>>
>> But we want to remove the pgd_list, so convert the code over to walk
>> all tasks in the system. This is an equivalent method.

It is not equivalent because pgd_alloc() now populates entries in pgds
that are not visible to xen_mm_pin_all() (note how the original code
adds the pgd to the pgd_list in pgd_ctor() before calling
pgd_prepopulate_pmd()).  These newly allocated page tables won't be
correctly converted on suspend/resume and the new process will die after
resume.

David

>>
>> (As I don't use Xen this is was only build tested.)
> 
> In which case it seems extra important to copy the appropriate
> maintainers, which I've done here.
> 
> Ian.
> 
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Denys Vlasenko <dvlasenk@redhat.com>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Waiman Long <Waiman.Long@hp.com>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  arch/x86/xen/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 38 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index dd151b2045b0..70a3df5b0b54 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -853,15 +853,27 @@ static void xen_pgd_pin(struct mm_struct *mm)
>>   */
>>  void xen_mm_pin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (!PagePinned(page)) {
>> -			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
>> -			SetPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (!PagePinned(page)) {
>> +				__xen_pgd_pin(&init_mm, pgd);
>> +				SetPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> @@ -967,19 +979,32 @@ static void xen_pgd_unpin(struct mm_struct *mm)
>>   */
>>  void xen_mm_unpin_all(void)
>>  {
>> -	struct page *page;
>> +	struct task_struct *g, *p;
>>  
>> -	spin_lock(&pgd_lock);
>> +	spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
>>  
>> -	list_for_each_entry(page, &pgd_list, lru) {
>> -		if (PageSavePinned(page)) {
>> -			BUG_ON(!PagePinned(page));
>> -			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
>> -			ClearPageSavePinned(page);
>> +	for_each_process_thread(g, p) {
>> +		struct mm_struct *mm;
>> +		struct page *page;
>> +		pgd_t *pgd;
>> +
>> +		task_lock(p);
>> +		mm = p->mm;
>> +		if (mm) {
>> +			pgd = mm->pgd;
>> +			page = virt_to_page(pgd);
>> +
>> +			if (PageSavePinned(page)) {
>> +				BUG_ON(!PagePinned(page));
>> +				__xen_pgd_unpin(&init_mm, pgd);
>> +				ClearPageSavePinned(page);
>> +			}
>>  		}
>> +		task_unlock(p);
>>  	}
>>  
>>  	spin_unlock(&pgd_lock);
>> +	rcu_read_unlock();
>>  }
>>  
>>  static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
> 
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-15  2:57           ` Andy Lutomirski
@ 2015-06-15 20:28             ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Oleg Nesterov, linux-kernel, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Oleg Nesterov <oleg@redhat.com> writes:
> >>
> >> But again, the kernel no longer does this? do_page_fault() does 
> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a 
> >> modular DIE_OOPS handler could try to resolve this problem and trigger 
> >> another fault.
> >
> > The same problem can happen from NMI handlers or machine check handlers. It's 
> > not necessarily tied to page faults only.
> 
> AIUI, the point of the one and only vmalloc_sync_all call is to prevent 
> infinitely recursive faults when we call a notify_die callback.  The only thing 
> that it could realistically protect is module text or static non-per-cpu module 
> data, since that's the only thing that's reliably already in the init pgd.  I'm 
> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc 
> faults before it calls notify_die.

Yes, but what I meant is that it can happen if due to an unrelated kernel bug and 
unlucky timing we have installed this new handler just when that other unrelated 
kernel bug triggers: say a #GPF crash in kernel code.

In any case it should all be mooted with the removal of lazy PGD instantiation.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-15 20:28             ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Oleg Nesterov, linux-kernel, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Oleg Nesterov <oleg@redhat.com> writes:
> >>
> >> But again, the kernel no longer does this? do_page_fault() does 
> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a 
> >> modular DIE_OOPS handler could try to resolve this problem and trigger 
> >> another fault.
> >
> > The same problem can happen from NMI handlers or machine check handlers. It's 
> > not necessarily tied to page faults only.
> 
> AIUI, the point of the one and only vmalloc_sync_all call is to prevent 
> infinitely recursive faults when we call a notify_die callback.  The only thing 
> that it could realistically protect is module text or static non-per-cpu module 
> data, since that's the only thing that's reliably already in the init pgd.  I'm 
> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc 
> faults before it calls notify_die.

Yes, but what I meant is that it can happen if due to an unrelated kernel bug and 
unlucky timing we have installed this new handler just when that other unrelated 
kernel bug triggers: say a #GPF crash in kernel code.

In any case it should all be mooted with the removal of lazy PGD instantiation.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
  2015-06-15  0:40             ` Paul E. McKenney
@ 2015-06-15 20:33               ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> > On 06/14, Oleg Nesterov wrote:
> > >
> > > On 06/14, Ingo Molnar wrote:
> > > >
> > > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > > > >
> > > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > > to rcu_preempt_qs()?
> > > >
> > > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > > context switches, entering idle or user-space mode.
> > > >
> > > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > > RCU read critical section - that is why it uses a separate nesting counter
> > > > (current->rcu_read_lock_nesting), not the preempt count.
> > >
> > > Yes.
> > >
> > > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > > of the RCU model, and need no additional RCU locking.
> > >
> > > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > > finish the gp, but if there are no other rcu-read-lock holders then it
> > > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > > this CPU no longer needs rcu_preempt_note_context_switch().
> > >
> > > OK, I can be easily wrong, I do not really understand the implementation
> > > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > > with the _current_ implementation. Still this doesn't look right even if
> > > happens to work, and Documentation/RCU/checklist.txt says:
> > >
> > > 11.	Note that synchronize_rcu() -only- guarantees to wait until
> > > 	all currently executing rcu_read_lock()-protected RCU read-side
> > > 	critical sections complete.  It does -not- necessarily guarantee
> > > 	that all currently running interrupts, NMIs, preempt_disable()
> > > 	code, or idle loops will complete.  Therefore, if your
> > > 	read-side critical sections are protected by something other
> > > 	than rcu_read_lock(), do -not- use synchronize_rcu().
> > 
> > 
> > I've even checked this ;) I applied the stupid patch below and then
> > 
> > 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> > 	[1] 565
> > 
> > 	$ taskset 1 perl -e 'syscall 157, 777'
> > 
> > 	$
> > 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
> > 
> > 	$ dmesg -c
> > 	SPIN start
> > 	SYNC start
> > 	SYNC done!
> > 	SPIN done!
> 
> Please accept my apologies for my late entry to this thread.
> Youngest kid graduated from university this weekend, so my
> attention has been elsewhere.

Congratulations! :-)

> If you were to disable interrupts instead of preemption, I would expect
> that the preemptible-RCU grace period would be blocked -- though I am
> not particularly comfortable with people relying on disabled interrupts
> blocking a preemptible-RCU grace period.
> 
> Here is what can happen if you try to block a preemptible-RCU grace
> period by disabling preemption, assuming that there are at least two
> online CPUs in the system:
> 
> 1.	CPU 0 does spin_lock(), which disables preemption.
> 
> 2.	CPU 1 starts a grace period.
> 
> 3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
> 	and the RCU_SOFTIRQ handler notes that there is a new grace
> 	period and sets state so that a subsequent quiescent state on
> 	this CPU will be noted.
> 
> 4.	CPU 0 takes another scheduling-clock interrupt, which checks
> 	current->rcu_read_lock_nesting, and notes that there is no
> 	preemptible-RCU read-side critical section in progress.  It
> 	again raises softirq, and the RCU_SOFTIRQ handler reports
> 	the quiescent state to core RCU.
> 
> 5.	Once each of the other CPUs report a quiescent state, the
> 	grace period can end, despite CPU 0 having preemption
> 	disabled the whole time.
> 
> So Oleg's test is correct, disabling preemption is not sufficient
> to block a preemptible-RCU grace period.

I stand corrected!

> The usual suggestion would be to add rcu_read_lock() just after the lock is 
> acquired and rcu_read_unlock() just before each release of that same lock.  

Will fix it that way.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
@ 2015-06-15 20:33               ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, linux-kernel, linux-mm, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Waiman Long


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> > On 06/14, Oleg Nesterov wrote:
> > >
> > > On 06/14, Ingo Molnar wrote:
> > > >
> > > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > > > >
> > > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > > to rcu_preempt_qs()?
> > > >
> > > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > > context switches, entering idle or user-space mode.
> > > >
> > > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > > RCU read critical section - that is why it uses a separate nesting counter
> > > > (current->rcu_read_lock_nesting), not the preempt count.
> > >
> > > Yes.
> > >
> > > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > > of the RCU model, and need no additional RCU locking.
> > >
> > > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > > finish the gp, but if there are no other rcu-read-lock holders then it
> > > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > > this CPU no longer needs rcu_preempt_note_context_switch().
> > >
> > > OK, I can be easily wrong, I do not really understand the implementation
> > > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > > with the _current_ implementation. Still this doesn't look right even if
> > > happens to work, and Documentation/RCU/checklist.txt says:
> > >
> > > 11.	Note that synchronize_rcu() -only- guarantees to wait until
> > > 	all currently executing rcu_read_lock()-protected RCU read-side
> > > 	critical sections complete.  It does -not- necessarily guarantee
> > > 	that all currently running interrupts, NMIs, preempt_disable()
> > > 	code, or idle loops will complete.  Therefore, if your
> > > 	read-side critical sections are protected by something other
> > > 	than rcu_read_lock(), do -not- use synchronize_rcu().
> > 
> > 
> > I've even checked this ;) I applied the stupid patch below and then
> > 
> > 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> > 	[1] 565
> > 
> > 	$ taskset 1 perl -e 'syscall 157, 777'
> > 
> > 	$
> > 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
> > 
> > 	$ dmesg -c
> > 	SPIN start
> > 	SYNC start
> > 	SYNC done!
> > 	SPIN done!
> 
> Please accept my apologies for my late entry to this thread.
> Youngest kid graduated from university this weekend, so my
> attention has been elsewhere.

Congratulations! :-)

> If you were to disable interrupts instead of preemption, I would expect
> that the preemptible-RCU grace period would be blocked -- though I am
> not particularly comfortable with people relying on disabled interrupts
> blocking a preemptible-RCU grace period.
> 
> Here is what can happen if you try to block a preemptible-RCU grace
> period by disabling preemption, assuming that there are at least two
> online CPUs in the system:
> 
> 1.	CPU 0 does spin_lock(), which disables preemption.
> 
> 2.	CPU 1 starts a grace period.
> 
> 3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
> 	and the RCU_SOFTIRQ handler notes that there is a new grace
> 	period and sets state so that a subsequent quiescent state on
> 	this CPU will be noted.
> 
> 4.	CPU 0 takes another scheduling-clock interrupt, which checks
> 	current->rcu_read_lock_nesting, and notes that there is no
> 	preemptible-RCU read-side critical section in progress.  It
> 	again raises softirq, and the RCU_SOFTIRQ handler reports
> 	the quiescent state to core RCU.
> 
> 5.	Once each of the other CPUs report a quiescent state, the
> 	grace period can end, despite CPU 0 having preemption
> 	disabled the whole time.
> 
> So Oleg's test is correct, disabling preemption is not sufficient
> to block a preemptible-RCU grace period.

I stand corrected!

> The usual suggestion would be to add rcu_read_lock() just after the lock is 
> acquired and rcu_read_unlock() just before each release of that same lock.  

Will fix it that way.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15 10:30       ` David Vrabel
@ 2015-06-15 20:35         ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:35 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long


* David Vrabel <david.vrabel@citrix.com> wrote:

> On 15/06/15 10:05, Ian Campbell wrote:
> > On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:

> >> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
> >> suspend/restore. It's a stop-all method that needs to iterate through all 
> >> allocated pgds in the system to fix them up for Xen's use.
> >>
> >> This code uses pgd_list, probably because it was an easy interface.
> >>
> >> But we want to remove the pgd_list, so convert the code over to walk all 
> >> tasks in the system. This is an equivalent method.
> 
> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
> allocated page tables won't be correctly converted on suspend/resume and the new 
> process will die after resume.

So how should the Xen logic be fixed for the new scheme? I can't say I can see 
through the paravirt complexity here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-15 20:35         ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:35 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long


* David Vrabel <david.vrabel@citrix.com> wrote:

> On 15/06/15 10:05, Ian Campbell wrote:
> > On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:

> >> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
> >> suspend/restore. It's a stop-all method that needs to iterate through all 
> >> allocated pgds in the system to fix them up for Xen's use.
> >>
> >> This code uses pgd_list, probably because it was an easy interface.
> >>
> >> But we want to remove the pgd_list, so convert the code over to walk all 
> >> tasks in the system. This is an equivalent method.
> 
> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
> allocated page tables won't be correctly converted on suspend/resume and the new 
> process will die after resume.

So how should the Xen logic be fixed for the new scheme? I can't say I can see 
through the paravirt complexity here.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15 10:30       ` David Vrabel
  (?)
@ 2015-06-15 20:35       ` Ingo Molnar
  -1 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-06-15 20:35 UTC (permalink / raw)
  To: David Vrabel
  Cc: Waiman Long, Denys Vlasenko, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst, H. Peter Anvin, Oleg Nesterov, linux-kernel,
	Andy Lutomirski, linux-mm, Borislav Petkov, Ian Campbell,
	xen-devel, Boris Ostrovsky, Linus Torvalds, Andrew Morton


* David Vrabel <david.vrabel@citrix.com> wrote:

> On 15/06/15 10:05, Ian Campbell wrote:
> > On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:

> >> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
> >> suspend/restore. It's a stop-all method that needs to iterate through all 
> >> allocated pgds in the system to fix them up for Xen's use.
> >>
> >> This code uses pgd_list, probably because it was an easy interface.
> >>
> >> But we want to remove the pgd_list, so convert the code over to walk all 
> >> tasks in the system. This is an equivalent method.
> 
> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
> allocated page tables won't be correctly converted on suspend/resume and the new 
> process will die after resume.

So how should the Xen logic be fixed for the new scheme? I can't say I can see 
through the paravirt complexity here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
  2015-06-15 20:28             ` Ingo Molnar
@ 2015-06-15 20:48               ` Andy Lutomirski
  -1 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-06-15 20:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Oleg Nesterov, linux-kernel, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On Mon, Jun 15, 2015 at 1:28 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> > Oleg Nesterov <oleg@redhat.com> writes:
>> >>
>> >> But again, the kernel no longer does this? do_page_fault() does
>> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a
>> >> modular DIE_OOPS handler could try to resolve this problem and trigger
>> >> another fault.
>> >
>> > The same problem can happen from NMI handlers or machine check handlers. It's
>> > not necessarily tied to page faults only.
>>
>> AIUI, the point of the one and only vmalloc_sync_all call is to prevent
>> infinitely recursive faults when we call a notify_die callback.  The only thing
>> that it could realistically protect is module text or static non-per-cpu module
>> data, since that's the only thing that's reliably already in the init pgd.  I'm
>> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc
>> faults before it calls notify_die.
>
> Yes, but what I meant is that it can happen if due to an unrelated kernel bug and
> unlucky timing we have installed this new handler just when that other unrelated
> kernel bug triggers: say a #GPF crash in kernel code.

I still don't see the problem.

CPU A: crash and start executing do_page_fault

CPU B: register_die_notifier

CPU A: notify_die

now we get a vmalloc fault, fix it up, and return to do_page_fault and
print the oops.

>
> In any case it should all be mooted with the removal of lazy PGD instantiation.

Agreed.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: why do we need vmalloc_sync_all?
@ 2015-06-15 20:48               ` Andy Lutomirski
  0 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-06-15 20:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Oleg Nesterov, linux-kernel, linux-mm, Andrew Morton,
	Denys Vlasenko, Brian Gerst, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Waiman Long

On Mon, Jun 15, 2015 at 1:28 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Sun, Jun 14, 2015 at 7:47 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> > Oleg Nesterov <oleg@redhat.com> writes:
>> >>
>> >> But again, the kernel no longer does this? do_page_fault() does
>> >> vmalloc_fault() without notify_die(). If it fails, I do not see how/why a
>> >> modular DIE_OOPS handler could try to resolve this problem and trigger
>> >> another fault.
>> >
>> > The same problem can happen from NMI handlers or machine check handlers. It's
>> > not necessarily tied to page faults only.
>>
>> AIUI, the point of the one and only vmalloc_sync_all call is to prevent
>> infinitely recursive faults when we call a notify_die callback.  The only thing
>> that it could realistically protect is module text or static non-per-cpu module
>> data, since that's the only thing that's reliably already in the init pgd.  I'm
>> with Oleg: I don't see how that can happen, since do_page_fault fixes up vmalloc
>> faults before it calls notify_die.
>
> Yes, but what I meant is that it can happen if due to an unrelated kernel bug and
> unlucky timing we have installed this new handler just when that other unrelated
> kernel bug triggers: say a #GPF crash in kernel code.

I still don't see the problem.

CPU A: crash and start executing do_page_fault

CPU B: register_die_notifier

CPU A: notify_die

now we get a vmalloc fault, fix it up, and return to do_page_fault and
print the oops.

>
> In any case it should all be mooted with the removal of lazy PGD instantiation.

Agreed.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15 20:35         ` Ingo Molnar
@ 2015-06-16 14:15           ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On 15/06/15 21:35, Ingo Molnar wrote:
> 
> * David Vrabel <david.vrabel@citrix.com> wrote:
> 
>> On 15/06/15 10:05, Ian Campbell wrote:
>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> 
>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
>>>> suspend/restore. It's a stop-all method that needs to iterate through all 
>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>
>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>
>>>> But we want to remove the pgd_list, so convert the code over to walk all 
>>>> tasks in the system. This is an equivalent method.
>>
>> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
>> allocated page tables won't be correctly converted on suspend/resume and the new 
>> process will die after resume.
> 
> So how should the Xen logic be fixed for the new scheme? I can't say I can see 
> through the paravirt complexity here.

Actually, since we freeze_processes() before trying to pin page tables,
I think it should be ok as-is.

I'll put the patch through some tests.

David

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-16 14:15           ` David Vrabel
  0 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	linux-kernel, linux-mm, xen-devel, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds, Oleg Nesterov,
	Thomas Gleixner, Waiman Long

On 15/06/15 21:35, Ingo Molnar wrote:
> 
> * David Vrabel <david.vrabel@citrix.com> wrote:
> 
>> On 15/06/15 10:05, Ian Campbell wrote:
>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> 
>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
>>>> suspend/restore. It's a stop-all method that needs to iterate through all 
>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>
>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>
>>>> But we want to remove the pgd_list, so convert the code over to walk all 
>>>> tasks in the system. This is an equivalent method.
>>
>> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
>> allocated page tables won't be correctly converted on suspend/resume and the new 
>> process will die after resume.
> 
> So how should the Xen logic be fixed for the new scheme? I can't say I can see 
> through the paravirt complexity here.

Actually, since we freeze_processes() before trying to pin page tables,
I think it should be ok as-is.

I'll put the patch through some tests.

David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-15 20:35         ` Ingo Molnar
  (?)
@ 2015-06-16 14:15         ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Waiman Long, Denys Vlasenko, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst, H. Peter Anvin, Oleg Nesterov, linux-kernel,
	Andy Lutomirski, linux-mm, Borislav Petkov, Ian Campbell,
	xen-devel, Boris Ostrovsky, Linus Torvalds, Andrew Morton

On 15/06/15 21:35, Ingo Molnar wrote:
> 
> * David Vrabel <david.vrabel@citrix.com> wrote:
> 
>> On 15/06/15 10:05, Ian Campbell wrote:
>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
> 
>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance 
>>>> suspend/restore. It's a stop-all method that needs to iterate through all 
>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>
>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>
>>>> But we want to remove the pgd_list, so convert the code over to walk all 
>>>> tasks in the system. This is an equivalent method.
>>
>> It is not equivalent because pgd_alloc() now populates entries in pgds that are 
>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the 
>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly 
>> allocated page tables won't be correctly converted on suspend/resume and the new 
>> process will die after resume.
> 
> So how should the Xen logic be fixed for the new scheme? I can't say I can see 
> through the paravirt complexity here.

Actually, since we freeze_processes() before trying to pin page tables,
I think it should be ok as-is.

I'll put the patch through some tests.

David

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-16 14:15           ` David Vrabel
@ 2015-06-16 14:19             ` Boris Ostrovsky
  -1 siblings, 0 replies; 83+ messages in thread
From: Boris Ostrovsky @ 2015-06-16 14:19 UTC (permalink / raw)
  To: David Vrabel, Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	xen-devel, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On 06/16/2015 10:15 AM, David Vrabel wrote:
> On 15/06/15 21:35, Ingo Molnar wrote:
>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>
>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>>>>> suspend/restore. It's a stop-all method that needs to iterate through all
>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>
>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>
>>>>> But we want to remove the pgd_list, so convert the code over to walk all
>>>>> tasks in the system. This is an equivalent method.
>>> It is not equivalent because pgd_alloc() now populates entries in pgds that are
>>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly
>>> allocated page tables won't be correctly converted on suspend/resume and the new
>>> process will die after resume.
>> So how should the Xen logic be fixed for the new scheme? I can't say I can see
>> through the paravirt complexity here.
> Actually, since we freeze_processes() before trying to pin page tables,
> I think it should be ok as-is.
>
> I'll put the patch through some tests.

Actually, I just ran this through a couple of boot/suspend/resume tests 
and didn't see any issues (with the one fix I mentioned to Ingo 
earlier). On unstable Xen only.

-boris



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-16 14:19             ` Boris Ostrovsky
  0 siblings, 0 replies; 83+ messages in thread
From: Boris Ostrovsky @ 2015-06-16 14:19 UTC (permalink / raw)
  To: David Vrabel, Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	xen-devel, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On 06/16/2015 10:15 AM, David Vrabel wrote:
> On 15/06/15 21:35, Ingo Molnar wrote:
>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>
>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>>>>> suspend/restore. It's a stop-all method that needs to iterate through all
>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>
>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>
>>>>> But we want to remove the pgd_list, so convert the code over to walk all
>>>>> tasks in the system. This is an equivalent method.
>>> It is not equivalent because pgd_alloc() now populates entries in pgds that are
>>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly
>>> allocated page tables won't be correctly converted on suspend/resume and the new
>>> process will die after resume.
>> So how should the Xen logic be fixed for the new scheme? I can't say I can see
>> through the paravirt complexity here.
> Actually, since we freeze_processes() before trying to pin page tables,
> I think it should be ok as-is.
>
> I'll put the patch through some tests.

Actually, I just ran this through a couple of boot/suspend/resume tests 
and didn't see any issues (with the one fix I mentioned to Ingo 
earlier). On unstable Xen only.

-boris


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-16 14:15           ` David Vrabel
  (?)
@ 2015-06-16 14:19           ` Boris Ostrovsky
  -1 siblings, 0 replies; 83+ messages in thread
From: Boris Ostrovsky @ 2015-06-16 14:19 UTC (permalink / raw)
  To: David Vrabel, Ingo Molnar
  Cc: Waiman Long, Denys Vlasenko, Peter Zijlstra, Brian Gerst,
	H. Peter Anvin, Oleg Nesterov, linux-kernel, Andy Lutomirski,
	linux-mm, Borislav Petkov, Ian Campbell, xen-devel,
	Andrew Morton, Linus Torvalds, Thomas Gleixner

On 06/16/2015 10:15 AM, David Vrabel wrote:
> On 15/06/15 21:35, Ingo Molnar wrote:
>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>
>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest instance
>>>>> suspend/restore. It's a stop-all method that needs to iterate through all
>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>
>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>
>>>>> But we want to remove the pgd_list, so convert the code over to walk all
>>>>> tasks in the system. This is an equivalent method.
>>> It is not equivalent because pgd_alloc() now populates entries in pgds that are
>>> not visible to xen_mm_pin_all() (note how the original code adds the pgd to the
>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These newly
>>> allocated page tables won't be correctly converted on suspend/resume and the new
>>> process will die after resume.
>> So how should the Xen logic be fixed for the new scheme? I can't say I can see
>> through the paravirt complexity here.
> Actually, since we freeze_processes() before trying to pin page tables,
> I think it should be ok as-is.
>
> I'll put the patch through some tests.

Actually, I just ran this through a couple of boot/suspend/resume tests 
and didn't see any issues (with the one fix I mentioned to Ingo 
earlier). On unstable Xen only.

-boris

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-16 14:19             ` Boris Ostrovsky
@ 2015-06-16 14:27               ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:27 UTC (permalink / raw)
  To: Boris Ostrovsky, Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	xen-devel, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On 16/06/15 15:19, Boris Ostrovsky wrote:
> On 06/16/2015 10:15 AM, David Vrabel wrote:
>> On 15/06/15 21:35, Ingo Molnar wrote:
>>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>>
>>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest
>>>>>> instance
>>>>>> suspend/restore. It's a stop-all method that needs to iterate
>>>>>> through all
>>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>>
>>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>>
>>>>>> But we want to remove the pgd_list, so convert the code over to
>>>>>> walk all
>>>>>> tasks in the system. This is an equivalent method.
>>>> It is not equivalent because pgd_alloc() now populates entries in
>>>> pgds that are
>>>> not visible to xen_mm_pin_all() (note how the original code adds the
>>>> pgd to the
>>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These
>>>> newly
>>>> allocated page tables won't be correctly converted on suspend/resume
>>>> and the new
>>>> process will die after resume.
>>> So how should the Xen logic be fixed for the new scheme? I can't say
>>> I can see
>>> through the paravirt complexity here.
>> Actually, since we freeze_processes() before trying to pin page tables,
>> I think it should be ok as-is.
>>
>> I'll put the patch through some tests.
> 
> Actually, I just ran this through a couple of boot/suspend/resume tests
> and didn't see any issues (with the one fix I mentioned to Ingo
> earlier). On unstable Xen only.

In which case this can have a:

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

Thanks.

David

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
@ 2015-06-16 14:27               ` David Vrabel
  0 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:27 UTC (permalink / raw)
  To: Boris Ostrovsky, Ingo Molnar
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	xen-devel, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

On 16/06/15 15:19, Boris Ostrovsky wrote:
> On 06/16/2015 10:15 AM, David Vrabel wrote:
>> On 15/06/15 21:35, Ingo Molnar wrote:
>>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>>
>>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest
>>>>>> instance
>>>>>> suspend/restore. It's a stop-all method that needs to iterate
>>>>>> through all
>>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>>
>>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>>
>>>>>> But we want to remove the pgd_list, so convert the code over to
>>>>>> walk all
>>>>>> tasks in the system. This is an equivalent method.
>>>> It is not equivalent because pgd_alloc() now populates entries in
>>>> pgds that are
>>>> not visible to xen_mm_pin_all() (note how the original code adds the
>>>> pgd to the
>>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These
>>>> newly
>>>> allocated page tables won't be correctly converted on suspend/resume
>>>> and the new
>>>> process will die after resume.
>>> So how should the Xen logic be fixed for the new scheme? I can't say
>>> I can see
>>> through the paravirt complexity here.
>> Actually, since we freeze_processes() before trying to pin page tables,
>> I think it should be ok as-is.
>>
>> I'll put the patch through some tests.
> 
> Actually, I just ran this through a couple of boot/suspend/resume tests
> and didn't see any issues (with the one fix I mentioned to Ingo
> earlier). On unstable Xen only.

In which case this can have a:

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

Thanks.

David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-16 14:19             ` Boris Ostrovsky
  (?)
@ 2015-06-16 14:27             ` David Vrabel
  -1 siblings, 0 replies; 83+ messages in thread
From: David Vrabel @ 2015-06-16 14:27 UTC (permalink / raw)
  To: Boris Ostrovsky, Ingo Molnar
  Cc: Waiman Long, Denys Vlasenko, Peter Zijlstra, Brian Gerst,
	H. Peter Anvin, Oleg Nesterov, linux-kernel, Andy Lutomirski,
	linux-mm, Borislav Petkov, Ian Campbell, xen-devel,
	Andrew Morton, Linus Torvalds, Thomas Gleixner

On 16/06/15 15:19, Boris Ostrovsky wrote:
> On 06/16/2015 10:15 AM, David Vrabel wrote:
>> On 15/06/15 21:35, Ingo Molnar wrote:
>>> * David Vrabel <david.vrabel@citrix.com> wrote:
>>>
>>>> On 15/06/15 10:05, Ian Campbell wrote:
>>>>> On Sat, 2015-06-13 at 11:49 +0200, Ingo Molnar wrote:
>>>>>> xen_mm_pin_all()/unpin_all() are used to implement full guest
>>>>>> instance
>>>>>> suspend/restore. It's a stop-all method that needs to iterate
>>>>>> through all
>>>>>> allocated pgds in the system to fix them up for Xen's use.
>>>>>>
>>>>>> This code uses pgd_list, probably because it was an easy interface.
>>>>>>
>>>>>> But we want to remove the pgd_list, so convert the code over to
>>>>>> walk all
>>>>>> tasks in the system. This is an equivalent method.
>>>> It is not equivalent because pgd_alloc() now populates entries in
>>>> pgds that are
>>>> not visible to xen_mm_pin_all() (note how the original code adds the
>>>> pgd to the
>>>> pgd_list in pgd_ctor() before calling pgd_prepopulate_pmd()).  These
>>>> newly
>>>> allocated page tables won't be correctly converted on suspend/resume
>>>> and the new
>>>> process will die after resume.
>>> So how should the Xen logic be fixed for the new scheme? I can't say
>>> I can see
>>> through the paravirt complexity here.
>> Actually, since we freeze_processes() before trying to pin page tables,
>> I think it should be ok as-is.
>>
>> I'll put the patch through some tests.
> 
> Actually, I just ran this through a couple of boot/suspend/resume tests
> and didn't see any issues (with the one fix I mentioned to Ingo
> earlier). On unstable Xen only.

In which case this can have a:

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

Thanks.

David

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-13  7:26               ` Ingo Molnar
@ 2015-06-13 18:00                 ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-13 18:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/13, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > So we could add tsk->mm_leader or so,
> >
> > No, no, please do not. Just do something like
> >
> > 	for_each_process(p) {
> >
> > 		for_each_thread(p, t) {
>
> So far that's what the for_each_process_thread() iterations I added do, right?

Not really,

> > 			if (t->mm) {
> > 				do_something(t->mm);
> > 				break;
                                ^^^^^

Note this "break". We stop the inner loop right after we find a thread "t"
with ->mm != NULL. In the likely case t == p (group leader) so the inner
loop stops on the 1st iteration.


> > But either way I don't understand what protects this ->mm. Perhaps this needs
> > find_lock_task_mm().
>
> That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

Well, in this particular case we are safe. As Boris explained this is called
from stop_machine(). But sync_global_pgds() is not safe.

> find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as
> task can only go away via RCU expiry, and all the iterations I added are (supposed
> to) be RCU safe.

Sure, you can do lock/unlock by hand. But find_lock_task_mm() can simplify
the code because it checks subthreads if group_leader->mm == NULL. You can
simply do

	rcu_read_lock();
	for_each_process(p) {
		t = find_lock_task_mm(p);
		if (!t)
			continue;

		do_something(t->mm);
		task_unlock(t);
	}
	rcu_read_unlock();

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:38             ` Oleg Nesterov
  2015-06-12 20:53               ` Oleg Nesterov
@ 2015-06-13  7:26               ` Ingo Molnar
  2015-06-13 18:00                 ` Oleg Nesterov
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-06-13  7:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra


* Oleg Nesterov <oleg@redhat.com> wrote:

> > So we could add tsk->mm_leader or so,
> 
> No, no, please do not. Just do something like
> 
> 	for_each_process(p) {
> 
> 		for_each_thread(p, t) {

So far that's what the for_each_process_thread() iterations I added do, right?

> 			if (t->mm) {
> 				do_something(t->mm);
> 				break;
> 			}
> 		}
> 	}
> 
> But either way I don't understand what protects this ->mm. Perhaps this needs 
> find_lock_task_mm().

That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm.

find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as a 
task can only go away via RCU expiry, and all the iterations I added are (supposed 
to) be RCU safe.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 22:30                 ` Boris Ostrovsky
@ 2015-06-12 23:36                   ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-12 23:36 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Ingo Molnar, Linus Torvalds, Waiman Long, Thomas Gleixner,
	Denys Vlasenko, Borislav Petkov, Andrew Morton, Andy Lutomirski,
	linux-mml, Linux Kernel Mailing List, Brian Gerst,
	H. Peter Anvin, Peter Zijlstra

On 06/12, Boris Ostrovsky wrote:
>
> On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
>>>
>>> 	for_each_process(p) {
>>>
>>> 		for_each_thread(p, t) {
>>> 			if (t->mm) {
>>> 				do_something(t->mm);
>>> 				break;
>>> 			}
>>> 		}
>>> 	}
>>>
>>> But either way I don't understand what protects this ->mm. Perhaps this needs
>>> find_lock_task_mm().
>>
>> And, I don't understand this code, probably this doesn't matter, but.
>>
>> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
>> and miss the new child. Is it OK?
>
>
> Currently xen_mm_pin_all() is only called in the suspend path, out of
> stop_machine(), so presumably at that time fork is not possible.

OK, thanks, this also means that this code shouldn't worry about ->mm,
it should be stable.

But for_each_process() in sync_global_pgds() should, afaics.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:53               ` Oleg Nesterov
@ 2015-06-12 22:30                 ` Boris Ostrovsky
  2015-06-12 23:36                   ` Oleg Nesterov
  0 siblings, 1 reply; 83+ messages in thread
From: Boris Ostrovsky @ 2015-06-12 22:30 UTC (permalink / raw)
  To: Oleg Nesterov, Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12/2015 04:53 PM, Oleg Nesterov wrote:
> On 06/12, Oleg Nesterov wrote:
>>
>> On 06/12, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> So I think the only issue is that ->mm can become NULL when the thread group
>>>> leader dies - a non-NULL mm should always be shared among all threads.
>>>
>>> Indeed, we do that in exit_mm().
>>
>> Yes,
>>
>>> So we could add tsk->mm_leader or so,
>>
>> No, no, please do not. Just do something like
>>
>> 	for_each_process(p) {
>>
>> 		for_each_thread(p, t) {
>> 			if (t->mm) {
>> 				do_something(t->mm);
>> 				break;
>> 			}
>> 		}
>> 	}
>>
>> But either way I don't understand what protects this ->mm. Perhaps this needs
>> find_lock_task_mm().
>
> And, I don't understand this code, probably this doesn't matter, but.
>
> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
> and miss the new child. Is it OK?


Currently xen_mm_pin_all() is only called in the suspend path, out of 
stop_machine(), so presumably at that time fork is not possible.


-boris

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12 20:38             ` Oleg Nesterov
@ 2015-06-12 20:53               ` Oleg Nesterov
  2015-06-12 22:30                 ` Boris Ostrovsky
  2015-06-13  7:26               ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-12 20:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12, Oleg Nesterov wrote:
>
> On 06/12, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > So I think the only issue is that ->mm can become NULL when the thread group
> > > leader dies - a non-NULL mm should always be shared among all threads.
> >
> > Indeed, we do that in exit_mm().
>
> Yes,
>
> > So we could add tsk->mm_leader or so,
>
> No, no, please do not. Just do something like
>
> 	for_each_process(p) {
>
> 		for_each_thread(p, t) {
> 			if (t->mm) {
> 				do_something(t->mm);
> 				break;
> 			}
> 		}
> 	}
>
> But either way I don't understand what protects this ->mm. Perhaps this needs
> find_lock_task_mm().

And, I don't understand this code, probably this doesn't matter, but.

unpin_all() is probably fine, but xen_mm_pin_all() can race with fork()
and miss the new child. Is it OK?

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-12  8:04           ` Ingo Molnar
@ 2015-06-12 20:38             ` Oleg Nesterov
  2015-06-12 20:53               ` Oleg Nesterov
  2015-06-13  7:26               ` Ingo Molnar
  0 siblings, 2 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-06-12 20:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Waiman Long, Thomas Gleixner, Denys Vlasenko,
	Borislav Petkov, Andrew Morton, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra

On 06/12, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So I think the only issue is that ->mm can become NULL when the thread group
> > leader dies - a non-NULL mm should always be shared among all threads.
>
> Indeed, we do that in exit_mm().

Yes,

> So we could add tsk->mm_leader or so,

No, no, please do not. Just do something like

	for_each_process(p) {

		for_each_thread(p, t) {
			if (t->mm) {
				do_something(t->mm);
				break;
			}
		}
	}

But either way I don't understand what protects this ->mm. Perhaps this needs
find_lock_task_mm().

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
       [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
@ 2015-06-12  8:04           ` Ingo Molnar
  2015-06-12 20:38             ` Oleg Nesterov
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-06-12  8:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Thomas Gleixner, Denys Vlasenko, Borislav Petkov,
	Andrew Morton, Oleg Nesterov, Andy Lutomirski, linux-mml,
	Linux Kernel Mailing List, Brian Gerst, H. Peter Anvin,
	Peter Zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Jun 12, 2015 00:23, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> > We might make it so: but that would mean restricting certain clone_flags 
> > variants - not sure that's possible with our current ABI usage?
> 
> We already do that. You can't share signal info unless you share the mm. And a 
> shared signal state is what defines a thread group.
> 
> So I think the only issue is that ->mm can become NULL when the thread group 
> leader dies - a non-NULL mm should always be shared among all threads.

Indeed, we do that in exit_mm().

So we could add tsk->mm_leader or so, which does not get cleared and which the 
scheduler does not look at, but I'm not sure it's entirely safe that way: we don't 
have a refcount, and when the last thread exits it becomes bogus for a small 
window until the zombie leader is unlinked from the task list.

To close that race we'd have __mmdrop() or so clear out tsk->mm_leader - but the 
task doing the mmdrop() might be a lazy thread totally unrelated to the original 
thread group so we don't know which tsk->mm_leader to clear out.

To solve that we'd have to track the leader owning an MM in mm_struct - which gets 
interesting for the exec() case where the thread group gets a new leader, so we'd 
have to re-link the mm's leader pointer there.

So unless I missed some simpler solution there a good number of steps where this 
could go wrong, in small looking race windows - how about we just live with 
iterating through all tasks instead of just all processes, once per 512 GB of 
memory mapped?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 20:41     ` Linus Torvalds
@ 2015-06-12  7:23       ` Ingo Molnar
       [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-06-12  7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Don't use "for_each_process_thread(g, p)". You only care about each mm, and 
> > threads all share the same mm, so just do
> >
> >     for_each_process(p)
> >
> > instead of iterating over all threads too.
> 
> Hmm. I may be wrong. It strikes me that one of the group leaders might have 
> exited but the subthreads live on. We'd see p->mm being NULL, even though the mm 
> was originally in use.
> 
> Ugh. So maybe the code really does need to iterate over all threads.

Yeah, for_each_process() is indeed no guarantee that we iterate over all mm's.

We might make it so: but that would mean restricting certain clone_flags variants 
- not sure that's possible with our current ABI usage?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 20:37   ` Linus Torvalds
@ 2015-06-11 20:41     ` Linus Torvalds
  2015-06-12  7:23       ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2015-06-11 20:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long

On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Don't use "for_each_process_thread(g, p)". You only care about each
> mm, and threads all share the same mm, so just do
>
>     for_each_process(p)
>
> instead of iterating over all threads too.

Hmm. I may be wrong. It strikes me that one of the group leaders might
have exited but the subthreads live on. We'd see p->mm being NULL,
even though the mm was originally in use.

Ugh. So maybe the code really does need to iterate over all threads.

                  Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
@ 2015-06-11 20:37   ` Linus Torvalds
  2015-06-11 20:41     ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2015-06-11 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mml, Andy Lutomirski,
	Andrew Morton, Denys Vlasenko, Brian Gerst, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Thomas Gleixner,
	Waiman Long

On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> But we want to remove the pgd_list, so convert the code over to walk
> all tasks in the system. This is an equivalent method.

This is not at all equivalent, and it looks stupid.

Don't use "for_each_process_thread(g, p)". You only care about each
mm, and threads all share the same mm, so just do

    for_each_process(p)

instead of iterating over all threads too.

No?

               Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code
  2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
@ 2015-06-11 14:07 ` Ingo Molnar
  2015-06-11 20:37   ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-06-11 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mml, Andy Lutomirski, Andrew Morton, Denys Vlasenko,
	Brian Gerst, Peter Zijlstra, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, Oleg Nesterov, Thomas Gleixner, Waiman Long

xen_mm_pin_all()/unpin_all() are used to implement full guest instance
suspend/restore. It's a stop-all method that needs to iterate through
all allocated pgds in the system to fix them up for Xen's use.

This code uses pgd_list, probably because it was an easy interface.

But we want to remove the pgd_list, so convert the code over to walk
all tasks in the system. This is an equivalent method.

(As I don't use Xen this is was only build tested.)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2045b0..87a8354435f8 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -853,18 +853,29 @@ static void xen_pgd_pin(struct mm_struct *mm)
  */
 void xen_mm_pin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
+	rcu_read_lock();
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process_thread(g, p) {
+		struct page *page;
+		pgd_t *pgd;
+
+		if (!p->mm)
+			continue;
+
+		pgd = p->mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (!PagePinned(page)) {
-			__xen_pgd_pin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_pin(&init_mm, pgd);
 			SetPageSavePinned(page);
 		}
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -967,19 +978,30 @@ static void xen_pgd_unpin(struct mm_struct *mm)
  */
 void xen_mm_unpin_all(void)
 {
-	struct page *page;
+	struct task_struct *g, *p;
 
+	rcu_read_lock();
 	spin_lock(&pgd_lock);
 
-	list_for_each_entry(page, &pgd_list, lru) {
+	for_each_process_thread(g, p) {
+		struct page *page;
+		pgd_t *pgd;
+
+		if (!p->mm)
+			continue;
+
+		pgd = p->mm->pgd;
+		page = virt_to_page(pgd);
+
 		if (PageSavePinned(page)) {
 			BUG_ON(!PagePinned(page));
-			__xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page));
+			__xen_pgd_unpin(&init_mm, pgd);
 			ClearPageSavePinned(page);
 		}
 	}
 
 	spin_unlock(&pgd_lock);
+	rcu_read_unlock();
 }
 
 static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2015-06-16 14:33 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-13  9:49 [PATCH 00/12, v2] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-06-13  9:49 ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13 19:24   ` Oleg Nesterov
2015-06-13 19:24     ` Oleg Nesterov
2015-06-14  7:36     ` Ingo Molnar
2015-06-14  7:36       ` Ingo Molnar
2015-06-14 19:24       ` Oleg Nesterov
2015-06-14 19:24         ` Oleg Nesterov
2015-06-14 19:38         ` Oleg Nesterov
2015-06-14 19:38           ` Oleg Nesterov
2015-06-15  0:40           ` Paul E. McKenney
2015-06-15  0:40             ` Paul E. McKenney
2015-06-15 20:33             ` Ingo Molnar
2015-06-15 20:33               ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-14  8:26   ` Ingo Molnar
2015-06-14  8:26     ` Ingo Molnar
2015-06-15  9:05   ` Ian Campbell
2015-06-15  9:05   ` Ian Campbell
2015-06-15  9:05     ` Ian Campbell
2015-06-15 10:30     ` David Vrabel
2015-06-15 10:30       ` David Vrabel
2015-06-15 20:35       ` Ingo Molnar
2015-06-15 20:35       ` Ingo Molnar
2015-06-15 20:35         ` Ingo Molnar
2015-06-16 14:15         ` David Vrabel
2015-06-16 14:15         ` David Vrabel
2015-06-16 14:15           ` David Vrabel
2015-06-16 14:19           ` Boris Ostrovsky
2015-06-16 14:19           ` Boris Ostrovsky
2015-06-16 14:19             ` Boris Ostrovsky
2015-06-16 14:27             ` David Vrabel
2015-06-16 14:27             ` David Vrabel
2015-06-16 14:27               ` David Vrabel
2015-06-15 10:30     ` David Vrabel
2015-06-13  9:49 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 11/12] x86/mm: Remove pgd_list leftovers Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 12/12] x86/mm: Simplify pgd_alloc() Ingo Molnar
2015-06-13  9:49   ` Ingo Molnar
2015-06-13 18:58 ` why do we need vmalloc_sync_all? Oleg Nesterov
2015-06-13 18:58   ` Oleg Nesterov
2015-06-14  7:59   ` Ingo Molnar
2015-06-14  7:59     ` Ingo Molnar
2015-06-14 20:06     ` Oleg Nesterov
2015-06-14 20:06       ` Oleg Nesterov
2015-06-15  2:47       ` Andi Kleen
2015-06-15  2:47         ` Andi Kleen
2015-06-15  2:57         ` Andy Lutomirski
2015-06-15  2:57           ` Andy Lutomirski
2015-06-15 20:28           ` Ingo Molnar
2015-06-15 20:28             ` Ingo Molnar
2015-06-15 20:48             ` Andy Lutomirski
2015-06-15 20:48               ` Andy Lutomirski
  -- strict thread matches above, loose matches on Subject: below --
2015-06-11 14:07 [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-06-11 14:07 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
2015-06-11 20:37   ` Linus Torvalds
2015-06-11 20:41     ` Linus Torvalds
2015-06-12  7:23       ` Ingo Molnar
     [not found]         ` <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
2015-06-12  8:04           ` Ingo Molnar
2015-06-12 20:38             ` Oleg Nesterov
2015-06-12 20:53               ` Oleg Nesterov
2015-06-12 22:30                 ` Boris Ostrovsky
2015-06-12 23:36                   ` Oleg Nesterov
2015-06-13  7:26               ` Ingo Molnar
2015-06-13 18:00                 ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.