linux-mm.kvack.org archive mirror
* [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines
@ 2013-12-03  8:51 Mel Gorman
  2013-12-03  8:51 ` [PATCH 01/15] mm: numa: Do not batch handle PMD pages Mel Gorman
                   ` (13 more replies)
  0 siblings, 14 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

Alex Thorlton reported segmentation faults when NUMA balancing is enabled
on large machines. There is no obvious explanation from the console output
of what the problem is, so this series is based on code review.

The series is against 3.12. In the event it addresses the problem, the
patches will need to be forward-ported and retested. The series is not
against the latest mainline, as changes to PTE scan rates there may mask
the bug.

 arch/x86/mm/gup.c       |  13 ++++
 include/linux/hugetlb.h |  10 ++--
 include/linux/migrate.h |  27 ++++++++-
 kernel/sched/fair.c     |  16 ++++-
 mm/huge_memory.c        |  58 ++++++++++++++----
 mm/hugetlb.c            |  51 ++++++----------
 mm/memory.c             |  93 ++---------------------------
 mm/mempolicy.c          |   2 +
 mm/migrate.c            | 155 ++++++++++++++++++++++++++++++++++++++++++++----
 mm/mprotect.c           |  50 ++++------------
 mm/pgtable-generic.c    |   3 +
 mm/swap.c               | 143 +++++++++++++++++++++++++-------------------
 12 files changed, 371 insertions(+), 250 deletions(-)

-- 
1.8.4


* [PATCH 01/15] mm: numa: Do not batch handle PMD pages
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 02/15] mm: hugetlbfs: fix hugetlbfs optimization Mel Gorman
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

With the THP migration races closed it is still possible to occasionally
see corruption. The problem is related to handling PMD pages in batch.
When a page fault is handled it can be assumed that the page being
faulted will also be flushed from the TLB. The same flushing does not
happen when handling PMD pages in batch. Fixing this is straightforward,
but there are a number of reasons not to:

1. Multiple TLB flushes may have to be sent depending on what pages get
   migrated.
2. Handling PMDs in batch means that faults get accounted to the task
   that is handling the fault. While care is taken to only mark PMDs
   where the last CPU and PID match, it can still have problems due to
   PID truncation when matching PIDs (see the sketch after this list).
3. Batching at the PMD level may reduce faults, but setting pmd_numa
   requires taking a heavy lock that can contend with THP migration,
   and handling the fault requires the release/acquisition of the PTL
   for every page migrated. It's still pretty heavy.
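
To illustrate the PID truncation in point 2, here is a minimal userspace
sketch; the field width is an assumption for illustration, not the width
the kernel uses:

  /*
   * Two distinct PIDs collide once truncated to a small bit field, so a
   * "last PID matches" check can produce a false positive.
   */
  #include <stdio.h>

  #define PID_BITS 8                      /* assumed width, illustration only */
  #define PID_MASK ((1 << PID_BITS) - 1)

  static int truncate_pid(int pid)
  {
          return pid & PID_MASK;
  }

  int main(void)
  {
          int a = 1234;
          int b = a + (1 << PID_BITS);    /* a different task entirely */

          printf("pid %d -> %d, pid %d -> %d, match=%d\n",
                 a, truncate_pid(a), b, truncate_pid(b),
                 truncate_pid(a) == truncate_pid(b));
          return 0;
  }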

PMD batch handling has never been something people were happy with. This
patch removes it; later patches will deal with the additional fault
overhead using more intelligent migration rate adaptation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/memory.c   | 91 ++---------------------------------------------------------
 mm/mprotect.c | 40 ++------------------------
 2 files changed, 5 insertions(+), 126 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d176154..f453384 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3586,93 +3586,6 @@ out:
 	return 0;
 }
 
-/* NUMA hinting page fault entry point for regular pmds */
-#ifdef CONFIG_NUMA_BALANCING
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	pte_t *pte, *orig_pte;
-	unsigned long _addr = addr & PMD_MASK;
-	unsigned long offset;
-	spinlock_t *ptl;
-	bool numa = false;
-
-	spin_lock(&mm->page_table_lock);
-	pmd = *pmdp;
-	if (pmd_numa(pmd)) {
-		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
-		numa = true;
-	}
-	spin_unlock(&mm->page_table_lock);
-
-	if (!numa)
-		return 0;
-
-	/* we're in a page fault so some vma must be in the range */
-	BUG_ON(!vma);
-	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
-	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
-	VM_BUG_ON(offset >= PMD_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
-	pte += offset >> PAGE_SHIFT;
-	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
-		pte_t pteval = *pte;
-		struct page *page;
-		int page_nid = -1;
-		int target_nid;
-		bool migrated = false;
-
-		if (!pte_present(pteval))
-			continue;
-		if (!pte_numa(pteval))
-			continue;
-		if (addr >= vma->vm_end) {
-			vma = find_vma(mm, addr);
-			/* there's a pte present so there must be a vma */
-			BUG_ON(!vma);
-			BUG_ON(addr < vma->vm_start);
-		}
-		if (pte_numa(pteval)) {
-			pteval = pte_mknonnuma(pteval);
-			set_pte_at(mm, addr, pte, pteval);
-		}
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page))
-			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
-
-		page_nid = page_to_nid(page);
-		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-		pte_unmap_unlock(pte, ptl);
-		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
-			if (migrated)
-				page_nid = target_nid;
-		} else {
-			put_page(page);
-		}
-
-		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
-
-		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-	}
-	pte_unmap_unlock(orig_pte, ptl);
-
-	return 0;
-}
-#else
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	BUG();
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3811,8 +3724,8 @@ retry:
 		}
 	}
 
-	if (pmd_numa(*pmd))
-		return do_pmd_numa_page(mm, vma, address, pmd);
+	/* THP should already have been handled */
+	BUG_ON(pmd_numa(*pmd));
 
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 412ba2b..18f1117 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,12 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
-	int last_nid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -63,12 +61,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int this_nid = page_to_nid(page);
-					if (last_nid == -1)
-						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
-
 					/* only check non-shared pages */
 					if (!pte_numa(oldpte) &&
 					    page_mapcount(page) == 1) {
@@ -111,26 +103,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
 	return pages;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	spin_lock(&mm->page_table_lock);
-	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-	spin_unlock(&mm->page_table_lock);
-}
-#else
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	BUG();
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -138,7 +113,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -156,16 +130,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
-
-		/*
-		 * If we are changing protections for NUMA hinting faults then
-		 * set pmd_numa if the examined pages were all on the same
-		 * node. This allows a regular PMD to be handled as one fault
-		 * and effectively batches the taking of the PTL
-		 */
-		if (prot_numa && all_same_node)
-			change_pmd_protnuma(vma->vm_mm, addr, pmd);
+				 dirty_accountable, prot_numa);
+
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
-- 
1.8.4


* [PATCH 02/15] mm: hugetlbfs: fix hugetlbfs optimization
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
  2013-12-03  8:51 ` [PATCH 01/15] mm: numa: Do not batch handle PMD pages Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page Mel Gorman
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream.

Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
caused by THP") can cause dereference of a dangling pointer if
split_huge_page runs during PageHuge() if there are updates to the
tail_page->private field.

Also it is repeating compound_head twice for hugetlbfs and it is running
compound_head+compound_trans_head for THP when a single one is needed in
both cases.

The new code within the PageSlab() check doesn't need to verify that the
THP page size is never bigger than the smallest hugetlbfs page size, to
avoid memory corruption.

A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_unlock label, that is
relevant for the compound_lock path too).

By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the below problem:

  echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

  BUG: Bad page state in process bash  pfn:59a01
  page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
  page flags: 0x1c00000000008000(tail)
  Modules linked in:
  CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/hugetlb.h |   6 ++
 mm/hugetlb.c            |  17 ++++++
 mm/swap.c               | 143 ++++++++++++++++++++++++++++--------------------
 3 files changed, 106 insertions(+), 60 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0393270..6125579 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,6 +31,7 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
+int PageHeadHuge(struct page *page_head);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -104,6 +105,11 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
+static inline int PageHeadHuge(struct page *page_head)
+{
+	return 0;
+}
+
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0b7656e..f0a4ca4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -736,6 +736,23 @@ int PageHuge(struct page *page)
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
+/*
+ * PageHeadHuge() only returns true for hugetlbfs head page, but not for
+ * normal or transparent huge pages.
+ */
+int PageHeadHuge(struct page *page_head)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageHead(page_head))
+		return 0;
+
+	dtor = get_compound_page_dtor(page_head);
+
+	return dtor == free_huge_page;
+}
+EXPORT_SYMBOL_GPL(PageHeadHuge);
+
 pgoff_t __basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);
diff --git a/mm/swap.c b/mm/swap.c
index 759c3ca..0c8f7a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -82,19 +82,6 @@ static void __put_compound_page(struct page *page)
 
 static void put_compound_page(struct page *page)
 {
-	/*
-	 * hugetlbfs pages cannot be split from under us.  If this is a
-	 * hugetlbfs page, check refcount on head page and release the page if
-	 * the refcount becomes zero.
-	 */
-	if (PageHuge(page)) {
-		page = compound_head(page);
-		if (put_page_testzero(page))
-			__put_compound_page(page);
-
-		return;
-	}
-
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = compound_trans_head(page);
@@ -111,14 +98,31 @@ static void put_compound_page(struct page *page)
 			 * still hot on arches that do not support
 			 * this_cpu_cmpxchg_double().
 			 */
-			if (PageSlab(page_head)) {
-				if (PageTail(page)) {
+			if (PageSlab(page_head) || PageHeadHuge(page_head)) {
+				if (likely(PageTail(page))) {
+					/*
+					 * __split_huge_page_refcount
+					 * cannot race here.
+					 */
+					VM_BUG_ON(!PageHead(page_head));
+					atomic_dec(&page->_mapcount);
 					if (put_page_testzero(page_head))
 						VM_BUG_ON(1);
-
-					atomic_dec(&page->_mapcount);
-					goto skip_lock_tail;
+					if (put_page_testzero(page_head))
+						__put_compound_page(page_head);
+					return;
 				} else
+					/*
+					 * __split_huge_page_refcount
+					 * run before us, "page" was a
+					 * THP tail. The split
+					 * page_head has been freed
+					 * and reallocated as slab or
+					 * hugetlbfs page of smaller
+					 * order (only possible if
+					 * reallocated as slab on
+					 * x86).
+					 */
 					goto skip_lock;
 			}
 			/*
@@ -132,8 +136,27 @@ static void put_compound_page(struct page *page)
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 skip_lock:
-				if (put_page_testzero(page_head))
-					__put_single_page(page_head);
+				if (put_page_testzero(page_head)) {
+					/*
+					 * The head page may have been
+					 * freed and reallocated as a
+					 * compound page of smaller
+					 * order and then freed again.
+					 * All we know is that it
+					 * cannot have become: a THP
+					 * page, a compound page of
+					 * higher order, a tail page.
+					 * That is because we still
+					 * hold the refcount of the
+					 * split THP tail and
+					 * page_head was the THP head
+					 * before the split.
+					 */
+					if (PageHead(page_head))
+						__put_compound_page(page_head);
+					else
+						__put_single_page(page_head);
+				}
 out_put_single:
 				if (put_page_testzero(page))
 					__put_single_page(page);
@@ -155,7 +178,6 @@ out_put_single:
 			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 
-skip_lock_tail:
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
 					__put_compound_page(page_head);
@@ -198,51 +220,52 @@ bool __get_page_tail(struct page *page)
 	 * proper PT lock that already serializes against
 	 * split_huge_page().
 	 */
+	unsigned long flags;
 	bool got = false;
-	struct page *page_head;
-
-	/*
-	 * If this is a hugetlbfs page it cannot be split under us.  Simply
-	 * increment refcount for the head page.
-	 */
-	if (PageHuge(page)) {
-		page_head = compound_head(page);
-		atomic_inc(&page_head->_count);
-		got = true;
-	} else {
-		unsigned long flags;
+	struct page *page_head = compound_trans_head(page);
 
-		page_head = compound_trans_head(page);
-		if (likely(page != page_head &&
-					get_page_unless_zero(page_head))) {
-
-			/* Ref to put_compound_page() comment. */
-			if (PageSlab(page_head)) {
-				if (likely(PageTail(page))) {
-					__get_page_tail_foll(page, false);
-					return true;
-				} else {
-					put_page(page_head);
-					return false;
-				}
-			}
-
-			/*
-			 * page_head wasn't a dangling pointer but it
-			 * may not be a head page anymore by the time
-			 * we obtain the lock. That is ok as long as it
-			 * can't be freed from under us.
-			 */
-			flags = compound_lock_irqsave(page_head);
-			/* here __split_huge_page_refcount won't run anymore */
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/* Ref to put_compound_page() comment. */
+		if (PageSlab(page_head) || PageHeadHuge(page_head)) {
 			if (likely(PageTail(page))) {
+				/*
+				 * This is a hugetlbfs page or a slab
+				 * page. __split_huge_page_refcount
+				 * cannot race here.
+				 */
+				VM_BUG_ON(!PageHead(page_head));
 				__get_page_tail_foll(page, false);
-				got = true;
-			}
-			compound_unlock_irqrestore(page_head, flags);
-			if (unlikely(!got))
+				return true;
+			} else {
+				/*
+				 * __split_huge_page_refcount run
+				 * before us, "page" was a THP
+				 * tail. The split page_head has been
+				 * freed and reallocated as slab or
+				 * hugetlbfs page of smaller order
+				 * (only possible if reallocated as
+				 * slab on x86).
+				 */
 				put_page(page_head);
+				return false;
+			}
+		}
+
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = true;
 		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
 	}
 	return got;
 }
-- 
1.8.4


* [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
  2013-12-03  8:51 ` [PATCH 01/15] mm: numa: Do not batch handle PMD pages Mel Gorman
  2013-12-03  8:51 ` [PATCH 02/15] mm: hugetlbfs: fix hugetlbfs optimization Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-04 16:59   ` Alex Thorlton
  2013-12-03  8:51 ` [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

From: Dave Hansen <dave.hansen@linux.intel.com>

commit 30b0a105d9f7141e4cbf72ae5511832457d89788 upstream.

Right now, the migration code in migrate_page_copy() uses copy_huge_page()
for hugetlbfs and thp pages:

       if (PageHuge(page) || PageTransHuge(page))
                copy_huge_page(newpage, page);

So, yay for code reuse.  But:

  void copy_huge_page(struct page *dst, struct page *src)
  {
        struct hstate *h = page_hstate(src);

and a non-hugetlbfs page has no page_hstate().  This works 99% of the
time because page_hstate() determines the hstate from the page order
alone.  Since the page order of a THP page matches the default hugetlbfs
page order, it works.

But, if you change the default huge page size on the boot command-line
(say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
so page_hstate() returns null and copy_huge_page() oopses pretty fast
since copy_huge_page() dereferences the hstate:

  void copy_huge_page(struct page *dst, struct page *src)
  {
        struct hstate *h = page_hstate(src);
        if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
  ...
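
For reference, a boot configuration along the following lines registers
only a 1GB hstate, leaving no hstate that matches the 2MB THP order, so
page_hstate() returns NULL (the parameter values are an assumed example):

  default_hugepagesz=1G hugepagesz=1G hugepages=4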

Mel noticed that the migration code is really the only user of these
functions.  This moves all the copy code over to migrate.c and makes
copy_huge_page() work for THP by checking for it explicitly.

I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
THP migration for the NUMA working set scanning fault case")

[akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/hugetlb.h |  4 ----
 mm/hugetlb.c            | 34 ----------------------------------
 mm/migrate.c            | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6125579..4694afc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -70,7 +70,6 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 bool is_hugepage_active(struct page *page);
-void copy_huge_page(struct page *dst, struct page *src);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
@@ -146,9 +145,6 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 #define isolate_huge_page(p, l) false
 #define putback_active_hugepage(p)	do {} while (0)
 #define is_hugepage_active(x)	false
-static inline void copy_huge_page(struct page *dst, struct page *src)
-{
-}
 
 static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f0a4ca4..0defeb6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -476,40 +476,6 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return 0;
 }
 
-static void copy_gigantic_page(struct page *dst, struct page *src)
-{
-	int i;
-	struct hstate *h = page_hstate(src);
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-
-	for (i = 0; i < pages_per_huge_page(h); ) {
-		cond_resched();
-		copy_highpage(dst, src);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-
-void copy_huge_page(struct page *dst, struct page *src)
-{
-	int i;
-	struct hstate *h = page_hstate(src);
-
-	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		cond_resched();
-		copy_highpage(dst + i, src + i);
-	}
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index c046927..fbcac8b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -441,6 +441,54 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page.  We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+				int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages; ) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+static void copy_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	int nr_pages;
+
+	if (PageHuge(src)) {
+		/* hugetlbfs page */
+		struct hstate *h = page_hstate(src);
+		nr_pages = pages_per_huge_page(h);
+
+		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+			__copy_gigantic_page(dst, src, nr_pages);
+			return;
+		}
+	} else {
+		/* thp page */
+		BUG_ON(!PageTransHuge(src));
+		nr_pages = hpage_nr_pages(src);
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
+/*
  * Copy the page to its new location
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
-- 
1.8.4


* [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (2 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03 23:07   ` Rik van Riel
  2013-12-03  8:51 ` [PATCH 05/15] mm: numa: Call MMU notifiers on " Mel Gorman
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

Base pages are unmapped and flushed from cache and TLB during normal page
migration and replaced with a migration entry that causes any parallel
fault or gup to block until migration completes. THP does not unmap pages
due to a lack of support for migration entries at a PMD level. This allows
races with get_user_pages and get_user_pages_fast, which commit 3f926ab94
("mm: Close races between THP migration and PMD numa clearing") made worse
by introducing a pmdp_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to go
through the slow get_user_page path, where it will serialise against THP
migration and properly account for the NUMA hinting fault. On the migration
side the page table lock is taken for each PTE update.
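
As a rough userspace model of the approach (not the kernel code: the flag
and mutex below are stand-ins for pmd_numa and the page table lock), the
fast path simply refuses entries that need serialisation and the caller
falls back to a path that takes the lock:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
  static bool entry_needs_hinting = true;   /* stand-in for pmd_numa() */

  /* Lockless fast path: bail out on entries that must be serialised. */
  static int lookup_fast(void)
  {
          if (entry_needs_hinting)
                  return 0;                 /* caller retries via slow path */
          return 1;
  }

  /* Slow path: takes the lock, so it cannot race with a parallel update. */
  static int lookup_slow(void)
  {
          pthread_mutex_lock(&table_lock);
          entry_needs_hinting = false;      /* e.g. handle the hinting fault */
          pthread_mutex_unlock(&table_lock);
          return 1;
  }

  int main(void)
  {
          if (!lookup_fast())
                  lookup_slow();
          printf("entry handled via slow path\n");
          return 0;
  }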

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/gup.c | 13 +++++++++++++
 mm/huge_memory.c  | 26 +++++++++++++++++---------
 mm/migrate.c      | 38 +++++++++++++++++++++++++++++++-------
 3 files changed, 61 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..0596e8e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -83,6 +83,12 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		pte_t pte = gup_get_pte(ptep);
 		struct page *page;
 
+		/* Similar to the PMD case, NUMA hinting must take slow path */
+		if (pte_numa(pte)) {
+			pte_unmap(ptep);
+			return 0;
+		}
+
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
@@ -167,6 +173,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
 		} else {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..203b5bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1240,6 +1240,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto out;
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!PageHead(page));
 	if (flags & FOLL_TOUCH) {
@@ -1306,26 +1310,30 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
 			goto clear_pmdnuma;
+	}
 
-		/*
-		 * Otherwise wait for potential migrations and retry. We do
-		 * relock and check_same as the page may no longer be mapped.
-		 * As the fault is being retried, do not account for it.
-		 */
+	/*
+	 * If there are potential migrations, wait for completion and retry. We
+	 * do not relock and check_same as the page may no longer be mapped.
+	 * Furtermore, even if the page is currently misplaced, there is no
+	 * guarantee it is still misplaced after the migration completes.
+	 */
+	if (!page_locked) {
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
 		page_nid = -1;
 		goto out;
 	}
 
-	/* Page is misplaced, serialise migrations and parallel THP splits */
+	/*
+	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
+	 * to serialises splits
+	 */
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked)
-		lock_page(page);
 	anon_vma = page_lock_anon_vma_read(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PTE did not change while unlocked */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index fbcac8b..c4743d6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1709,6 +1709,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	pmd_t orig_entry;
 
 	/*
 	 * Don't migrate pages that are mapped in multiple processes.
@@ -1750,7 +1751,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Recheck the target PMD */
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, entry))) {
+	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
+fail_putback:
 		spin_unlock(&mm->page_table_lock);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1780,16 +1782,34 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 */
 	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
+	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
-	entry = pmd_mknonnuma(entry);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
+	/*
+	 * Clear the old entry under pagetable lock and establish the new PTE.
+	 * Any parallel GUP will either observe the old page blocking on the
+	 * page lock, block on the page table lock or observe the new page.
+	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
+	 * guarantee the copy is visible before the pagetable update.
+	 */
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
-	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
+
+	if (page_count(page) != 2) {
+		set_pmd_at(mm, mmun_start, pmd, orig_entry);
+		flush_tlb_range(vma, mmun_start, mmun_end);
+		update_mmu_cache_pmd(vma, address, &entry);
+		page_remove_rmap(new_page);
+		goto fail_putback;
+	}
+
 	page_remove_rmap(page);
+
 	/*
 	 * Finish the charge transaction under the page table lock to
 	 * prevent split_huge_page() from dividing up the charge
@@ -1814,9 +1834,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
-	entry = pmd_mknonnuma(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(*pmd, entry)) {
+		entry = pmd_mknonnuma(entry);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache_pmd(vma, address, &entry);
+	}
+	spin_unlock(&mm->page_table_lock);
 
 	unlock_page(page);
 	put_page(page);
-- 
1.8.4


* [PATCH 05/15] mm: numa: Call MMU notifiers on THP migration
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (3 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 06/15] mm: Clear pmd_numa before invalidating Mel Gorman
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

MMU notifiers must be called on THP page migration or secondary MMUs will
get very confused.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c4743d6..3a87511 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
 #include <linux/balloon_compaction.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -1703,12 +1704,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 				unsigned long address,
 				struct page *page, int node)
 {
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	unsigned long mmun_start = address & HPAGE_PMD_MASK;
+	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 	pmd_t orig_entry;
 
 	/*
@@ -1750,10 +1752,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(&mm->page_table_lock);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1794,10 +1798,11 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-	page_add_new_anon_rmap(new_page, vma, haddr);
-	pmdp_clear_flush(vma, haddr, pmd);
-	set_pmd_at(mm, haddr, pmd, entry);
+	flush_cache_range(vma, mmun_start, mmun_end);
+	page_add_new_anon_rmap(new_page, vma, mmun_start);
+	pmdp_clear_flush(vma, mmun_start, pmd);
+	set_pmd_at(mm, mmun_start, pmd, entry);
+	flush_tlb_range(vma, mmun_start, mmun_end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
@@ -1817,6 +1822,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	unlock_page(new_page);
 	unlock_page(page);
@@ -1837,7 +1843,7 @@ out_dropref:
 	spin_lock(&mm->page_table_lock);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_mknonnuma(entry);
-		set_pmd_at(mm, haddr, pmd, entry);
+		set_pmd_at(mm, mmun_start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(&mm->page_table_lock);
-- 
1.8.4


* [PATCH 06/15] mm: Clear pmd_numa before invalidating
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (4 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 05/15] mm: numa: Call MMU notifiers on " Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 07/15] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

pmdp_invalidate clears the present bit without taking into account that
_PAGE_NUMA may be set, leaving the PMD in an unexpected state. Clear
pmd_numa before invalidating.
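
A rough userspace model of the inconsistency (the bit values are invented
for illustration and are not the real x86 page table bits): clearing only
the present bit leaves the NUMA bit behind, so the entry is neither a
normal present PMD nor a cleanly not-present one.

  #include <stdio.h>

  /* Illustrative flag values only. */
  #define F_PRESENT 0x1
  #define F_NUMA    0x2

  static unsigned long mknotpresent(unsigned long e)
  {
          return e & ~F_PRESENT;
  }

  static unsigned long mknonnuma(unsigned long e)
  {
          return (e & ~F_NUMA) | F_PRESENT;
  }

  int main(void)
  {
          /* NUMA hinting entry: NUMA bit set, present bit clear */
          unsigned long pmd = F_NUMA;

          /* Invalidating without clearing NUMA leaves the NUMA bit behind. */
          printf("naive invalidate:   %#lx\n", mknotpresent(pmd));
          /* Clearing pmd_numa first gives a conventional not-present entry. */
          printf("numa cleared first: %#lx\n", mknotpresent(mknonnuma(pmd)));
          return 0;
  }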

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/pgtable-generic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 3929a40..a7aefd3 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -191,6 +191,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
+	pmd_t entry = *pmdp;
+	if (pmd_numa(entry))
+		entry = pmd_mknonnuma(entry);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }
-- 
1.8.4


* [PATCH 07/15] mm: numa: Do not clear PMD during PTE update scan
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (5 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 06/15] mm: Clear pmd_numa before invalidating Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 08/15] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

If the PMD is flushed then a parallel fault in handle_mm_fault() will
enter the pmd_none and do_huge_pmd_anonymous_page() path, where it will
attempt to insert a huge zero page. This is wasteful, so the patch avoids
clearing the PMD when setting pmd_numa.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 203b5bc..d6c3bf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1474,20 +1474,24 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
-		entry = pmdp_get_and_clear(mm, addr, pmd);
+
 		if (!prot_numa) {
+			entry = pmdp_get_and_clear(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
 			BUG_ON(pmd_write(entry));
+			set_pmd_at(mm, addr, pmd, entry);
 		} else {
 			struct page *page = pmd_page(*pmd);
+			entry = *pmd;
 
 			/* only check non-shared pages */
 			if (page_mapcount(page) == 1 &&
 			    !pmd_numa(*pmd)) {
 				entry = pmd_mknuma(entry);
+				set_pmd_at(mm, addr, pmd, entry);
 			}
 		}
-		set_pmd_at(mm, addr, pmd, entry);
+
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
 	}
-- 
1.8.4


* [PATCH 08/15] mm: numa: Do not clear PTE for pte_numa update
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (6 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 07/15] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 09/15] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

The TLB must be flushed if the PTE is updated, but change_pte_range is
clearing the PTE while marking PTEs pte_numa without necessarily flushing
the TLB if it reinserts the same entry. Without the flush, it is conceivable
that two processors hold different TLB entries for the same virtual address
and, at the very least, it would generate spurious faults. This patch only
unmaps the pages in change_pte_range for a full protection change.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 18f1117..c53e332 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -52,13 +52,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 			bool updated = false;
 
-			ptent = ptep_modify_prot_start(mm, addr, pte);
 			if (!prot_numa) {
+				ptent = ptep_modify_prot_start(mm, addr, pte);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
 				struct page *page;
 
+				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
 					/* only check non-shared pages */
@@ -81,7 +82,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (updated)
 				pages++;
-			ptep_modify_prot_commit(mm, addr, pte, ptent);
+
+			/* Only !prot_numa always clears the pte */
+			if (!prot_numa)
+				ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
-- 
1.8.4


* [PATCH 09/15] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (7 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 08/15] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 10/15] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

The anon_vma lock prevents parallel THP splits and any associated complexity
that arises when handling splits during THP migration. This patch checks
if the lock was successfully acquired and bails from THP migration if it
failed for any reason.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6c3bf4..98b6a79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1342,6 +1342,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
+	/* Bail if we fail to protect against THP splits for any reason */
+	if (unlikely(!anon_vma)) {
+		put_page(page);
+		page_nid = -1;
+		goto clear_pmdnuma;
+	}
+
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
-- 
1.8.4


* [PATCH 10/15] mm: numa: Avoid unnecessary work on the failure path
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (8 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 09/15] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 11/15] sched: numa: Skip inaccessible VMAs Mel Gorman
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

If a PMD changes during a THP migration then migration aborts but the
failure path is doing more work than is necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 3a87511..e429206 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1774,7 +1774,8 @@ fail_putback:
 		putback_lru_page(page);
 		mod_zone_page_state(page_zone(page),
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-		goto out_fail;
+
+		goto out_unlock;
 	}
 
 	/*
@@ -1848,6 +1849,7 @@ out_dropref:
 	}
 	spin_unlock(&mm->page_table_lock);
 
+out_unlock:
 	unlock_page(page);
 	put_page(page);
 	return 0;
-- 
1.8.4


* [PATCH 11/15] sched: numa: Skip inaccessible VMAs
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (9 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 10/15] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:51 ` [PATCH 12/15] Clear numa on mprotect Mel Gorman
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

Inaccessible VMAs should not be trapping NUMA hinting faults. Skip them.
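
For reference, an inaccessible VMA of the kind this check now skips is what
userspace gets from mmap() with PROT_NONE; a minimal sketch:

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 4UL << 20; /* 4MB, large enough not to be skipped as small */
          /* PROT_NONE: the VMA has none of VM_READ, VM_WRITE or VM_EXEC set */
          void *p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          printf("inaccessible VMA at %p, length %zu\n", p, len);
          munmap(p, len);
          return 0;
  }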

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..40d8ea3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -970,6 +970,13 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
+		/*
+		 * Skip inaccessible VMAs to avoid any confusion between
+		 * PROT_NONE and NUMA hinting ptes
+		 */
+		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+			continue;
+
 		/* Skip small VMAs. They are not likely to be of relevance */
 		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
 			continue;
-- 
1.8.4


* [PATCH 12/15] Clear numa on mprotect
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (10 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 11/15] sched: numa: Skip inaccessible VMAs Mel Gorman
@ 2013-12-03  8:51 ` Mel Gorman
  2013-12-03  8:52 ` [PATCH 13/15] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
  2013-12-03  8:52 ` [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update Mel Gorman
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:51 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

---
 mm/huge_memory.c | 2 ++
 mm/mprotect.c    | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 98b6a79..fa277fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,6 +1484,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 		if (!prot_numa) {
 			entry = pmdp_get_and_clear(mm, addr, pmd);
+			if (pmd_numa(entry))
+				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
 			BUG_ON(pmd_write(entry));
 			set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c53e332..510f138 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -54,6 +54,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (!prot_numa) {
 				ptent = ptep_modify_prot_start(mm, addr, pte);
+				if (pte_numa(ptent))
+					ptent = pte_mknonnuma(ptent);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
-- 
1.8.4


* [PATCH 13/15] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (11 preceding siblings ...)
  2013-12-03  8:51 ` [PATCH 12/15] Clear numa on mprotect Mel Gorman
@ 2013-12-03  8:52 ` Mel Gorman
  2013-12-03  8:52 ` [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update Mel Gorman
  13 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:52 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

do_huge_pmd_numa_page() handles the case where there is a parallel THP
migration. However, by the time the check is made the NUMA hinting
information has already been disrupted. This patch adds an earlier check,
with some helpers. It reuses one of the helpers to warn if a huge pmd copy
takes place in parallel with THP migration, as that potentially leads to
corruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h | 10 +++++++++-
 mm/huge_memory.c        | 22 ++++++++++++++++------
 mm/migrate.c            | 17 +++++++++++++++++
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..804651c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,10 +90,18 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
+extern bool pmd_trans_migrating(pmd_t pmd);
+extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
+static inline bool pmd_trans_migrating(pmd_t pmd)
+{
+	return false;
+}
+static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+}
 static inline int migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa277fa..4c7abd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -884,6 +884,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		ret = 0;
 		goto out_unlock;
 	}
+
+	/* mmap_sem prevents this happening but warn if that changes */
+	WARN_ON(pmd_trans_migrating(pmd));
+
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(&src_mm->page_table_lock);
@@ -1294,6 +1298,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
+	/*
+	 * If there are potential migrations, wait for completion and retry
+	 * without disrupting NUMA hinting information. Do not relock and
+	 * check_same as the page may no longer be mapped.
+	 */
+	if (unlikely(pmd_trans_migrating(*pmdp))) {
+		spin_unlock(&mm->page_table_lock);
+		wait_migrate_huge_page(vma->anon_vma, pmdp);
+		goto out;
+	}
+
 	page = pmd_page(pmd);
 	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
@@ -1312,12 +1327,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto clear_pmdnuma;
 	}
 
-	/*
-	 * If there are potential migrations, wait for completion and retry. We
-	 * do not relock and check_same as the page may no longer be mapped.
-	 * Furtermore, even if the page is currently misplaced, there is no
-	 * guarantee it is still misplaced after the migration completes.
-	 */
+	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index e429206..5dfd552 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1645,6 +1645,23 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
+bool pmd_trans_migrating(pmd_t pmd) {
+	struct page *page = pmd_page(pmd);
+	return PageLocked(page);
+}
+
+void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+	struct page *page = pmd_page(*pmd);
+	if (get_page_unless_zero(page)) {
+		wait_on_page_locked(page);
+		put_page(page);
+	}
+
+	/* Guarantee that the newly migrated PTE is visible */
+	smp_rmb();
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
                   ` (12 preceding siblings ...)
  2013-12-03  8:52 ` [PATCH 13/15] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
@ 2013-12-03  8:52 ` Mel Gorman
  2013-12-03 23:07   ` Rik van Riel
  13 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-03  8:52 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML, Mel Gorman

NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
hinting faults do not flush the TLB as X86 at least does not cache TLB
entries for !present PTEs. However, in the event that the two race, a NUMA
hinting fault may return with the TLB in an inconsistent state between
different processors. This patch detects potential for races between the
NUMA PTE scanner and fault handler and will flush the TLB for the affected
range if there is a race.
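
In condensed form, and ignoring the differences between the base page and
huge page paths, the check the fault handlers gain looks roughly like this
(a simplified sketch of the code below, not a drop-in snippet):

	/* Snapshot the scanner's position before touching the PTE/PMD */
	unsigned long scan_seq = numa_fault_prepare(mm);

	/* ... handle the hinting fault, possibly migrating the page ... */

	/*
	 * If the scanner moved on while the fault was in flight, a PTE
	 * update with a deferred TLB flush may have raced with this fault,
	 * so flush the affected range before returning to userspace.
	 */
	numa_fault_commit(mm, vma, addr, nr_pages, scan_seq);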

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h | 17 +++++++++++++++++
 kernel/sched/fair.c     |  3 +++
 mm/huge_memory.c        |  5 +++++
 mm/memory.c             |  6 ++++++
 mm/migrate.c            | 33 +++++++++++++++++++++++++++++++++
 5 files changed, 64 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 804651c..28aa613 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -94,6 +94,11 @@ extern bool pmd_trans_migrating(pmd_t pmd);
 extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
+extern unsigned long numa_fault_prepare(struct mm_struct *mm);
+extern void numa_fault_commit(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				unsigned long start_addr, int nr_pages,
+				unsigned long seq);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
@@ -110,6 +115,18 @@ static inline bool migrate_ratelimited(int node)
 {
 	return false;
 }
+static inline unsigned long numa_fault_prepare(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void numa_fault_commit(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				unsigned long start_addr, int nr_pages,
+				unsigned long seq)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40d8ea3..af1a710 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -959,6 +959,9 @@ void task_numa_work(struct callback_head *work)
 	if (!pages)
 		return;
 
+	/* Paired with numa_fault_prepare */
+	smp_wmb();
+
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c7abd7..84f9907 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,6 +1293,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid;
 	bool page_locked;
 	bool migrated = false;
+	unsigned long scan_seq;
+
+	scan_seq = numa_fault_prepare(mm);
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1387,6 +1390,8 @@ out:
 	if (page_nid != -1)
 		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
 
+	numa_fault_commit(mm, vma, haddr, HPAGE_PMD_NR, scan_seq);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index f453384..6db850f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3540,6 +3540,9 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
+	unsigned long scan_seq;
+
+	scan_seq = numa_fault_prepare(mm);
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3583,6 +3586,9 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 out:
 	if (page_nid != -1)
 		task_numa_fault(page_nid, 1, migrated);
+
+	numa_fault_commit(mm, vma, addr, 1, scan_seq);
+
 	return 0;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 5dfd552..ccc814b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
 	smp_rmb();
 }
 
+unsigned long numa_fault_prepare(struct mm_struct *mm)
+{
+	/* Paired with task_numa_work */
+	smp_rmb();
+	return mm->numa_next_reset;
+}
+
+/* Flush the TLB if there was a race with the NUMA pte scan update */
+void numa_fault_commit(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			unsigned long start_addr, int nr_pages,
+			unsigned long seq)
+{
+	unsigned long current_seq;
+
+	/* Paired with task_numa_work */
+	smp_rmb();
+	current_seq = mm->numa_next_reset;
+
+	if (current_seq == seq)
+		return;
+
+	/*
+	 * Raced with NUMA pte scan update which may be deferring a flush.
+	 * Flush now to avoid CPUs having an inconsistent view
+	 */
+	if (nr_pages == 1)
+		flush_tlb_page(vma, start_addr);
+	else
+		flush_tlb_range(vma, start_addr,
+					start_addr + (nr_pages << PAGE_SHIFT));
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-03  8:52 ` [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update Mel Gorman
@ 2013-12-03 23:07   ` Rik van Riel
  2013-12-03 23:46     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-03 23:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML

On 12/03/2013 03:52 AM, Mel Gorman wrote:
> NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
> setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
> hinting faults do not flush the TLB as X86 at least does not cache TLB
> entries for !present PTEs. However, in the event that the two race a NUMA
> hinting fault may return with the TLB in an inconsistent state between
> different processors. This patch detects potential for races between the
> NUMA PTE scanner and fault handler and will flush the TLB for the affected
> range if there is a race.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5dfd552..ccc814b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
>  	smp_rmb();
>  }
>  
> +unsigned long numa_fault_prepare(struct mm_struct *mm)
> +{
> +	/* Paired with task_numa_work */
> +	smp_rmb();
> +	return mm->numa_next_reset;
> +}

The patch that introduces mm->numa_next_reset, and the
patch that increments it, seem to be missing from your
series...

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-03  8:51 ` [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
@ 2013-12-03 23:07   ` Rik van Riel
  2013-12-03 23:54     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-03 23:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML

On 12/03/2013 03:51 AM, Mel Gorman wrote:

> +
> +	if (page_count(page) != 2) {
> +		set_pmd_at(mm, mmun_start, pmd, orig_entry);
> +		flush_tlb_range(vma, mmun_start, mmun_end);

The mmun_start and mmun_end variables are introduced in patch 5.

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-03 23:07   ` Rik van Riel
@ 2013-12-03 23:46     ` Mel Gorman
  2013-12-04 14:33       ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-03 23:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alex Thorlton, Linux-MM, LKML

On Tue, Dec 03, 2013 at 06:07:06PM -0500, Rik van Riel wrote:
> On 12/03/2013 03:52 AM, Mel Gorman wrote:
> > NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
> > setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
> > hinting faults do not flush the TLB as X86 at least does not cache TLB
> > entries for !present PTEs. However, in the event that the two race a NUMA
> > hinting fault may return with the TLB in an inconsistent state between
> > different processors. This patch detects potential for races between the
> > NUMA PTE scanner and fault handler and will flush the TLB for the affected
> > range if there is a race.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5dfd552..ccc814b 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
> >  	smp_rmb();
> >  }
> >  
> > +unsigned long numa_fault_prepare(struct mm_struct *mm)
> > +{
> > +	/* Paired with task_numa_work */
> > +	smp_rmb();
> > +	return mm->numa_next_reset;
> > +}
> 
> The patch that introduces mm->numa_next_reset, and the
> patch that increments it, seem to be missing from your
> series...
> 

Damn. s/numa_next_reset/numa_next_scan/ in that patch

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-03 23:07   ` Rik van Riel
@ 2013-12-03 23:54     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-03 23:54 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alex Thorlton, Linux-MM, LKML

On Tue, Dec 03, 2013 at 06:07:51PM -0500, Rik van Riel wrote:
> On 12/03/2013 03:51 AM, Mel Gorman wrote:
> 
> > +
> > +	if (page_count(page) != 2) {
> > +		set_pmd_at(mm, mmun_start, pmd, orig_entry);
> > +		flush_tlb_range(vma, mmun_start, mmun_end);
> 
> The mmun_start and mmun_end variables are introduced in patch 5.
> 

Thanks, fixed now.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-03 23:46     ` Mel Gorman
@ 2013-12-04 14:33       ` Rik van Riel
  2013-12-04 16:07         ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-04 14:33 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML

On 12/03/2013 06:46 PM, Mel Gorman wrote:
> On Tue, Dec 03, 2013 at 06:07:06PM -0500, Rik van Riel wrote:
>> On 12/03/2013 03:52 AM, Mel Gorman wrote:
>>> NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
>>> setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
>>> hinting faults do not flush the TLB as X86 at least does not cache TLB
>>> entries for !present PTEs. However, in the event that the two race a NUMA
>>> hinting fault may return with the TLB in an inconsistent state between
>>> different processors. This patch detects potential for races between the
>>> NUMA PTE scanner and fault handler and will flush the TLB for the affected
>>> range if there is a race.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 5dfd552..ccc814b 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
>>>  	smp_rmb();
>>>  }
>>>  
>>> +unsigned long numa_fault_prepare(struct mm_struct *mm)
>>> +{
>>> +	/* Paired with task_numa_work */
>>> +	smp_rmb();
>>> +	return mm->numa_next_reset;
>>> +}
>>
>> The patch that introduces mm->numa_next_reset, and the
>> patch that increments it, seem to be missing from your
>> series...
>>
> 
> Damn. s/numa_next_reset/numa_next_scan/ in that patch

How does that protect against the race?

Would it not be possible for task_numa_work to have a longer
runtime than the numa fault?

In other words, task_numa_work can increment numa_next_scan
before the numa fault starts, and still be doing its thing
when numa_fault_commit is run...

At that point, numa_fault_commit will not be seeing an
increment in numa_next_scan, and we are relying completely
on the batched tlb flush by the change_prot_numa.

Is that scenario a problem, or is it ok?

And, why? :)


-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-04 14:33       ` Rik van Riel
@ 2013-12-04 16:07         ` Mel Gorman
  2013-12-05 15:40           ` Rik van Riel
  2013-12-06 19:13           ` [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range Rik van Riel
  0 siblings, 2 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-04 16:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alex Thorlton, Linux-MM, LKML

On Wed, Dec 04, 2013 at 09:33:53AM -0500, Rik van Riel wrote:
> On 12/03/2013 06:46 PM, Mel Gorman wrote:
> > On Tue, Dec 03, 2013 at 06:07:06PM -0500, Rik van Riel wrote:
> >> On 12/03/2013 03:52 AM, Mel Gorman wrote:
> >>> NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
> >>> setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
> >>> hinting faults do not flush the TLB as X86 at least does not cache TLB
> >>> entries for !present PTEs. However, in the event that the two race a NUMA
> >>> hinting fault may return with the TLB in an inconsistent state between
> >>> different processors. This patch detects potential for races between the
> >>> NUMA PTE scanner and fault handler and will flush the TLB for the affected
> >>> range if there is a race.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >>
> >>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>> index 5dfd552..ccc814b 100644
> >>> --- a/mm/migrate.c
> >>> +++ b/mm/migrate.c
> >>> @@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
> >>>  	smp_rmb();
> >>>  }
> >>>  
> >>> +unsigned long numa_fault_prepare(struct mm_struct *mm)
> >>> +{
> >>> +	/* Paired with task_numa_work */
> >>> +	smp_rmb();
> >>> +	return mm->numa_next_reset;
> >>> +}
> >>
> >> The patch that introduces mm->numa_next_reset, and the
> >> patch that increments it, seem to be missing from your
> >> series...
> >>
> > 
> > Damn. s/numa_next_reset/numa_next_scan/ in that patch
> 
> How does that protect against the race?
> 

It's the local processor's TLB I was primarily thinking about, and the case
in particular is where the fault has cleared the pmd_numa bit and the scanner
sets it again before the fault completes, without any flush.

> Would it not be possible for task_numa_work to have a longer
> runtime than the numa fault?
> 

Yes.

> In other words, task_numa_work can increment numa_next_scan
> before the numa fault starts, and still be doing its thing
> when numa_fault_commit is run...
> 

a) the PTE was previously pte_numa, scanner ignores it, fault traps and
   clears it with no flush or TLB consistency due to the page being
   inaccessible before

b) the PTE was previously !pte_numa, scanner will set it
   o Reference is first? No trap
   o Reference is after the scanner goes by. If there is a fault trap,
     it means the local TLB has seen the protection change and is
     consistent. numa_next_scan will not appear to change and a further
     flush should be unnecessary as the page was previously inaccessible

c) PTE was previous pte_numa, fault starts, clears pmd, but scanner
   resets it before the fault returns. In this case, a change in
   numa_next_scan will be observed and the fault will flush the TLB before
   returning. It does mean that that particular page gets flushed twice
   but TLB of the scanner and faulting processor will be consistent on
   return from fault. The faulting CPU will probably fault again due to
   the pte being marked numa.

It was the third situation I was concerned with -- a NUMA fault returning
with pmd_numa still set and the TLBs of different processors having different
views. Due to a potential migration copy, the data may be in the TLB but
now inconsistent with the scanner. What's less clear is how the CPU reacts
in this case or if it's even defined. The architectural manual is vague
on what happens if there is access to a PTE just after a protection change
but before a TLB flush. If it was a race against mprotect and the process
segfaulted, it would be considered a buggy application.

> At that point, numa_fault_commit will not be seeing an
> increment in numa_next_scan, and we are relying completely
> on the batched tlb flush by the change_prot_numa.
> 
> Is that scenario a problem, or is it ok?
> 

I think the TLB is always in a consistent state after the patch even
though additional faults are possible in the event of races.

> And, why? :)
> 

Because I found it impossible to segfault processes under any level of
scanning and numa hinting fault stress after it was applied

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page
  2013-12-03  8:51 ` [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page Mel Gorman
@ 2013-12-04 16:59   ` Alex Thorlton
  2013-12-05 13:35     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Thorlton @ 2013-12-04 16:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, Linux-MM, LKML

> -void copy_huge_page(struct page *dst, struct page *src)
> -{
> -	int i;
> -	struct hstate *h = page_hstate(src);
> -
> -	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {

With CONFIG_HUGETLB_PAGE=n, the kernel fails to build, throwing this
error:

mm/migrate.c: In function ‘copy_huge_page’:
mm/migrate.c:473: error: implicit declaration of function ‘page_hstate’

I got it to build by sticking the following into hugetlb.h:

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4694afc..fd76912 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -403,6 +403,7 @@ struct hstate {};
 #define hstate_sizelog(s) NULL
 #define hstate_vma(v) NULL
 #define hstate_inode(i) NULL
+#define page_hstate(p) NULL
 #define huge_page_size(h) PAGE_SIZE
 #define huge_page_mask(h) PAGE_MASK
 #define vma_kernel_pagesize(v) PAGE_SIZE

I figure that the #define I stuck in isn't actually solving the real
problem, but it got things working again.
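
For what it's worth, a variant that keeps type checking would be a static
inline stub instead of the #define (equally untested, and it still doesn't
answer why copy_huge_page() gets built in this configuration):

static inline struct hstate *page_hstate(struct page *page)
{
	return NULL;
}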

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page
  2013-12-04 16:59   ` Alex Thorlton
@ 2013-12-05 13:35     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-05 13:35 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, Linux-MM, LKML

On Wed, Dec 04, 2013 at 10:59:18AM -0600, Alex Thorlton wrote:
> > -void copy_huge_page(struct page *dst, struct page *src)
> > -{
> > -	int i;
> > -	struct hstate *h = page_hstate(src);
> > -
> > -	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
> 
> With CONFIG_HUGETLB_PAGE=n, the kernel fails to build, throwing this
> error:
> 
> mm/migrate.c: In function ‘copy_huge_page’:
> mm/migrate.c:473: error: implicit declaration of function ‘page_hstate’
> 
> I got it to build by sticking the following into hugetlb.h:
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4694afc..fd76912 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -403,6 +403,7 @@ struct hstate {};
>  #define hstate_sizelog(s) NULL
>  #define hstate_vma(v) NULL
>  #define hstate_inode(i) NULL
> +#define page_hstate(p) NULL
>  #define huge_page_size(h) PAGE_SIZE
>  #define huge_page_mask(h) PAGE_MASK
>  #define vma_kernel_pagesize(v) PAGE_SIZE
> 
> I figure that the #define I stuck in isn't actually solving the real
> problem, but it got things working again.
> 

It's based on an upstream patch so I'll check if the problem is there as
well and backport accordingly. This patch to unblock yourself is fine
for now.

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-04 16:07         ` Mel Gorman
@ 2013-12-05 15:40           ` Rik van Riel
  2013-12-05 19:54             ` Mel Gorman
  2013-12-06 19:13           ` [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range Rik van Riel
  1 sibling, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-05 15:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML, hhuang

On Wed, 4 Dec 2013 16:07:41 +0000
Mel Gorman <mgorman@suse.de> wrote:

> Because I found it impossible to segfault processes under any level of
> scanning and numa hinting fault stress after it was applied
 
I think I still managed to trigger the bug, by setting numa page
scanning to ludicrous speed, and running two large specjbb2005
processes on a 4 node system in an infinite loop :)

I believe the reason is your patch flushes the TLB too late,
after the page contents have been migrated over to the new
page.

The changelog below should explain how the race works, and
how this patch supposedly fixes it. If it doesn't, let me
know and I'll go back to the drawing board :)

---8<---

Subject: mm,numa: fix memory-corrupting race between THP NUMA unmap and migrate

There is a subtle race between THP NUMA migration, and the NUMA
unmapping code.

The NUMA unmapping code does a permission change on pages, which
is done with a batched (deferred) TLB flush. This is normally safe,
because the pages stay in the same place, and having other CPUs
continue to access them until the TLB flush is indistinguishable
from having other CPUs do those same accesses before the PTE
permission change.

The THP NUMA migration code normally does not do a remote TLB flush,
because the PTE is marked inaccessible, meaning no other CPUs should
have cached TLB entries that allow them to access the memory.

However, the following race is possible:

CPU A			CPU B			CPU C

						load TLB entry
make entry PMD_NUMA
			fault on entry
						write to page
			start migrating page
						write to page
			change PMD to new page
flush TLB
						reload TLB from new entry
						lose data

The obvious fix is to flush remote TLB entries from the numa
migrate code on CPU B, while CPU A is making PTE changes, and
has the TLB flush batched up for later.

The migration for 4kB pages is currently fine, because it calls
mk_ptenonnuma before migrating the page, which causes the migration
code to always do a remote TLB flush.  We should probably optimize
that at some point...

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |  3 +++
 kernel/sched/core.c      |  1 +
 kernel/sched/fair.c      |  4 ++++
 mm/huge_memory.c         | 10 ++++++++++
 4 files changed, 18 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 261ff4a..fa67ddb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -427,6 +427,9 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
+
+	/* task_numa_work is unmapping pages, with deferred TLB flush */
+	bool numa_tlb_lazy;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f14335..fe80455 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1732,6 +1732,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 		p->mm->numa_scan_seq = 0;
+		p->mm->numa_tlb_lazy = false;
 	}
 
 	if (clone_flags & CLONE_VM)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ec4afb..c9440f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1722,7 +1722,11 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
+			wmb(); /* with do_huge_pmd_numa_page */
+			mm->numa_tlb_lazy = true;
 			nr_pte_updates += change_prot_numa(vma, start, end);
+			wmb(); /* with do_huge_pmd_numa_page */
+			mm->numa_tlb_lazy = false;
 
 			/*
 			 * Scan sysctl_numa_balancing_scan_size but ensure that
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d68066f..3a03370 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1385,6 +1385,16 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
+	 * Another CPU is currently turning ptes of this process into
+	 * NUMA ptes. That permission change batches the TLB flush,
+	 * so other CPUs may still have valid TLB entries pointing to
+	 * the current page. Make sure those are flushed before we
+	 * migrate to a new page.
+	 */
+	rmb(); /* with task_numa_work */
+	if (mm->numa_tlb_lazy)
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-05 15:40           ` Rik van Riel
@ 2013-12-05 19:54             ` Mel Gorman
  2013-12-05 20:05               ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-05 19:54 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alex Thorlton, Linux-MM, LKML, hhuang

On Thu, Dec 05, 2013 at 10:40:15AM -0500, Rik van Riel wrote:
> On Wed, 4 Dec 2013 16:07:41 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Because I found it impossible to segfault processes under any level of
> > scanning and numa hinting fault stress after it was applied
>  
> I think I still managed to trigger the bug, by setting numa page
> scanning to ludicrous speed, and running two large specjbb2005
> processes on a 4 node system in an infinite loop :)
> 
> I believe the reason is your patch flushes the TLB too late,
> after the page contents have been migrated over to the new
> page.
> 
> The changelog below should explain how the race works, and
> how this patch supposedly fixes it. If it doesn't, let me
> know and I'll go back to the drawing board :)
> 

I think that's a better fit and a neater fix. Thanks! I think it barriers
more than it needs to (a definite cost versus a maybe cost). The flush can be
deferred until we are definitely trying to migrate, and the pte case is
not guaranteed to be flushed before migration because pte_mknonnuma causes
the flush in ptep_clear_flush to be avoided later. Mashing the two patches
together yields this.

---8<---
mm,numa: fix memory-corrupting race between THP NUMA unmap and migrate

There is a subtle race between THP NUMA migration, and the NUMA
unmapping code.

The NUMA unmapping code does a permission change on pages, which
is done with a batched (deferred) TLB flush. This is normally safe,
because the pages stay in the same place, and having other CPUs
continue to access them until the TLB flush is indistinguishable
from having other CPUs do those same accesses before the PTE
permission change.

The THP NUMA migration code normally does not do a remote TLB flush,
because the PTE is marked inaccessible, meaning no other CPUs should
have cached TLB entries that allow them to access the memory.

However, the following race is possible:

CPU A			CPU B			CPU C

						load TLB entry
make entry PMD_NUMA
			fault on entry
						write to page
			start migrating page
						write to page
			change PMD to new page
flush TLB
						reload TLB from new entry
						lose data

The obvious fix is to flush remote TLB entries from the numa
migrate code on CPU B, while CPU A is making PTE changes, and
has the TLB flush batched up for later.

The migration for 4kB pages is currently fine, because it calls
pte_mknonnuma before migrating the page, which causes the migration
code to always do a remote TLB flush.  We should probably optimize
that at some point...

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h  |  6 ++++--
 include/linux/mm_types.h |  3 +++
 kernel/sched/core.c      |  1 +
 kernel/sched/fair.c      |  6 ++++++
 mm/memory.c              |  2 +-
 mm/migrate.c             | 30 +++++++++++++++++++++++++-----
 6 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 804651c..5c60606 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,7 +92,8 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct vm_area_struct *vma, struct page *page,
+				  unsigned long addr, int node);
 extern bool migrate_ratelimited(int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
@@ -102,7 +103,8 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
 static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
 {
 }
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct vm_area_struct *vma,
+			struct page *page, unsigned long addr, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..5e5fa017 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -429,6 +429,9 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 
+	/* task_numa_work is unmapping pages, with deferred TLB flush */
+	bool numa_tlb_lazy;
+
 	/*
 	 * The first node a task was scheduled on. If a task runs on
 	 * a different node than Make PTE Scan Go Now.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..f436736 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1622,6 +1622,7 @@ static void __sched_fork(struct task_struct *p)
 		p->mm->numa_next_scan = jiffies;
 		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
+		p->mm->numa_tlb_lazy = false;
 	}
 
 	p->node_stamp = 0ULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40d8ea3..57d44a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -966,6 +966,9 @@ void task_numa_work(struct callback_head *work)
 		start = 0;
 		vma = mm->mmap;
 	}
+
+	wmb(); /* with do_huge_pmd_numa_page */
+	mm->numa_tlb_lazy = true;
 	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
@@ -994,6 +997,9 @@ void task_numa_work(struct callback_head *work)
 	}
 
 out:
+	wmb(); /* with do_huge_pmd_numa_page */
+	mm->numa_tlb_lazy = false;
+
 	/*
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to the vma_migratable. If they are not, we would find the
diff --git a/mm/memory.c b/mm/memory.c
index f453384..c077c9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3576,7 +3576,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(vma, page, addr, target_nid);
 	if (migrated)
 		page_nid = target_nid;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 5dfd552..344c084 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1607,9 +1607,11 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
 	return rate_limited;
 }
 
-int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+int numamigrate_isolate_page(pg_data_t *pgdat, struct vm_area_struct *vma,
+				struct page *page, unsigned long addr)
 {
 	int page_lru;
+	unsigned long nr_pages;
 
 	VM_BUG_ON(compound_order(page) && !PageTransHuge(page));
 
@@ -1633,8 +1635,25 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	}
 
 	page_lru = page_is_file_cache(page);
+	nr_pages = hpage_nr_pages(page);
 	mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
-				hpage_nr_pages(page));
+				nr_pages);
+
+	/*
+	 * At the time this is called, another CPU is potentially turning ptes
+	 * of this process into NUMA ptes. That permission change batches the
+	 * TLB flush, so other CPUs may still have valid TLB entries pointing
+	 * to the address. These need to be flushed before migration.
+	 */
+	rmb();
+	if (vma->vm_mm->numa_tlb_lazy) {
+		if (nr_pages == 1) {
+			flush_tlb_page(vma, addr);
+		} else {
+			flush_tlb_range(vma, addr, addr +
+					(nr_pages << PAGE_SHIFT));
+		}
+	}
 
 	/*
 	 * Isolating the page has taken another reference, so the
@@ -1667,7 +1686,8 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct vm_area_struct *vma, struct page *page,
+			   unsigned long addr, int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1689,7 +1709,7 @@ int migrate_misplaced_page(struct page *page, int node)
 	if (numamigrate_update_ratelimit(pgdat, 1))
 		goto out;
 
-	isolated = numamigrate_isolate_page(pgdat, page);
+	isolated = numamigrate_isolate_page(pgdat, vma, page, addr);
 	if (!isolated)
 		goto out;
 
@@ -1752,7 +1772,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	page_nid_xchg_last(new_page, page_nid_last(page));
 
-	isolated = numamigrate_isolate_page(pgdat, page);
+	isolated = numamigrate_isolate_page(pgdat, vma, page, mmun_start);
 	if (!isolated) {
 		put_page(new_page);
 		goto out_fail;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-05 19:54             ` Mel Gorman
@ 2013-12-05 20:05               ` Rik van Riel
  2013-12-06  9:24                 ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-05 20:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML, hhuang

On 12/05/2013 02:54 PM, Mel Gorman wrote:

> I think that's a better fit and a neater fix. Thanks! I think it barriers
> more than it needs to (definite cost vs maybe cost), the flush can be
> deferred until we are definitely trying to migrate and the pte case is
> not guaranteed to be flushed before migration due to pte_mknonnuma causing
> a flush in ptep_clear_flush to be avoided later. Mashing the two patches
> together yields this.

I think this would fix the numa migrate case.

However, I believe the same issue is also present in
mprotect(..., PROT_NONE) vs. compaction, for programs
that trap SIGSEGV for garbage collection purposes.

They could lose modifications done in-between when
the pte was set to PROT_NONE, and the actual TLB
flush, if compaction moves the page around in-between
those two events.

I don't know if this is a case we need to worry about
at all, but I think the same fix would apply to that
code path, so I guess we might as well make it...
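
For reference, the kind of program I have in mind uses mprotect() as a
write barrier, something like the sketch below (heavily simplified
userspace illustration, assuming 4kB pages; real collectors are more
careful about signal safety than this):

#include <signal.h>
#include <string.h>
#include <sys/mman.h>

#define GC_PAGE_SIZE 4096UL	/* assumption: 4kB pages */

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	/* A real GC would record si->si_addr here; just reopen the page */
	void *page = (void *)((unsigned long)si->si_addr & ~(GC_PAGE_SIZE - 1));
	mprotect(page, GC_PAGE_SIZE, PROT_READ | PROT_WRITE);
}

static void arm_write_barrier(void *heap, size_t size)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	/* The next write to each page traps and tells the GC what changed */
	mprotect(heap, size, PROT_NONE);
}

A write that races with compaction moving the page could then land in the
old copy after migration, which is the data loss scenario above.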

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-05 20:05               ` Rik van Riel
@ 2013-12-06  9:24                 ` Mel Gorman
  2013-12-06 17:38                   ` Alex Thorlton
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-06  9:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alex Thorlton, Linux-MM, LKML, hhuang

On Thu, Dec 05, 2013 at 03:05:19PM -0500, Rik van Riel wrote:
> On 12/05/2013 02:54 PM, Mel Gorman wrote:
> 
> >I think that's a better fit and a neater fix. Thanks! I think it barriers
> >more than it needs to (definite cost vs maybe cost), the flush can be
> >deferred until we are definitely trying to migrate and the pte case is
> >not guaranteed to be flushed before migration due to pte_mknonnuma causing
> >a flush in ptep_clear_flush to be avoided later. Mashing the two patches
> >together yields this.
> 
> I think this would fix the numa migrate case.
> 

Good. So far I have not been seeing any problems with it at least.

> However, I believe the same issue is also present in
> mprotect(..., PROT_NONE) vs. compaction, for programs
> that trap SIGSEGV for garbage collection purposes.
> 

I'm not 100% convinced we need to be concerned with races with
mprotect(PROT_NONE) and a parallel reference to that area from userspace. I
would consider it to be a buggy application if two threads were not
co-ordinating the protection of a region and referencing it.  I would also
expect garbage collectors to be managing smart pointers and using reference
counting to copy between heap generations (or similar mechanisms) instead
of trapping sigsegv.

Intel's architectural manual 3A covers what happens for delayed TLB
invalidations in section 4.10.4.4 (in the version I'm looking at, at
least). The following two snippets are the most important:

	Software developers should understand that, between the modification
	of a paging-structure entry and execution of the invalidation
	instruction recommended in Section 4.10.4.2, the processor may
	use translations based on either the old value or the new value
	of the paging- structure entry. The following items describe some
	of the potential consequences of delayed invalidation:

	o If a paging-structure entry is modified to change the P flag from
	1 to 0, an access to a linear address whose translation is controlled
	by this entry may or may not cause a page-fault exception.

	o If a paging-structure entry is modified to change the R/W flag
	from 0 to 1, write accesses to linear addresses whose translation is
	controlled by this entry may or may not cause a page-fault exception.

So accesses after the PROT_NONE change may still happen until after the
deferred TLB flush. In a race with mprotect(PROT_NONE), the thread will
either complete the access or receive a SIGSEGV due to the failed
protection, but this is pretty much expected and unpredictable.

I do not think the present bit gets cleared on mprotect(PROT_NONE) due
to the relevant bits being

#define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
                         _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY)
#define PAGE_NONE   __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)

If the present bit remains then compaction should flush the TLB on the
call to ptep_clear_flush as the pte_accessible check is based on the present
bit. So even though it is possible for a write to complete during a call
to mprotect(PROT_NONE), the same is not true for compaction.
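
For reference, the two pieces I'm relying on look like this in 3.12
(quoting from memory, so worth double-checking):

/* arch/x86/include/asm/pgtable.h */
static inline int pte_accessible(pte_t a)
{
	return pte_flags(a) & _PAGE_PRESENT;
}

/* mm/pgtable-generic.c */
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep)
{
	pte_t pte;
	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
	if (pte_accessible(pte))
		flush_tlb_page(vma, address);
	return pte;
}

So as long as mprotect(PROT_NONE) leaves _PAGE_PRESENT set, the migration
unmap still flushes the TLB here.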

> They could lose modifications done in-between when
> the pte was set to PROT_NONE, and the actual TLB
> flush, if compaction moves the page around in-between
> those two events.
> 
> I don't know if this is a case we need to worry about
> at all, but I think the same fix would apply to that
> code path, so I guess we might as well make it...

I might be going "la la la la we're fine" and deluding myself but we
appear to be covered here and it would be a shame to add expense to a
path unnecessarily.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-06  9:24                 ` Mel Gorman
@ 2013-12-06 17:38                   ` Alex Thorlton
  2013-12-06 18:32                     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Thorlton @ 2013-12-06 17:38 UTC (permalink / raw)
  To: Mel Gorman, t; +Cc: Rik van Riel, Linux-MM, LKML, hhuang

On Fri, Dec 06, 2013 at 09:24:00AM +0000, Mel Gorman wrote:
> Good. So far I have not been seeing any problems with it at least.

I went through and tested all the different iterations of this patchset
last night, and have hit a few problems, but I *think* this has solved
the segfault problem.  I'm now hitting some rcu_sched stalls when
running my tests.

Initially things were getting hung up on a lock in change_huge_pmd, so
I applied Kirill's patches to split up the PTL, which did manage to ease
the contention on that lock, but, now it appears that I'm hitting stalls
somewhere else.

I'll play around with this a bit tonight/tomorrow and see if I can track
down exactly where things are getting stuck.  Unfortunately, on these
large systems, when we hit a stall, the system often completely locks up
before the NMI backtrace can complete on all cpus, so, as of right now,
I've not been able to get a backtrace for the cpu that's initially
causing the stall.  I'm going to see if I can slim down the code for the
stall detection to just give the backtrace for the cpu that's initially
stalling out.  In the meantime, let me know if you guys have any ideas
that could keep things moving.

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
  2013-12-06 17:38                   ` Alex Thorlton
@ 2013-12-06 18:32                     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2013-12-06 18:32 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: t, Rik van Riel, Linux-MM, LKML, hhuang

On Fri, Dec 06, 2013 at 11:38:43AM -0600, Alex Thorlton wrote:
> On Fri, Dec 06, 2013 at 09:24:00AM +0000, Mel Gorman wrote:
> > Good. So far I have not been seeing any problems with it at least.
> 
> I went through and tested all the different iterations of this patchset
> last night, and have hit a few problems, but I *think* this has solved
> the segfault problem.  I'm now hitting some rcu_sched stalls when
> running my tests.
> 

Well that's news for the start of the weekend.

> Initially things were getting hung up on a lock in change_huge_pmd, so
> I applied Kirill's patches to split up the PTL, which did manage to ease
> the contention on that lock, but, now it appears that I'm hitting stalls
> somewhere else.
> 

To check this, the next version of the series will be based on 3.13-rc2 which
will include Kirill's patches. If the segfault is cleared up then at least
that much will be in flight and in the process of being backported to 3.12.
NUMA balancing in 3.12 is quite work intensive but the patches in 3.13-rc2
should substantially reduce that overhead. It'd be best to check them
all in combination and see what falls out.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-04 16:07         ` Mel Gorman
  2013-12-05 15:40           ` Rik van Riel
@ 2013-12-06 19:13           ` Rik van Riel
  2013-12-06 20:32             ` Christoph Lameter
  1 sibling, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-06 19:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Linux-MM, LKML

On Wed, 4 Dec 2013 16:07:41 +0000
Mel Gorman <mgorman@suse.de> wrote:

> Because I found it impossible to segfault processes under any level of
> scanning and numa hinting fault stress after it was applied

As discussed on #mm, here is the new patch (just compile tested so far).

---8<---

Subject: mm: fix TLB flush race between migration, and change_protection_range

There are a few subtle races, between change_protection_range
(used by mprotect and change_prot_numa) on one side, and NUMA
page migration and compaction on the other side.

The basic race is that there is a time window between when the
PTE gets made non-present (PROT_NONE or NUMA), and the TLB is
flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA
migration code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the
cached translation, to what is no longer the current memory
location of the process.

This only affects x86, which has a somewhat optimistic
pte_accessible. All other architectures appear to be safe,
and will either always flush, or flush whenever there is
a valid mapping, even with no permissions (SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure
that pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA
memory may still be accessible if there is a TLB flush pending for
the mm.

This should fix both NUMA migration and compaction.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/sparc/include/asm/pgtable_64.h |  4 ++--
 arch/x86/include/asm/pgtable.h      | 11 +++++++--
 include/asm-generic/pgtable.h       |  2 +-
 include/linux/mm_types.h            | 45 ++++++++++++++++++++++++++++++++++++-
 kernel/fork.c                       |  1 +
 mm/huge_memory.c                    |  7 ++++++
 mm/mprotect.c                       |  2 ++
 mm/pgtable-generic.c                |  5 +++--
 8 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index d22b92d..ecc7fa3 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -616,7 +616,7 @@ static inline unsigned long pte_present(pte_t pte)
 }
 
 #define pte_accessible pte_accessible
-static inline unsigned long pte_accessible(pte_t a)
+static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
 {
 	return pte_val(a) & _PAGE_VALID;
 }
@@ -806,7 +806,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
-	if (likely(mm != &init_mm) && pte_accessible(orig))
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
 		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7f7fe69..a369b0a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -415,9 +415,16 @@ static inline int pte_present(pte_t a)
 }
 
 #define pte_accessible pte_accessible
-static inline int pte_accessible(pte_t a)
+static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
-	return pte_flags(a) & _PAGE_PRESENT;
+	if (pte_flags(a) & _PAGE_PRESENT)
+		return true;
+
+	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+			tlb_flush_pending(mm))
+		return true;
+
+	return false;
 }
 
 static inline int pte_hidden(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 18e27c2..71db9f1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -221,7 +221,7 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif
 
 #ifndef pte_accessible
-# define pte_accessible(pte)		((void)(pte),1)
+# define pte_accessible(mm, pte)	((void)(pte),1)
 #endif
 
 #ifndef flush_tlb_fix_spurious_fault
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 261ff4a..d451360 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -351,7 +351,6 @@ struct mm_struct {
 						 * by mmlist_lock
 						 */
 
-
 	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
 
@@ -428,6 +427,14 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+	/*
+	 * An operation with batched TLB flushing is going on. Anything that
+	 * can move process memory needs to flush the TLB when moving a
+	 * PROT_NONE or PROT_NUMA mapped page.
+	 */
+	bool tlb_flush_pending;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
@@ -444,4 +451,40 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return mm->cpu_vm_mask_var;
 }
 
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+/*
+ * Memory barriers to keep this state in sync are graciously provided by
+ * the page table locks, outside of which no page table modifications happen.
+ * The barriers below prevent the compiler from re-ordering the instructions
+ * around the memory barriers that are already present in the code.
+ */
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	return mm->tlb_flush_pending;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+	mm->tlb_flush_pending = true;
+	barrier();
+}
+/* Clearing is done after a TLB flush, which also provides a barrier. */
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	mm->tlb_flush_pending = false;
+}
+#else
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	return false;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+#endif
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c10ecfe..c975693 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -544,6 +544,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	clear_tlb_flush_pending(mm);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d68066f..12b72ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1385,6 +1385,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
+	 * The page_table_lock above provides a memory barrier
+	 * with change_protection_range.
+	 */
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 781b7f3..ef0ebb3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -180,6 +180,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	set_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -191,6 +192,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	clear_tlb_flush_pending(mm);
 
 	return pages;
 }
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0e083c5..683f476 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -86,9 +86,10 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pte_t *ptep)
 {
+	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
-	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	if (pte_accessible(pte))
+	pte = ptep_get_and_clear(mm, address, ptep);
+	if (pte_accessible(mm, pte))
 		flush_tlb_page(vma, address);
 	return pte;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-06 19:13           ` [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range Rik van Riel
@ 2013-12-06 20:32             ` Christoph Lameter
  2013-12-06 21:21               ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2013-12-06 20:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On Fri, 6 Dec 2013, Rik van Riel wrote:
>
> The basic race looks like this:
>
> CPU A			CPU B			CPU C
>
> 						load TLB entry
> make entry PTE/PMD_NUMA
> 			fault on entry
> 						read/write old page
> 			start migrating page

When you start migrating a page a special page migration entry is
created that will trap all accesses to the page. You can safely flush when
the migration entry is there. Only allow a new PTE/PMD to be put there
*after* the tlb flush.


> 			change PTE/PMD to new page

Don't do that. We have migration entries for a reason.

> 						read/write old page [*]

Should cause a page fault which should put the process to sleep. Process
will safely read the page after the migration entry is removed.

> flush TLB

Establish the new PTE/PMD after the flush removing the migration pte
entry and thereby avoiding the race.

> 						reload TLB from new entry
> 						read/write new page
> 						lose data
>
> [*] the old page may belong to a new user at this point!
>
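
For anyone not familiar with how this works for regular pages, a tiny
standalone sketch of the ordering Christoph describes is below. The enum
and helper are purely illustrative; in the real kernel the intermediate
state is a swap-type migration entry and faulting tasks sleep in
migration_entry_wait() until it is removed.

#include <stdio.h>

enum pte_state {
	PTE_PRESENT_OLD,	/* maps the old page */
	PTE_MIGRATION_ENTRY,	/* traps every access; faulting tasks sleep */
	PTE_PRESENT_NEW,	/* maps the new page, installed after the flush */
};

static const char *describe(enum pte_state s)
{
	switch (s) {
	case PTE_PRESENT_OLD:		return "old page mapped";
	case PTE_MIGRATION_ENTRY:	return "migration entry: accesses fault and wait";
	case PTE_PRESENT_NEW:		return "new page mapped";
	}
	return "?";
}

int main(void)
{
	enum pte_state pte = PTE_PRESENT_OLD;

	printf("%s\n", describe(pte));
	pte = PTE_MIGRATION_ENTRY;	/* 1. unmap: replace the PTE with a migration entry */
	printf("%s\n", describe(pte));
	/* 2. flush remote TLBs while the migration entry is in place */
	/* 3. copy the page contents old -> new */
	pte = PTE_PRESENT_NEW;		/* 4. only now install the mapping of the new page */
	printf("%s\n", describe(pte));
	return 0;
}

The point is step 4: the new mapping only becomes visible once no CPU can
still be using a stale translation of the old one.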


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-06 20:32             ` Christoph Lameter
@ 2013-12-06 21:21               ` Rik van Riel
  2013-12-07  0:25                 ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-06 21:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On 12/06/2013 03:32 PM, Christoph Lameter wrote:
> On Fri, 6 Dec 2013, Rik van Riel wrote:
>>
>> The basic race looks like this:
>>
>> CPU A			CPU B			CPU C
>>
>> 						load TLB entry
>> make entry PTE/PMD_NUMA
>> 			fault on entry
>> 						read/write old page
>> 			start migrating page
> 
> When you start migrating a page a special page migration entry is
> created that will trap all accesses to the page. You can safely flush when
> the migration entry is there. Only allow a new PTE/PMD to be put there
> *after* the tlb flush.

A PROT_NONE or NUMA pte is just as effective as a migration pte.
The only problem is, the TLB flush was not always done...

> 
>> 			change PTE/PMD to new page
> 
> Don't do that. We have migration entries for a reason.

We do not have migration entries for hugepages, do we?

>> 						read/write old page [*]
> 
> Should cause a page fault which should put the process to sleep. Process
> will safely read the page after the migration entry is removed.
> 
>> flush TLB
> 
> Establish the new PTE/PMD after the flush removing the migration pte
> entry and thereby avoiding the race.

That is what this patch does.

-- 
All rights reversed


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-06 21:21               ` Rik van Riel
@ 2013-12-07  0:25                 ` Christoph Lameter
  2013-12-07  3:14                   ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2013-12-07  0:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On Fri, 6 Dec 2013, Rik van Riel wrote:

> > When you start migrating a page a special page migration entry is
> > created that will trap all accesses to the page. You can safely flush when
> > the migration entry is there. Only allow a new PTE/PMD to be put there
> > *after* the tlb flush.
>
> A PROT_NONE or NUMA pte is just as effective as a migration pte.
> The only problem is, the TLB flush was not always done...

Ok then what are you trying to fix?

> > Don't do that. We have migration entries for a reason.
>
> We do not have migration entries for hugepages, do we?

Dunno.

> >
> > Should cause a page fault which should put the process to sleep. Process
> > will safely read the page after the migration entry is removed.
> >
> >> flush TLB
> >
> > Establish the new PTE/PMD after the flush removing the migration pte
> > entry and thereby avoiding the race.
>
> That is what this patch does.

If that is the case then this patch would not be needed and the tracking
of state in the mm_struct would not be necessary.


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-07  0:25                 ` Christoph Lameter
@ 2013-12-07  3:14                   ` Rik van Riel
  2013-12-09 16:00                     ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2013-12-07  3:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On 12/06/2013 07:25 PM, Christoph Lameter wrote:
> On Fri, 6 Dec 2013, Rik van Riel wrote:
> 
>>> When you start migrating a page a special page migration entry is
>>> created that will trap all accesses to the page. You can safely flush when
>>> the migration entry is there. Only allow a new PTE/PMD to be put there
>>> *after* the tlb flush.
>>
>> A PROT_NONE or NUMA pte is just as effective as a migration pte.
>> The only problem is, the TLB flush was not always done...
> 
> Ok then what are you trying to fix?

It would help if you had actually read the patch.


-- 
All rights reversed


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-07  3:14                   ` Rik van Riel
@ 2013-12-09 16:00                     ` Christoph Lameter
  2013-12-09 16:27                       ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2013-12-09 16:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On Fri, 6 Dec 2013, Rik van Riel wrote:

> > Ok then what are you trying to fix?
>
> It would help if you had actually read the patch.

I read the patch. Please update the documentation to accurately describe
the race.

From what I can see this race affects only huge pages and the basic issue
seems to be that huge pages do not use migration entries but directly
replace the pmd (migrate_misplaced_transhuge_page() f.e.).

That is not safe and there may be multiple other races as we add more
general functionality to huge pages. An intermediate stage is needed
that allows the clearing out of remote tlb entries before the new tlb
entry becomes visible.

Then you won't need this code anymore.


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-09 16:00                     ` Christoph Lameter
@ 2013-12-09 16:27                       ` Mel Gorman
  2013-12-09 16:59                         ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2013-12-09 16:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Rik van Riel, Alex Thorlton, Linux-MM, LKML

On Mon, Dec 09, 2013 at 04:00:24PM +0000, Christoph Lameter wrote:
> On Fri, 6 Dec 2013, Rik van Riel wrote:
> 
> > > Ok then what are you trying to fix?
> >
> > It would help if you had actually read the patch.
> 
> I read the patch. Please update the documentation to accurately describe
> the race.
> 
> From what I can see this race affects only huge pages and the basic issue
> seems to be that huge pages do not use migration entries but directly
> replace the pmd (migrate_misplaced_transhuge_page() f.e.).
> 

I looked at what would be required to implement migration entry support for
PMDs. It's major surgery because we do not have anything like swap entries
to use at that page table level. It looked like it would require inserting
a fake entry (easiest would be to point to a global page) that all page
table walkers would recognise and block on, and teaching every page table
walker to get it right.

One can't do something simple like clear the entry out, because then the
no-page handlers for GUP or faults insert the zero page behind it and it
goes to hell, and we can't hold the page table lock across the migration copy.

> That is not safe and there may be multiple other races as we add more
> general functionality to huge pages. An intermediate stage is needed
> that allows the clearing out of remote tlb entries before the new tlb
> entry becomes visible.
> 

The patch flushes the TLBs as it is, and future accesses are held up in the
NUMA hinting fault handler. It's functionally similar to having a migration
entry, albeit special-cased to handle just automatic NUMA balancing.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-09 16:27                       ` Mel Gorman
@ 2013-12-09 16:59                         ` Christoph Lameter
  2013-12-09 21:01                           ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2013-12-09 16:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, Alex Thorlton, Linux-MM, LKML

On Mon, 9 Dec 2013, Mel Gorman wrote:

> I looked at what would be required to implement migration entry support for
> PMDs. It's major surgery because we do not have anything like swap entries
> to use at that page table level. It looked like it would require inserting
> a fake entry (easiest would be to point to a global page) that all page
> table walkers would recognise and block on, and teaching every page table
> walker to get it right.

Well something needs to cause a fault and stop accesses to the page.

> One can't do something simple like clear the entry out, because then the
> no-page handlers for GUP or faults insert the zero page behind it and it
> goes to hell, and we can't hold the page table lock across the migration copy.

Right, you need to have a special migration entry there. Same as for
regular-sized pages.


> The patch flushes the TLBs as it is, and future accesses are held up in the
> NUMA hinting fault handler. It's functionally similar to having a migration
> entry, albeit special-cased to handle just automatic NUMA balancing.

Hmmm... Hopefully that will work. I'd rather see a clean extension of what
we use for regular pages. If we add functionality so that huge pages operate
more like regular ones, then this could become an issue.


* Re: [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-09 16:59                         ` Christoph Lameter
@ 2013-12-09 21:01                           ` Rik van Riel
  0 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2013-12-09 21:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 11:59 AM, Christoph Lameter wrote:
> On Mon, 9 Dec 2013, Mel Gorman wrote:
> 
>> I looked at what would be required to implement migration entry support for
>> PMDs. It's major surgery because we do not have anything like swap entries
>> to use at that page table level. It looked like it would require inserting
>> a fake entry (easiest would be to point to a global page) that all page
>> table walkers would recognise and block on, and teaching every page table
>> walker to get it right.
> 
> Well something needs to cause a fault and stop accesses to the page.

The NUMA patches introduce such a state: the pmd_numa state.

The "issue" is that the NUMA code can race with itself, and with
CMA.

The code that marks PMDs as NUMA ones will change a bunch of
PMDs at once, and will then flush the TLB. Until that flush,
CPUs that have the old translation cached in their TLBs may
continue accessing the page.

Meanwhile, the code that does the migration may start running on
a CPU that does not have an old entry in the TLB, and it may
start the page migration.

The fundamental issue is that moving the PMD state from valid
to the intermediate state consists of multiple operations, and
there will always be some time window between them.
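
A minimal userspace model of that window, for illustration only: old_page
and new_page stand in for the two physical pages, the cached pointer for a
stale TLB entry, and the sleep merely forces the problematic interleaving.
All names here are hypothetical; this is not kernel code.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char old_page[64] = "original data";
static char new_page[64];
static _Atomic(char *) page = old_page;		/* the "page tables" */

static void *stale_writer(void *arg)		/* CPU with a stale TLB entry */
{
	(void)arg;
	char *tlb_entry = atomic_load(&page);	/* translation cached before the change */
	usleep(10000);				/* window: the migration copy happens now */
	strcpy(tlb_entry, "late write");	/* lands in the old page only */
	return NULL;
}

static void *migrator(void *arg)		/* CPU doing the migration */
{
	(void)arg;
	memcpy(new_page, old_page, sizeof(new_page));	/* copy old -> new */
	atomic_store(&page, new_page);			/* "update the PTE/PMD" */
	/* the TLB flush comes too late (or not at all) in the buggy ordering */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, stale_writer, NULL);
	pthread_create(&b, NULL, migrator, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("published page reads \"%s\"; the late write is lost\n",
	       atomic_load(&page));
	return 0;
}

Built with "cc -pthread", the published page still reads the pre-copy
contents: the write done through the stale translation after the copy never
reaches the new page, which is the data loss shown in the diagram earlier
in the thread.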

-- 
All rights reversed



Thread overview: 38+ messages
2013-12-03  8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
2013-12-03  8:51 ` [PATCH 01/15] mm: numa: Do not batch handle PMD pages Mel Gorman
2013-12-03  8:51 ` [PATCH 02/15] mm: hugetlbfs: fix hugetlbfs optimization Mel Gorman
2013-12-03  8:51 ` [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page Mel Gorman
2013-12-04 16:59   ` Alex Thorlton
2013-12-05 13:35     ` Mel Gorman
2013-12-03  8:51 ` [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
2013-12-03 23:07   ` Rik van Riel
2013-12-03 23:54     ` Mel Gorman
2013-12-03  8:51 ` [PATCH 05/15] mm: numa: Call MMU notifiers on " Mel Gorman
2013-12-03  8:51 ` [PATCH 06/15] mm: Clear pmd_numa before invalidating Mel Gorman
2013-12-03  8:51 ` [PATCH 07/15] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
2013-12-03  8:51 ` [PATCH 08/15] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
2013-12-03  8:51 ` [PATCH 09/15] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
2013-12-03  8:51 ` [PATCH 10/15] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
2013-12-03  8:51 ` [PATCH 11/15] sched: numa: Skip inaccessible VMAs Mel Gorman
2013-12-03  8:51 ` [PATCH 12/15] Clear numa on mprotect Mel Gorman
2013-12-03  8:52 ` [PATCH 13/15] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
2013-12-03  8:52 ` [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update Mel Gorman
2013-12-03 23:07   ` Rik van Riel
2013-12-03 23:46     ` Mel Gorman
2013-12-04 14:33       ` Rik van Riel
2013-12-04 16:07         ` Mel Gorman
2013-12-05 15:40           ` Rik van Riel
2013-12-05 19:54             ` Mel Gorman
2013-12-05 20:05               ` Rik van Riel
2013-12-06  9:24                 ` Mel Gorman
2013-12-06 17:38                   ` Alex Thorlton
2013-12-06 18:32                     ` Mel Gorman
2013-12-06 19:13           ` [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range Rik van Riel
2013-12-06 20:32             ` Christoph Lameter
2013-12-06 21:21               ` Rik van Riel
2013-12-07  0:25                 ` Christoph Lameter
2013-12-07  3:14                   ` Rik van Riel
2013-12-09 16:00                     ` Christoph Lameter
2013-12-09 16:27                       ` Mel Gorman
2013-12-09 16:59                         ` Christoph Lameter
2013-12-09 21:01                           ` Rik van Riel
