* [PATCH 00/18] NUMA balancing segmentation fault fixes and misc followups v3
@ 2013-12-09  7:08 ` Mel Gorman
  0 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Alex Thorlton reported segmentation faults when NUMA balancing is enabled
on large machines. There is no obvious explanation from the console of what
the problem is, but similar problems have been observed by Rik van Riel and
myself when migration was aggressive enough. Alex, this series is against
3.13-rc2; a verification that the fixes address your problem would be
appreciated.

This series starts with a range of patches aimed at addressing the
segmentation fault problem while offsetting some of the cost to avoid badly
regressing performance in -stable. Those that are cc'd to stable (patches
1-12) should be merged ASAP. The rest of the series is relatively minor
work that fell out during development; it is ok to wait for the next merge
window but should help with the continued development of NUMA balancing.

 arch/sparc/include/asm/pgtable_64.h |   4 +-
 arch/x86/include/asm/pgtable.h      |  11 +++-
 arch/x86/mm/gup.c                   |  13 +++++
 include/asm-generic/pgtable.h       |   2 +-
 include/linux/migrate.h             |   9 ++++
 include/linux/mm_types.h            |  44 +++++++++++++++
 include/linux/mmzone.h              |   5 +-
 include/trace/events/migrate.h      |  26 +++++++++
 include/trace/events/sched.h        |  93 ++++++++++++++++++++++++++++++++
 kernel/fork.c                       |   1 +
 kernel/sched/core.c                 |   2 +
 kernel/sched/fair.c                 |  15 +++++-
 mm/huge_memory.c                    |  45 ++++++++++++----
 mm/migrate.c                        | 103 ++++++++++++++++++++++++++++--------
 mm/mprotect.c                       |  15 ++++--
 mm/pgtable-generic.c                |   8 ++-
 16 files changed, 348 insertions(+), 48 deletions(-)

-- 
1.8.4



* [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:08   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Base pages are unmapped and flushed from cache and TLB during normal page
migration and replaced with a migration entry that causes any parallel
fault or gup to block until migration completes. THP does not unmap pages
due to a lack of support for migration entries at a PMD level. This allows
races with get_user_pages and get_user_pages_fast, races that commit
3f926ab94 ("mm: Close races between THP migration and PMD numa clearing")
made worse by introducing a pmdp_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against THP
migration and properly account for the NUMA hinting fault. On the migration
side the page table lock is taken for each PTE update.
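
[ Illustrative sketch only, condensed from the x86 diff below: the same
  bail-out rule is applied at both levels of the fast GUP walkers so that
  any _PAGE_NUMA entry falls back to the slow get_user_pages() path. ]

    /* In gup_pte_range(): NUMA hinting PTEs force the slow path */
    if (pte_numa(pte)) {
            pte_unmap(ptep);
            return 0;
    }

    /* In gup_pmd_range(): likewise for a NUMA hinting huge PMD */
    if (pmd_numa(pmd))
            return 0;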

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/gup.c | 13 +++++++++++++
 mm/huge_memory.c  | 24 ++++++++++++++++--------
 mm/migrate.c      | 38 +++++++++++++++++++++++++++++++-------
 3 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..0596e8e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -83,6 +83,12 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		pte_t pte = gup_get_pte(ptep);
 		struct page *page;
 
+		/* Similar to the PMD case, NUMA hinting must take slow path */
+		if (pte_numa(pte)) {
+			pte_unmap(ptep);
+			return 0;
+		}
+
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
@@ -167,6 +173,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
 		} else {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bccd5a6..deae592 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1243,6 +1243,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto out;
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!PageHead(page));
 	if (flags & FOLL_TOUCH) {
@@ -1323,23 +1327,27 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
 			goto clear_pmdnuma;
+	}
 
-		/*
-		 * Otherwise wait for potential migrations and retry. We do
-		 * relock and check_same as the page may no longer be mapped.
-		 * As the fault is being retried, do not account for it.
-		 */
+	/*
+	 * If there are potential migrations, wait for completion and retry. We
+	 * do not relock and check_same as the page may no longer be mapped.
+	 * Furtermore, even if the page is currently misplaced, there is no
+	 * guarantee it is still misplaced after the migration completes.
+	 */
+	if (!page_locked) {
 		spin_unlock(ptl);
 		wait_on_page_locked(page);
 		page_nid = -1;
 		goto out;
 	}
 
-	/* Page is misplaced, serialise migrations and parallel THP splits */
+	/*
+	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
+	 * to serialises splits
+	 */
 	get_page(page);
 	spin_unlock(ptl);
-	if (!page_locked)
-		lock_page(page);
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
diff --git a/mm/migrate.c b/mm/migrate.c
index bb94004..2cabbd5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1722,6 +1722,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	pmd_t orig_entry;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -1756,7 +1757,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Recheck the target PMD */
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry))) {
+	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
+fail_putback:
 		spin_unlock(ptl);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1786,16 +1788,34 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 */
 	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
+	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
-	entry = pmd_mknonnuma(entry);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
+	/*
+	 * Clear the old entry under pagetable lock and establish the new PTE.
+	 * Any parallel GUP will either observe the old page blocking on the
+	 * page lock, block on the page table lock or observe the new page.
+	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
+	 * guarantee the copy is visible before the pagetable update.
+	 */
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
-	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
+
+	if (page_count(page) != 2) {
+		set_pmd_at(mm, haddr, pmd, orig_entry);
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		update_mmu_cache_pmd(vma, address, &entry);
+		page_remove_rmap(new_page);
+		goto fail_putback;
+	}
+
 	page_remove_rmap(page);
+
 	/*
 	 * Finish the charge transaction under the page table lock to
 	 * prevent split_huge_page() from dividing up the charge
@@ -1820,9 +1840,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
-	entry = pmd_mknonnuma(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
+	ptl = pmd_lock(mm, pmd);
+	if (pmd_same(*pmd, entry)) {
+		entry = pmd_mknonnuma(entry);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache_pmd(vma, address, &entry);
+	}
+	spin_unlock(ptl);
 
 	unlock_page(page);
 	put_page(page);
-- 
1.8.4



* [PATCH 02/18] mm: numa: Call MMU notifiers on THP migration
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:08   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

MMU notifiers must be called on THP page migration or secondary MMUs will
get very confused.
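
[ For reference, a minimal sketch of the shape of the fix, using the range
  variables introduced in the diff below: the usual invalidate bracket
  around the pagetable change so that secondary MMUs drop their mappings. ]

    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
    ptl = pmd_lock(mm, pmd);
    /* ... tear down the old PMD and establish the new one ... */
    spin_unlock(ptl);
    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);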

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2cabbd5..be787d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
 #include <linux/balloon_compaction.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -1716,12 +1717,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 				struct page *page, int node)
 {
 	spinlock_t *ptl;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	unsigned long mmun_start = address & HPAGE_PMD_MASK;
+	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 	pmd_t orig_entry;
 
 	/*
@@ -1756,10 +1758,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1800,15 +1804,16 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-	page_add_new_anon_rmap(new_page, vma, haddr);
-	pmdp_clear_flush(vma, haddr, pmd);
-	set_pmd_at(mm, haddr, pmd, entry);
+	flush_cache_range(vma, mmun_start, mmun_end);
+	page_add_new_anon_rmap(new_page, vma, mmun_start);
+	pmdp_clear_flush(vma, mmun_start, pmd);
+	set_pmd_at(mm, mmun_start, pmd, entry);
+	flush_tlb_range(vma, mmun_start, mmun_end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, haddr, pmd, orig_entry);
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		set_pmd_at(mm, mmun_start, pmd, orig_entry);
+		flush_tlb_range(vma, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1823,6 +1828,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	unlock_page(new_page);
 	unlock_page(page);
@@ -1843,7 +1849,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_mknonnuma(entry);
-		set_pmd_at(mm, haddr, pmd, entry);
+		set_pmd_at(mm, mmun_start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
-- 
1.8.4



* [PATCH 03/18] mm: Clear pmd_numa before invalidating
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:08   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

pmdp_invalidate() clears the present bit without taking into account that
the PMD might currently be marked _PAGE_NUMA, leaving the PMD in an
unexpected state. Clear pmd_numa before invalidating.
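
[ A minimal sketch of the intended sequence, assuming the numa-cleared
  value is the one that ends up being passed to pmd_mknotpresent(): ]

    pmd_t entry = *pmdp;

    /* Transfer the NUMA hinting encoding back to a normal protection... */
    if (pmd_numa(entry))
            entry = pmd_mknonnuma(entry);
    /* ...before clearing the present bit and flushing the TLB */
    set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
    flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);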

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/pgtable-generic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index cbb3854..e84cad2 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -191,6 +191,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
+	pmd_t entry = *pmdp;
+	if (pmd_numa(entry))
+		entry = pmd_mknonnuma(entry);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }
-- 
1.8.4



* [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:08   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

If the PMD is cleared and flushed then a parallel fault in handle_mm_fault()
will see pmd_none and enter the do_huge_pmd_anonymous_page() path, where it
will attempt to insert a huge zero page. This is wasteful, so this patch
avoids clearing the PMD when setting it pmd_numa.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index deae592..5a5da50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1529,7 +1529,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 */
 			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
-				entry = pmdp_get_and_clear(mm, addr, pmd);
+				entry = *pmd;
 				entry = pmd_mknuma(entry);
 				ret = HPAGE_PMD_NR;
 			}
-- 
1.8.4



* [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:08   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

The TLB must be flushed if the PTE is updated, but change_pte_range is
clearing the PTE while marking PTEs pte_numa without necessarily flushing
the TLB when it reinserts the same entry. Without the flush, it is
conceivable that two processors hold different TLB entries for the same
virtual address and, at the very least, spurious faults would be generated.
This patch only clears the PTE in change_pte_range for a full protection
change.
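
[ Illustrative sketch only, condensed from the hunk below: after this
  patch only a full protection change goes through the clear-and-commit
  cycle, while the prot_numa scan reads the existing entry and marks it
  without a transient clear of the PTE. ]

    if (!prot_numa) {
            ptent = ptep_modify_prot_start(mm, addr, pte);  /* clears the PTE */
            ptent = pte_modify(ptent, newprot);
            ptep_modify_prot_commit(mm, addr, pte, ptent);
    } else {
            ptent = *pte;   /* read the entry, do not clear it */
            /* the pte_numa marking is applied without a clear/commit cycle */
    }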

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2666797..0a07e2d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -52,13 +52,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 			bool updated = false;
 
-			ptent = ptep_modify_prot_start(mm, addr, pte);
 			if (!prot_numa) {
+				ptent = ptep_modify_prot_start(mm, addr, pte);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
 				struct page *page;
 
+				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
 					if (!pte_numa(oldpte)) {
@@ -79,7 +80,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (updated)
 				pages++;
-			ptep_modify_prot_commit(mm, addr, pte, ptent);
+
+			/* Only !prot_numa always clears the pte */
+			if (!prot_numa)
+				ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
-- 
1.8.4



* [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

The anon_vma lock prevents parallel THP splits and any associated complexity
that arises when handling splits during THP migration. This patch checks
if the lock was successfully acquired and bails from THP migration if it
failed for any reason.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a5da50..0f00b96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,6 +1359,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
+	/* Bail if we fail to protect against THP splits for any reason */
+	if (unlikely(!anon_vma)) {
+		put_page(page);
+		page_nid = -1;
+		goto clear_pmdnuma;
+	}
+
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
-- 
1.8.4



* [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

If a PMD changes during a THP migration then migration aborts, but the
failure path does more work than is necessary.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be787d5..a987525 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1780,7 +1780,8 @@ fail_putback:
 		putback_lru_page(page);
 		mod_zone_page_state(page_zone(page),
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-		goto out_fail;
+
+		goto out_unlock;
 	}
 
 	/*
@@ -1854,6 +1855,7 @@ out_dropref:
 	}
 	spin_unlock(ptl);
 
+out_unlock:
 	unlock_page(page);
 	put_page(page);
 	return 0;
-- 
1.8.4



* [PATCH 08/18] sched: numa: Skip inaccessible VMAs
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Inaccessible VMAs should not be trapping NUMA hinting faults. Skip them.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8b652e..1ce1615 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1752,6 +1752,13 @@ void task_numa_work(struct callback_head *work)
 		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
 			continue;
 
+		/*
+		 * Skip inaccessible VMAs to avoid any confusion between
+		 * PROT_NONE and NUMA hinting ptes
+		 */
+		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.4



* [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

On a protection change it is no longer clear if the page should still be
accessible. This patch clears the NUMA hinting fault bits on a protection
change.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 2 ++
 mm/mprotect.c    | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0f00b96..0ecaba2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1522,6 +1522,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		ret = 1;
 		if (!prot_numa) {
 			entry = pmdp_get_and_clear(mm, addr, pmd);
+			if (pmd_numa(entry))
+				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
 			BUG_ON(pmd_write(entry));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0a07e2d..eb2f349 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -54,6 +54,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (!prot_numa) {
 				ptent = ptep_modify_prot_start(mm, addr, pte);
+				if (pte_numa(ptent))
+					ptent = pte_mknonnuma(ptent);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
-- 
1.8.4



* [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

do_huge_pmd_numa_page() handles the case where there is a parallel THP
migration. However, by the time that case is detected the NUMA hinting
information has already been disrupted. This patch adds an earlier check
with some helpers.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  9 +++++++++
 mm/huge_memory.c        | 22 ++++++++++++++++------
 mm/migrate.c            | 12 ++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f5096b5..b7717d7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,10 +90,19 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
+extern bool pmd_trans_migrating(pmd_t pmd);
+extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
+static inline bool pmd_trans_migrating(pmd_t pmd)
+{
+	return false;
+}
+static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ecaba2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,6 +882,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		ret = 0;
 		goto out_unlock;
 	}
+
+	/* mmap_sem prevents this happening but warn if that changes */
+	WARN_ON(pmd_trans_migrating(pmd));
+
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(src_ptl);
@@ -1299,6 +1303,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
+	/*
+	 * If there are potential migrations, wait for completion and retry
+	 * without disrupting NUMA hinting information. Do not relock and
+	 * check_same as the page may no longer be mapped.
+	 */
+	if (unlikely(pmd_trans_migrating(*pmdp))) {
+		spin_unlock(ptl);
+		wait_migrate_huge_page(vma->anon_vma, pmdp);
+		goto out;
+	}
+
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
@@ -1329,12 +1344,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto clear_pmdnuma;
 	}
 
-	/*
-	 * If there are potential migrations, wait for completion and retry. We
-	 * do not relock and check_same as the page may no longer be mapped.
-	 * Furtermore, even if the page is currently misplaced, there is no
-	 * guarantee it is still misplaced after the migration completes.
-	 */
+	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
 		spin_unlock(ptl);
 		wait_on_page_locked(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index a987525..cfb4190 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1655,6 +1655,18 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
+bool pmd_trans_migrating(pmd_t pmd)
+{
+	struct page *page = pmd_page(pmd);
+	return PageLocked(page);
+}
+
+void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+	struct page *page = pmd_page(*pmd);
+	wait_on_page_locked(page);
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.8.4



* [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

There are a few subtle races between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time; however, compaction or the NUMA migration
code may come in and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the process.

This only affects x86, which has a somewhat optimistic pte_accessible. All
other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions (SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making pte_accessible
aware of the fact that PROT_NONE and PROT_NUMA memory may still be
accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.
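
[ A minimal sketch of the intended flow, assuming the small mm/mprotect.c
  hunk (listed in the diffstat but not visible in this excerpt) brackets
  the PTE updates with the new helpers: ]

    /* Writer side: change_protection_range() / change_prot_numa() */
    set_tlb_flush_pending(mm);              /* visible before any PTE changes */
    /* ... make the PTEs PROT_NONE / _PAGE_NUMA ... */
    flush_tlb_range(vma, start, end);
    clear_tlb_flush_pending(mm);            /* only after the flush completes */

    /*
     * Reader side: migration and compaction check pte_accessible(mm, pte),
     * which now also returns true for PROT_NONE/_PAGE_NUMA entries while
     * tlb_flush_pending(mm) is set, so they flush remote TLBs as well.
     */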

Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/sparc/include/asm/pgtable_64.h |  4 ++--
 arch/x86/include/asm/pgtable.h      | 11 ++++++++--
 include/asm-generic/pgtable.h       |  2 +-
 include/linux/mm_types.h            | 44 +++++++++++++++++++++++++++++++++++++
 kernel/fork.c                       |  1 +
 mm/huge_memory.c                    |  7 ++++++
 mm/mprotect.c                       |  2 ++
 mm/pgtable-generic.c                |  5 +++--
 8 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 8358dc1..0f9e945 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -619,7 +619,7 @@ static inline unsigned long pte_present(pte_t pte)
 }
 
 #define pte_accessible pte_accessible
-static inline unsigned long pte_accessible(pte_t a)
+static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
 {
 	return pte_val(a) & _PAGE_VALID;
 }
@@ -847,7 +847,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
-	if (likely(mm != &init_mm) && pte_accessible(orig))
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
 		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3d19994..48cab4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -452,9 +452,16 @@ static inline int pte_present(pte_t a)
 }
 
 #define pte_accessible pte_accessible
-static inline int pte_accessible(pte_t a)
+static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
-	return pte_flags(a) & _PAGE_PRESENT;
+	if (pte_flags(a) & _PAGE_PRESENT)
+		return true;
+
+	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+			tlb_flush_pending(mm))
+		return true;
+
+	return false;
 }
 
 static inline int pte_hidden(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f330d28..b12079a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -217,7 +217,7 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif
 
 #ifndef pte_accessible
-# define pte_accessible(pte)		((void)(pte),1)
+# define pte_accessible(mm, pte)	((void)(pte), 1)
 #endif
 
 #ifndef flush_tlb_fix_spurious_fault
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bd29941..c122bb1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,6 +443,14 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+	/*
+	 * An operation with batched TLB flushing is going on. Anything that
+	 * can move process memory needs to flush the TLB when moving a
+	 * PROT_NONE or PROT_NUMA mapped page.
+	 */
+	bool tlb_flush_pending;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
@@ -459,4 +467,40 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return mm->cpu_vm_mask_var;
 }
 
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+/*
+ * Memory barriers to keep this state in sync are graciously provided by
+ * the page table locks, outside of which no page table modifications happen.
+ * The barriers below prevent the compiler from re-ordering the instructions
+ * around the memory barriers that are already present in the code.
+ */
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	return mm->tlb_flush_pending;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+	mm->tlb_flush_pending = true;
+	barrier();
+}
+/* Clearing is done after a TLB flush, which also provides a barrier. */
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	mm->tlb_flush_pending = false;
+}
+#else
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	return false;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+#endif
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 728d5be..5721f0e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -537,6 +537,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	clear_tlb_flush_pending(mm);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3b6a75..e3a5ee2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,6 +1377,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
+	 * The page_table_lock above provides a memory barrier
+	 * with change_protection_range.
+	 */
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index eb2f349..9b1be30 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -187,6 +187,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	set_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -198,6 +199,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	clear_tlb_flush_pending(mm);
 
 	return pages;
 }
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e84cad2..a8b9199 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -110,9 +110,10 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pte_t *ptep)
 {
+	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
-	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	if (pte_accessible(pte))
+	pte = ptep_get_and_clear(mm, address, ptep);
+	if (pte_accessible(mm, pte))
 		flush_tlb_page(vma, address);
 	return pte;
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

THP migration can fail for a variety of reasons. Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.
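
The net effect is that the pending-flush check moves from
do_huge_pmd_numa_page() to migrate_misplaced_transhuge_page() and is only
made once migration is actually going ahead. A condensed view of the new
ordering (simplified from the hunks below):

	/* migrate_misplaced_transhuge_page(), after the target page is allocated */
	ptl = pmd_lock(mm, pmd);	/* PTL orders against change_protection_range */
	if (tlb_flush_pending(mm))
		flush_tlb_range(vma, mmun_start, mmun_end);
	spin_unlock(ptl);

	/* only now is the new page prepared and the copy started */
	__set_page_locked(new_page);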

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 7 -------
 mm/migrate.c     | 6 ++++++
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3a5ee2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,13 +1377,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * The page_table_lock above provides a memory barrier
-	 * with change_protection_range.
-	 */
-	if (tlb_flush_pending(mm))
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
-	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index cfb4190..5372521 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1759,6 +1759,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_fail;
 	}
 
+	/* PTL provides a memory barrier with change_protection_range */
+	ptl = pmd_lock(mm, pmd);
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, mmun_start, mmun_end);
+	spin_unlock(ptl);
+
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
 	SetPageSwapBacked(new_page);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
in mm/migrate.c. This patch makes them static.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5372521..77147bd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1593,7 +1593,8 @@ bool migrate_ratelimited(int node)
 }
 
 /* Returns true if the node is migrate rate-limited after the update */
-bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
+static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
+					unsigned long nr_pages)
 {
 	bool rate_limited = false;
 
@@ -1617,7 +1618,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
 	return rate_limited;
 }
 
-int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

NUMA migrate rate limiting protects a migration counter and window using
a lock, but in some cases this lock can be contended. It is not critical
that the page count be exact; lost updates are acceptable. Narrow the
scope of the lock accordingly.
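
In outline, numamigrate_update_ratelimit() ends up structured as below
(a simplified sketch of the mm/migrate.c hunk that follows):

	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
		spin_lock(&pgdat->numabalancing_migrate_lock);
		/* reset the counter and advance the window under the lock */
		spin_unlock(&pgdat->numabalancing_migrate_lock);
	}
	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
		return true;		/* rate limited, skip this migration */

	/* unlocked, non-atomic update; lost updates are tolerated */
	pgdat->numabalancing_migrate_nr_pages += nr_pages;
	return false;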

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  5 +----
 mm/migrate.c           | 21 ++++++++++++---------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..b835d3f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,10 +758,7 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * Lock serializing the per destination node AutoNUMA memory
-	 * migration rate limiting data.
-	 */
+	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
 
 	/* Rate limiting time interval */
diff --git a/mm/migrate.c b/mm/migrate.c
index 77147bd..8b560d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1596,26 +1596,29 @@ bool migrate_ratelimited(int node)
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 					unsigned long nr_pages)
 {
-	bool rate_limited = false;
-
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	spin_lock(&pgdat->numabalancing_migrate_lock);
 	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
+		spin_lock(&pgdat->numabalancing_migrate_lock);
 		pgdat->numabalancing_migrate_nr_pages = 0;
 		pgdat->numabalancing_migrate_next_window = jiffies +
 			msecs_to_jiffies(migrate_interval_millisecs);
+		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
-		rate_limited = true;
-	else
-		pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	spin_unlock(&pgdat->numabalancing_migrate_lock);
-	
-	return rate_limited;
+		return true;
+
+	/*
+	 * This is an unlocked non-atomic update so errors are possible.
+	 * The consequence is failing to migrate when we potentially should
+	 * have, which is not severe enough to warrant locking. If it is ever
+	 * a problem, it can be converted to a per-cpu counter.
+	 */
+	pgdat->numabalancing_migrate_nr_pages += nr_pages;
+	return false;
 }
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 15/18] mm: numa: Trace tasks that fail migration due to rate limiting
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

A low local/remote NUMA hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when a
migration fails due to rate limiting.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/migrate.h | 26 ++++++++++++++++++++++++++
 mm/migrate.c                   |  5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index ec2a6cc..3075ffb 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -45,6 +45,32 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
 
+TRACE_EVENT(mm_numa_migrate_ratelimit,
+
+	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
+
+	TP_ARGS(p, dst_nid, nr_pages),
+
+	TP_STRUCT__entry(
+		__array(	char,		comm,	TASK_COMM_LEN)
+		__field(	pid_t,		pid)
+		__field(	int,		dst_nid)
+		__field(	unsigned long,	nr_pages)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->dst_nid	= dst_nid;
+		__entry->nr_pages	= nr_pages;
+	),
+
+	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
+		__entry->comm,
+		__entry->pid,
+		__entry->dst_nid,
+		__entry->nr_pages)
+);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 8b560d5..9f53c00 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1608,8 +1608,11 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 			msecs_to_jiffies(migrate_interval_millisecs);
 		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
+	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
+		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
+								nr_pages);
 		return true;
+	}
 
 	/*
 	 * This is an unlocked non-atomic update so errors are possible.
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

KSM pages can be shared between tasks that are not necessarily related
to each other from a NUMA perspective. This patch causes those pages to
be ignored by automatic NUMA balancing so they do not migrate and do not
cause unrelated tasks to be grouped together.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9b1be30..c258137 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -23,6 +23,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
+#include <linux/ksm.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -63,7 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
-				if (page) {
+				if (page && !PageKsm(page)) {
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 17/18] sched: Tracepoint task movement
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

move_task() is called from move_one_task() and move_tasks() and is a
reasonable approximation of load balancer activity. A tracepoint there
lets us track tasks that move between CPUs frequently and, because it
records node information, distinguish in-node from between-node traffic
when examining load balancer decisions. The tracepoint allows local
migrations, remote migrations and the average number of task migrations
to be tracked.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/sched.h | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  2 ++
 2 files changed, 37 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 04c3084..cf1694c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -443,6 +443,41 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+TRACE_EVENT(sched_move_task,
+
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	pid			)
+		__field( pid_t,	tgid			)
+		__field( pid_t,	ngid			)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task_pid_nr(tsk);
+		__entry->tgid		= task_tgid_nr(tsk);
+		__entry->ngid		= task_numa_group_id(tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d",
+			__entry->pid, __entry->tgid, __entry->ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ce1615..41021c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4770,6 +4770,8 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
+
+	trace_sched_move_task(p, env->src_cpu, env->dst_cpu);
 }
 
 /*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 18/18] sched: Add tracepoints related to NUMA task migration
  2013-12-09  7:08 ` Mel Gorman
@ 2013-12-09  7:09   ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

This patch adds three tracepoints
 o trace_sched_move_numa	when a task is moved to a node
 o trace_sched_swap_numa	when a task is swapped with another task
 o trace_sched_stick_numa	when a numa-related migration fails

The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated

 o NUMA migrated stuck	 nr trace_sched_stick_numa
 o NUMA migrated idle	 nr trace_sched_move_numa
 o NUMA migrated swapped nr trace_sched_swap_numa
 o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
 o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
 o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
			 Maybe a small number of these are acceptable
			 but a high number would be a major surprise.
			 It would be even worse if bounces are frequent.
 o NUMA avg task migs.	 Average number of migrations for tasks
 o NUMA stddev task mig	 Self-explanatory
 o NUMA max task migs.	 Maximum number of migrations for a single task

In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount of
useless work.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/core.c          |  2 ++
 kernel/sched/fair.c          |  6 ++--
 3 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cf1694c..f0c54e3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -443,11 +443,7 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
-/*
- * Tracks migration of tasks from one runqueue to another. Can be used to
- * detect if automatic NUMA balancing is bouncing between nodes
- */
-TRACE_EVENT(sched_move_task,
+DECLARE_EVENT_CLASS(sched_move_task_template,
 
 	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
 
@@ -478,6 +474,68 @@ TRACE_EVENT(sched_move_task,
 			__entry->src_cpu, __entry->src_nid,
 			__entry->dst_cpu, __entry->dst_nid)
 );
+
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+DEFINE_EVENT(sched_move_task_template, sched_move_task,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+DEFINE_EVENT(sched_move_task_template, sched_move_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+TRACE_EVENT(sched_swap_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	src_pid			)
+		__field( pid_t,	src_tgid		)
+		__field( pid_t,	src_ngid		)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( pid_t,	dst_pid			)
+		__field( pid_t,	dst_tgid		)
+		__field( pid_t,	dst_ngid		)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->src_pid	= task_pid_nr(src_tsk);
+		__entry->src_tgid	= task_tgid_nr(src_tsk);
+		__entry->src_ngid	= task_numa_group_id(src_tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_pid	= task_pid_nr(dst_tsk);
+		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
+		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
+			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c180860..3980110 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1108,6 +1108,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
 		goto out;
 
+	trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
 out:
@@ -4091,6 +4092,7 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	/* TODO: This is not properly updating schedstats */
 
+	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41021c8..aac8c65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
 	p->numa_scan_period = task_scan_min(p);
 
 	if (env.best_task == NULL) {
-		int ret = migrate_task_to(p, env.best_cpu);
+		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
+			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	if ((ret = migrate_swap(p, env.best_task)) != 0)
+		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
 	return ret;
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-09  7:09   ` Mel Gorman
  (?)
@ 2013-12-09  7:20   ` Wanpeng Li
  -1 siblings, 0 replies; 91+ messages in thread
From: Wanpeng Li @ 2013-12-09  7:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

Hi Mel,
On Mon, Dec 09, 2013 at 07:09:07AM +0000, Mel Gorman wrote:
>numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
>in mm/migrate.c. This patch makes them static.
>

I have already sent out patches to fix this issue yesterday. ;-)

http://marc.info/?l=linux-mm&m=138648332222847&w=2
http://marc.info/?l=linux-mm&m=138648332422848&w=2

Regards,
Wanpeng Li 

>Signed-off-by: Mel Gorman <mgorman@suse.de>
>---
> mm/migrate.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
>diff --git a/mm/migrate.c b/mm/migrate.c
>index 5372521..77147bd 100644
>--- a/mm/migrate.c
>+++ b/mm/migrate.c
>@@ -1593,7 +1593,8 @@ bool migrate_ratelimited(int node)
> }
>
> /* Returns true if the node is migrate rate-limited after the update */
>-bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
>+static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
>+					unsigned long nr_pages)
> {
> 	bool rate_limited = false;
>
>@@ -1617,7 +1618,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
> 	return rate_limited;
> }
>
>-int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>+static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> {
> 	int page_lru;
>
>-- 
>1.8.4

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
       [not found]   ` <20131209072010.GA3716@hacker.(null)>
@ 2013-12-09  8:46       ` Mel Gorman
  0 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  8:46 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Mon, Dec 09, 2013 at 03:20:10PM +0800, Wanpeng Li wrote:
> Hi Mel,
> On Mon, Dec 09, 2013 at 07:09:07AM +0000, Mel Gorman wrote:
> >numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
> >in mm/migrate.c. This patch makes them static.
> >
> 
> I have already sent out patches to fix this issue yesterday. ;-)
> 
> http://marc.info/?l=linux-mm&m=138648332222847&w=2
> http://marc.info/?l=linux-mm&m=138648332422848&w=2
> 

I know. I had written the patch some time ago waiting to go out with
the TLB flush fix and just didn't bother dropping it in response to your
series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-09  8:46       ` Mel Gorman
  (?)
@ 2013-12-09  8:57       ` Wanpeng Li
  -1 siblings, 0 replies; 91+ messages in thread
From: Wanpeng Li @ 2013-12-09  8:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Mon, Dec 09, 2013 at 08:46:59AM +0000, Mel Gorman wrote:
>On Mon, Dec 09, 2013 at 03:20:10PM +0800, Wanpeng Li wrote:
>> Hi Mel,
>> On Mon, Dec 09, 2013 at 07:09:07AM +0000, Mel Gorman wrote:
>> >numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
>> >in mm/migrate.c. This patch makes them static.
>> >
>> 
>> I already sent out patches to fix this issue yesterday. ;-)
>> 
>> http://marc.info/?l=linux-mm&m=138648332222847&w=2
>> http://marc.info/?l=linux-mm&m=138648332422848&w=2
>> 
>
>I know. I had written the patch some time ago waiting to go out with
>the TLB flush fix and just didn't bother dropping it in response to your
>series.

Ok, could you review my patchset v3? Thanks in advance. ;-)

Regards,
Wanpeng Li 

>
>-- 
>Mel Gorman
>SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
       [not found]       ` <20131209085720.GA16251@hacker.(null)>
@ 2013-12-09  9:08           ` Mel Gorman
  0 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-09  9:08 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Mon, Dec 09, 2013 at 04:57:20PM +0800, Wanpeng Li wrote:
> On Mon, Dec 09, 2013 at 08:46:59AM +0000, Mel Gorman wrote:
> >On Mon, Dec 09, 2013 at 03:20:10PM +0800, Wanpeng Li wrote:
> >> Hi Mel,
> >> On Mon, Dec 09, 2013 at 07:09:07AM +0000, Mel Gorman wrote:
> >> >numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
> >> >in mm/migrate.c. This patch makes them static.
> >> >
> >> 
> >> I already sent out patches to fix this issue yesterday. ;-)
> >> 
> >> http://marc.info/?l=linux-mm&m=138648332222847&w=2
> >> http://marc.info/?l=linux-mm&m=138648332422848&w=2
> >> 
> >
> >I know. I had written the patch some time ago waiting to go out with
> >the TLB flush fix and just didn't bother dropping it in response to your
> >series.
> 
> Ok, could you review my patchset v3? Thanks in advance. ;-)
> 

Glanced through it this morning and saw nothing wrong. I expect it'll
get picked up in due course.

Thanks

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-09  9:08           ` Mel Gorman
  (?)
@ 2013-12-09  9:13           ` Wanpeng Li
  -1 siblings, 0 replies; 91+ messages in thread
From: Wanpeng Li @ 2013-12-09  9:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Mon, Dec 09, 2013 at 09:08:34AM +0000, Mel Gorman wrote:
>On Mon, Dec 09, 2013 at 04:57:20PM +0800, Wanpeng Li wrote:
>> On Mon, Dec 09, 2013 at 08:46:59AM +0000, Mel Gorman wrote:
>> >On Mon, Dec 09, 2013 at 03:20:10PM +0800, Wanpeng Li wrote:
>> >> Hi Mel,
>> >> On Mon, Dec 09, 2013 at 07:09:07AM +0000, Mel Gorman wrote:
>> >> >numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
>> >> >in mm/migrate.c. This patch makes them static.
>> >> >
>> >> 
>> >> I already sent out patches to fix this issue yesterday. ;-)
>> >> 
>> >> http://marc.info/?l=linux-mm&m=138648332222847&w=2
>> >> http://marc.info/?l=linux-mm&m=138648332422848&w=2
>> >> 
>> >
>> >I know. I had written the patch some time ago waiting to go out with
>> >the TLB flush fix and just didn't bother dropping it in response to your
>> >series.
>> 
>> Ok, could you review my patchset v3? Thanks in advance. ;-)
>> 
>
>Glanced through it this morning and saw nothing wrong. I expect it'll
>get picked up in due course.

Thanks Mel. ;-)

Regards,
Wanpeng Li 

>
>Thanks
>
>-- 
>Mel Gorman
>SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-09  7:08   ` Mel Gorman
@ 2013-12-09 14:08     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:08 AM, Mel Gorman wrote:
> Base pages are unmapped and flushed from cache and TLB during normal page
> migration and replaced with a migration entry that causes any parallel fault or
> gup to block until migration completes. THP does not unmap pages due to
> a lack of support for migration entries at a PMD level. This allows races
> with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm:
> Close races between THP migration and PMD numa clearing") made worse by
> introducing a pmd_clear_flush().
> 
> This patch forces get_user_page (fast and normal) on a pmd_numa page to
> go through the slow get_user_page path where it will serialise against THP
> migration and properly account for the NUMA hinting fault. On the migration
> side the page table lock is taken for each PTE update.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
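
The gup_fast side of this fix amounts to bailing out to the slow path as soon
as a NUMA hinting entry is seen. A rough sketch of that check follows; the
arch/x86/mm/gup.c hunks are not quoted in this thread, so the surrounding
names here are assumptions:

        /* sketch: arch/x86/mm/gup.c, inside gup_pmd_range() */
        pmd_t pmd = *pmdp;

        if (pmd_none(pmd) || pmd_trans_splitting(pmd))
                return 0;
        if (unlikely(pmd_large(pmd))) {
                /*
                 * NUMA hinting faults are handled in the GUP slowpath so
                 * they are accounted and serialised against THP migration;
                 * returning 0 forces the fallback to get_user_pages().
                 */
                if (pmd_numa(pmd))
                        return 0;
                if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
                        return 0;
        } else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
                return 0;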

* Re: [PATCH 02/18] mm: numa: Call MMU notifiers on THP migration
  2013-12-09  7:08   ` Mel Gorman
@ 2013-12-09 14:09     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:08 AM, Mel Gorman wrote:
> MMU notifiers must be called on THP page migration or secondary MMUs will
> get very confused.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
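
In practice this means bracketing the THP migration with the usual invalidate
calls. A sketch of the shape of the change in migrate_misplaced_transhuge_page();
the range boundaries and placement are assumptions, as only the later use of
mmun_start/mmun_end is visible in this thread:

        /* sketch: mm/migrate.c, migrate_misplaced_transhuge_page() */
        spinlock_t *ptl;
        unsigned long mmun_start = address & HPAGE_PMD_MASK;
        unsigned long mmun_end   = mmun_start + HPAGE_PMD_SIZE;

        mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
        ptl = pmd_lock(mm, pmd);
        /* ... replace the old huge PMD with one pointing at new_page ... */
        spin_unlock(ptl);
        mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);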

* Re: [PATCH 03/18] mm: Clear pmd_numa before invalidating
  2013-12-09  7:08   ` Mel Gorman
@ 2013-12-09 14:14     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:08 AM, Mel Gorman wrote:
> pmdp_invalidate clears the present bit without taking into account that it
> might be in the _PAGE_NUMA bit leaving the PMD in an unexpected state. Clear
> pmd_numa before invalidating.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
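
The fix boils down to stripping the NUMA bit before the present bit is
cleared, so the invalidated PMD cannot be misread as a NUMA hinting entry.
A sketch of pmdp_invalidate() with that ordering, reconstructed from the
changelog rather than from the patch itself:

        /* sketch: mm/pgtable-generic.c */
        void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                             pmd_t *pmdp)
        {
                pmd_t entry = *pmdp;

                /* clear pmd_numa first so mknotpresent acts on a sane entry */
                if (pmd_numa(entry))
                        entry = pmd_mknonnuma(entry);
                set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
                flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
        }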

* Re: [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan
  2013-12-09  7:08   ` Mel Gorman
@ 2013-12-09 14:22     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:22 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:08 AM, Mel Gorman wrote:
> If the PMD is flushed then a parallel fault in handle_mm_fault() will enter
> the pmd_none and do_huge_pmd_anonymous_page() path where it'll attempt
> to insert a huge zero page. This is wasteful so the patch avoids clearing
> the PMD when setting pmd_numa.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
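
Concretely, the prot_numa pass no longer goes through pmdp_get_and_clear() at
all; it marks the existing entry in place so a parallel fault never observes
pmd_none. A sketch of the intended change_huge_pmd() flow, with variable names
assumed from context:

        /* sketch: mm/huge_memory.c, change_huge_pmd() */
        if (prot_numa) {
                pmd_t entry = *pmd;                 /* do not clear the PMD... */
                if (!pmd_numa(entry)) {
                        entry = pmd_mknuma(entry);  /* ...just mark it */
                        set_pmd_at(mm, addr, pmd, entry);
                }
        } else {
                pmd_t entry = pmdp_get_and_clear(mm, addr, pmd);
                entry = pmd_modify(entry, newprot);
                set_pmd_at(mm, addr, pmd, entry);
        }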

* Re: [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update
  2013-12-09  7:08   ` Mel Gorman
@ 2013-12-09 14:31     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:08 AM, Mel Gorman wrote:
> The TLB must be flushed if the PTE is updated but change_pte_range is clearing
> the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it
> reinserts the same entry. Without the flush, it's conceivable that two processors
> have different TLB entries for the same virtual address and at the very least it would
> generate spurious faults. This patch only unmaps the pages in change_pte_range for
> a full protection change.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
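
This is the base-page counterpart of the previous patch: when only setting
pte_numa, the PTE is updated in place instead of being cleared and reinserted,
so there is no window with a cleared PTE and no flush owed for an entry that
did not really change. A sketch of change_pte_range() along those lines; the
hunk itself is not quoted here, so treat the exact form as an assumption:

        /* sketch: mm/mprotect.c, change_pte_range() */
        if (prot_numa) {
                pte_t ptent = *pte;             /* leave the PTE mapped */
                if (!pte_numa(ptent)) {
                        ptent = pte_mknuma(ptent);
                        set_pte_at(mm, addr, pte, ptent);
                }
        } else {
                pte_t ptent = ptep_modify_prot_start(mm, addr, pte);
                ptent = pte_modify(ptent, newprot);
                ptep_modify_prot_commit(mm, addr, pte, ptent);
        }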

* Re: [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 14:34     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> The anon_vma lock prevents parallel THP splits and any associated complexity
> that arises when handling splits during THP migration. This patch checks
> if the lock was successfully acquired and bails from THP migration if it
> failed for any reason.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 14:42     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> If a PMD changes during a THP migration then migration aborts but the
> failure path is doing more work than is necessary.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/18] sched: numa: Skip inaccessible VMAs
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 14:50     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 14:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> Inaccessible VMAs should not be trapping NUMA hinting faults. Skip them.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
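
The check itself is a one-liner in the scanner: skip any VMA whose protection
bits allow no access at all, since a hinting fault there can never be handled
usefully. Sketched against the task_numa_work() VMA walk; the placement is an
assumption:

        /* sketch: kernel/sched/fair.c, task_numa_work() */
        for (; vma; vma = vma->vm_next) {
                if (!vma_migratable(vma))
                        continue;
                /*
                 * Inaccessible (e.g. PROT_NONE) VMAs cannot take useful
                 * NUMA hinting faults, so do not trap them.
                 */
                if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
                        continue;
                /* ... change_prot_numa() on the scan window ... */
        }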

* Re: [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 15:57     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 15:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> On a protection change it is no longer clear if the page should be still
> accessible.  This patch clears the NUMA hinting fault bits on a protection
> change.

I had to think about this one, because my first thought was
"wait, aren't NUMA ptes inaccessible already?".

Then I thought about doing things like adding read or write
permission in the mprotect, eg. changing from PROT_NONE to
PROT_READ ... and it became unclear what to do with the NUMA
bit in that case...

This patch clears up some confusing situations :)

> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
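
Concretely, for an ordinary mprotect() (prot_numa not set) any stale
_PAGE_NUMA bit is dropped so the new protections alone decide accessibility.
A sketch of the kind of check this adds to the !prot_numa path of
change_pte_range(); the placement is an assumption:

        /* sketch: mm/mprotect.c, change_pte_range(), !prot_numa path */
        ptent = ptep_modify_prot_start(mm, addr, pte);
        ptent = pte_modify(ptent, newprot);

        /*
         * A protection change invalidates any earlier NUMA hinting
         * decision; do not leave the hinting bit behind.
         */
        if (pte_numa(ptent))
                ptent = pte_mknonnuma(ptent);

        ptep_modify_prot_commit(mm, addr, pte, ptent);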

* Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:10     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> do_huge_pmd_numa_page() handles the case where there is parallel THP
> migration.  However, by the time it is checked the NUMA hinting information
> has already been disrupted. This patch adds an earlier check with some helpers.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
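
The "helpers" referred to above test whether a THP is currently being migrated
and, if so, wait for the migration instead of disturbing the hinting state. A
very rough sketch of their shape; the real names and details are in the patch,
which is not quoted in this thread, so treat all of this as assumptions:

        /* sketch: include/linux/migrate.h */
        static inline bool pmd_trans_migrating(pmd_t pmd)
        {
                /* the THP migration path holds the page lock throughout */
                return PageLocked(pmd_page(pmd));
        }

        /* sketch: early in do_huge_pmd_numa_page(), before touching state */
        if (unlikely(pmd_trans_migrating(*pmdp))) {
                spin_unlock(ptl);
                wait_on_page_locked(pmd_page(*pmdp));
                goto out;
        }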

* Re: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:13     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:

> diff --git a/mm/migrate.c b/mm/migrate.c
> index cfb4190..5372521 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1759,6 +1759,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  		goto out_fail;
>  	}
>  
> +	/* PTL provides a memory barrier with change_protection_range */
> +	ptl = pmd_lock(mm, pmd);
> +	if (tlb_flush_pending(mm))
> +		flush_tlb_range(vma, mmun_start, mmun_end);
> +	spin_unlock(ptl);
> +
>  	/* Prepare a page as a migration target */
>  	__set_page_locked(new_page);
>  	SetPageSwapBacked(new_page);

I don't think there is a need for that extra memory barrier.

On the "set_tlb_flush_pending, turn ptes into NUMA ones" side, we
have a barrier in the form of the page table lock.

We only end up in this code path if the pte/pmd already is a NUMA
one, and we take several spinlocks on the way to this test.
That provides the memory barrier in this code path.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
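
For readers following the barrier argument: the flag read in the quoted hunk
pairs with the side that sets it before the NUMA PTE/PMD update. Roughly, with
the setter names taken on trust from the (unquoted) TLB flush fix earlier in
the series:

        /* change_protection() / NUMA scan side */
        set_tlb_flush_pending(mm);
        /* take the PTL, turn entries into NUMA ones, drop the PTL */
        flush_tlb_range(vma, start, end);
        clear_tlb_flush_pending(mm);

        /* migrate_misplaced_transhuge_page() side, as quoted above */
        ptl = pmd_lock(mm, pmd);
        if (tlb_flush_pending(mm))
                flush_tlb_range(vma, mmun_start, mmun_end);
        spin_unlock(ptl);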

* Re: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:14     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
> in mm/migrate.c. This patch makes them static.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:47     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> NUMA migrate rate limiting protects a migration counter and window using
> a lock but in some cases this can be a contended lock. It is not
> critical that the number of pages be perfect; lost updates are
> acceptable. Reduce the importance of this lock.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
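
The shape of the change is to keep the spinlock only for resetting the
rate-limiting window and to let the per-window page counter be updated
outside it, accepting the occasional lost update. Roughly, using the field
names of the existing rate-limiting code (the exact form is assumed):

        /* sketch: mm/migrate.c, numamigrate_update_ratelimit() */
        if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
                spin_lock(&pgdat->numabalancing_migrate_lock);
                pgdat->numabalancing_migrate_nr_pages = 0;
                pgdat->numabalancing_migrate_next_window = jiffies +
                        msecs_to_jiffies(migrate_interval_millisecs);
                spin_unlock(&pgdat->numabalancing_migrate_lock);
        }
        if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
                return true;

        /* racy, but an exact count is not required here */
        pgdat->numabalancing_migrate_nr_pages += nr_pages;
        return false;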

* Re: [PATCH 15/18] mm: numa: Trace tasks that fail migration due to rate limiting
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:57     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> A low local/remote numa hinting fault ratio is potentially explained by
> failed migrations. This patch adds a tracepoint that fires when migration
> fails due to migration rate limitation.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 16:57     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 16:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> KSM pages can be shared between tasks that are not necessarily related
> to each other from a NUMA perspective. This patch causes those pages to
> be ignored by automatic NUMA balancing so they do not migrate and do not
> cause unrelated tasks to be grouped together.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread
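
The exclusion happens at scan time: while PTEs are being marked for hinting
faults, KSM pages are simply skipped, so they never reach the NUMA migration
path at all. A sketch of the check, assumed to sit in the prot_numa path of
change_pte_range():

        /* sketch: mm/mprotect.c, change_pte_range(), prot_numa path */
        if (prot_numa) {
                struct page *page;

                page = vm_normal_page(vma, addr, oldpte);
                /* KSM pages are shared by unrelated tasks: leave them be */
                if (!page || PageKsm(page))
                        continue;
                if (!pte_numa(oldpte)) {
                        ptent = pte_mknuma(oldpte);
                        set_pte_at(mm, addr, pte, ptent);
                }
        }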

* Re: [PATCH 17/18] sched: Tracepoint task movement
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 18:54     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 18:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML, Drew Jones

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> move_task() is called from move_one_task and move_tasks and is an
> approximation of load balancer activity. We should be able to track
> tasks that move between CPUs frequently. If the tracepoint included node
> information then we could distinguish between in-node and between-node
> traffic for load balancer decisions. The tracepoint allows us to track
> local migrations, remote migrations and average task migrations.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Does this replicate the task_sched_migrate_task tracepoint in
set_task_cpu() ?

I know Drew has been using that tracepoint in his (still experimental)
numatop script. Drew, does this tracepoint look better than the trace
point that you are currently using, or is it similar enough that we do
not really benefit from this addition?

> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 04c3084..cf1694c 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -443,6 +443,41 @@ TRACE_EVENT(sched_process_hang,
>  );
>  #endif /* CONFIG_DETECT_HUNG_TASK */
>  
> +/*
> + * Tracks migration of tasks from one runqueue to another. Can be used to
> + * detect if automatic NUMA balancing is bouncing between nodes
> + */
> +TRACE_EVENT(sched_move_task,
> +
> +	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +
> +	TP_ARGS(tsk, src_cpu, dst_cpu),
> +
> +	TP_STRUCT__entry(
> +		__field( pid_t,	pid			)
> +		__field( pid_t,	tgid			)
> +		__field( pid_t,	ngid			)
> +		__field( int,	src_cpu			)
> +		__field( int,	src_nid			)
> +		__field( int,	dst_cpu			)
> +		__field( int,	dst_nid			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->pid		= task_pid_nr(tsk);
> +		__entry->tgid		= task_tgid_nr(tsk);
> +		__entry->ngid		= task_numa_group_id(tsk);
> +		__entry->src_cpu	= src_cpu;
> +		__entry->src_nid	= cpu_to_node(src_cpu);
> +		__entry->dst_cpu	= dst_cpu;
> +		__entry->dst_nid	= cpu_to_node(dst_cpu);
> +	),
> +
> +	TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d",
> +			__entry->pid, __entry->tgid, __entry->ngid,
> +			__entry->src_cpu, __entry->src_nid,
> +			__entry->dst_cpu, __entry->dst_nid)
> +);
>  #endif /* _TRACE_SCHED_H */
>  
>  /* This part must be outside protection */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1ce1615..41021c8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4770,6 +4770,8 @@ static void move_task(struct task_struct *p, struct lb_env *env)
>  	set_task_cpu(p, env->dst_cpu);
>  	activate_task(env->dst_rq, p, 0);
>  	check_preempt_curr(env->dst_rq, p, 0);
> +
> +	trace_sched_move_task(p, env->src_cpu, env->dst_cpu);
>  }
>  
>  /*
> 


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 18/18] sched: Add tracepoints related to NUMA task migration
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-09 19:06     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-09 19:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/09/2013 02:09 AM, Mel Gorman wrote:
> This patch adds three tracepoints
>  o trace_sched_move_numa	when a task is moved to a node
>  o trace_sched_swap_numa	when a task is swapped with another task
>  o trace_sched_stick_numa	when a numa-related migration fails
> 
> The tracepoints allow the NUMA scheduler activity to be monitored and the
> following high-level metrics can be calculated
> 
>  o NUMA migrated stuck	 nr trace_sched_stick_numa
>  o NUMA migrated idle	 nr trace_sched_move_numa
>  o NUMA migrated swapped nr trace_sched_swap_numa
>  o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
>  o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
>  o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
> 			 Maybe a small number of these are acceptable
> 			 but a high number would be a major surprise.
> 			 It would be even worse if bounces are frequent.
>  o NUMA avg task migs.	 Average number of migrations for tasks
>  o NUMA stddev task mig	 Self-explanatory
>  o NUMA max task migs.	 Maximum number of migrations for a single task
> 
> In general the intent of the tracepoints is to help diagnose problems
> where automatic NUMA balancing appears to be doing an excessive amount of
> useless work.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index cf1694c..f0c54e3 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -443,11 +443,7 @@ TRACE_EVENT(sched_process_hang,
>  );
>  #endif /* CONFIG_DETECT_HUNG_TASK */
>  
> -/*
> - * Tracks migration of tasks from one runqueue to another. Can be used to
> - * detect if automatic NUMA balancing is bouncing between nodes
> - */
> -TRACE_EVENT(sched_move_task,
> +DECLARE_EVENT_CLASS(sched_move_task_template,
>  
>  	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
>  
> @@ -478,6 +474,68 @@ TRACE_EVENT(sched_move_task,
>  			__entry->src_cpu, __entry->src_nid,
>  			__entry->dst_cpu, __entry->dst_nid)
>  );
> +
> +/*
> + * Tracks migration of tasks from one runqueue to another. Can be used to
> + * detect if automatic NUMA balancing is bouncing between nodes
> + */
> +DEFINE_EVENT(sched_move_task_template, sched_move_task,
> +	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +
> +	TP_ARGS(tsk, src_cpu, dst_cpu)
> +);
> +
> +DEFINE_EVENT(sched_move_task_template, sched_move_numa,
> +	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +
> +	TP_ARGS(tsk, src_cpu, dst_cpu)
> +);
> +
> +DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
> +	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +
> +	TP_ARGS(tsk, src_cpu, dst_cpu)
> +);
> +
> +TRACE_EVENT(sched_swap_numa,
> +
> +	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
> +		 struct task_struct *dst_tsk, int dst_cpu),
> +
> +	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
> +
> +	TP_STRUCT__entry(
> +		__field( pid_t,	src_pid			)
> +		__field( pid_t,	src_tgid		)
> +		__field( pid_t,	src_ngid		)
> +		__field( int,	src_cpu			)
> +		__field( int,	src_nid			)
> +		__field( pid_t,	dst_pid			)
> +		__field( pid_t,	dst_tgid		)
> +		__field( pid_t,	dst_ngid		)
> +		__field( int,	dst_cpu			)
> +		__field( int,	dst_nid			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->src_pid	= task_pid_nr(src_tsk);
> +		__entry->src_tgid	= task_tgid_nr(src_tsk);
> +		__entry->src_ngid	= task_numa_group_id(src_tsk);
> +		__entry->src_cpu	= src_cpu;
> +		__entry->src_nid	= cpu_to_node(src_cpu);
> +		__entry->dst_pid	= task_pid_nr(dst_tsk);
> +		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
> +		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
> +		__entry->dst_cpu	= dst_cpu;
> +		__entry->dst_nid	= cpu_to_node(dst_cpu);
> +	),
> +
> +	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
> +			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
> +			__entry->src_cpu, __entry->src_nid,
> +			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
> +			__entry->dst_cpu, __entry->dst_nid)
> +);
>  #endif /* _TRACE_SCHED_H */
>  
>  /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c180860..3980110 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1108,6 +1108,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
>  	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
>  		goto out;
>  
> +	trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
>  	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
>  
>  out:
> @@ -4091,6 +4092,7 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
>  
>  	/* TODO: This is not properly updating schedstats */
>  
> +	trace_sched_move_numa(p, curr_cpu, target_cpu);
>  	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>  }
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41021c8..aac8c65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
>  	p->numa_scan_period = task_scan_min(p);
>  
>  	if (env.best_task == NULL) {
> -		int ret = migrate_task_to(p, env.best_cpu);
> +		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
> +			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
>  		return ret;
>  	}
>  
> -	ret = migrate_swap(p, env.best_task);
> +	if ((ret = migrate_swap(p, env.best_task)) != 0)
> +		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
>  	put_task_struct(env.best_task);
>  	return ret;
>  }
> 


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 17/18] sched: Tracepoint task movement
  2013-12-09 18:54     ` Rik van Riel
@ 2013-12-10  8:42       ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-10  8:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML, Drew Jones

On Mon, Dec 09, 2013 at 01:54:51PM -0500, Rik van Riel wrote:
> On 12/09/2013 02:09 AM, Mel Gorman wrote:
> > move_task() is called from move_one_task and move_tasks and is an
> > approximation of load balancer activity. We should be able to track
> > tasks that move between CPUs frequently. If the tracepoint included node
> > information then we could distinguish between in-node and between-node
> > traffic for load balancer decisions. The tracepoint allows us to track
> > local migrations, remote migrations and average task migrations.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Does this replicate the task_sched_migrate_task tracepoint in
> set_task_cpu() ?
> 

There is significant overlap but some bits are missing. We do not necessarily know
where the task was previously running and whether this is a local->remote
migration. We also cannot tell the difference between load balancer activity,
numa balancing and try_to_wake_up. Still, you're right, this patch is not
painting a full picture either. I'll drop it for now and look at improving
the existing task_sched_migrate_task tracepoint.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread
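
For comparison, "improving the existing tracepoint" would mostly mean deriving
and recording node information at fire time, much as the quoted sched_move_task
does. A sketch of the extra fields such a change might add to sched_migrate_task;
none of this comes from the thread, so treat it purely as an illustration:

        /* additional TP_STRUCT__entry fields */
        __field( int,   src_nid )
        __field( int,   dst_nid )

        /* and in TP_fast_assign(), mirroring the sched_move_task event above */
        __entry->src_nid = cpu_to_node(task_cpu(p));
        __entry->dst_nid = cpu_to_node(dest_cpu);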

* Re: [PATCH 17/18] sched: Tracepoint task movement
  2013-12-09 18:54     ` Rik van Riel
@ 2013-12-10  9:06       ` Andrew Jones
  -1 siblings, 0 replies; 91+ messages in thread
From: Andrew Jones @ 2013-12-10  9:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Andrew Morton, Alex Thorlton, Linux-MM, LKML

On Mon, Dec 09, 2013 at 01:54:51PM -0500, Rik van Riel wrote:
> On 12/09/2013 02:09 AM, Mel Gorman wrote:
> > move_task() is called from move_one_task and move_tasks and is an
> > approximation of load balancer activity. We should be able to track
> > tasks that move between CPUs frequently. If the tracepoint included node
> > information then we could distinguish between in-node and between-node
> > traffic for load balancer decisions. The tracepoint allows us to track
> > local migrations, remote migrations and average task migrations.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Does this replicate the task_sched_migrate_task tracepoint in
> set_task_cpu() ?
> 
> I know Drew has been using that tracepoint in his (still experimental)
> numatop script. Drew, does this tracepoint look better than the trace
> point that you are currently using, or is it similar enough that we do
> not really benefit from this addition?

Right, sched::sched_migrate_task only gives us pid, orig_cpu, and
dest_cpu, but all the fields below are important. The numamigtop script
has been extracting/using all that information as well, but by using the
pid and /proc, plus a cpu-node map built from /sys info. I agree with Mel
that enhancing the tracepoint is a good idea. Doing so would allow trace
data of this sort to be analyzed without a tool, or at least with a much
simpler tool.
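
For what it's worth, the cpu->node half of that map is the easy part. A
minimal userspace sketch, using libnuma's numa_node_of_cpu() rather than
parsing /sys by hand as the script currently does (build with
"gcc cpu2node.c -o cpu2node -lnuma"):

/* cpu2node.c: dump the cpu -> node map a trace post-processing script
 * needs when the event itself does not carry nid fields.
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
	int cpu;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	for (cpu = 0; cpu < numa_num_configured_cpus(); cpu++)
		printf("cpu%d node%d\n", cpu, numa_node_of_cpu(cpu));

	return 0;
}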

drew

> 
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index 04c3084..cf1694c 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -443,6 +443,41 @@ TRACE_EVENT(sched_process_hang,
> >  );
> >  #endif /* CONFIG_DETECT_HUNG_TASK */
> >  
> > +/*
> > + * Tracks migration of tasks from one runqueue to another. Can be used to
> > + * detect if automatic NUMA balancing is bouncing between nodes
> > + */
> > +TRACE_EVENT(sched_move_task,
> > +
> > +	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> > +
> > +	TP_ARGS(tsk, src_cpu, dst_cpu),
> > +
> > +	TP_STRUCT__entry(
> > +		__field( pid_t,	pid			)
> > +		__field( pid_t,	tgid			)
> > +		__field( pid_t,	ngid			)
> > +		__field( int,	src_cpu			)
> > +		__field( int,	src_nid			)
> > +		__field( int,	dst_cpu			)
> > +		__field( int,	dst_nid			)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->pid		= task_pid_nr(tsk);
> > +		__entry->tgid		= task_tgid_nr(tsk);
> > +		__entry->ngid		= task_numa_group_id(tsk);
> > +		__entry->src_cpu	= src_cpu;
> > +		__entry->src_nid	= cpu_to_node(src_cpu);
> > +		__entry->dst_cpu	= dst_cpu;
> > +		__entry->dst_nid	= cpu_to_node(dst_cpu);
> > +	),
> > +
> > +	TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d",
> > +			__entry->pid, __entry->tgid, __entry->ngid,
> > +			__entry->src_cpu, __entry->src_nid,
> > +			__entry->dst_cpu, __entry->dst_nid)
> > +);
> >  #endif /* _TRACE_SCHED_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1ce1615..41021c8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4770,6 +4770,8 @@ static void move_task(struct task_struct *p, struct lb_env *env)
> >  	set_task_cpu(p, env->dst_cpu);
> >  	activate_task(env->dst_rq, p, 0);
> >  	check_preempt_curr(env->dst_rq, p, 0);
> > +
> > +	trace_sched_move_task(p, env->src_cpu, env->dst_cpu);
> >  }
> >  
> >  /*
> > 
> 
> 
> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-09  7:09   ` Mel Gorman
@ 2013-12-10 14:25     ` Rik van Riel
  -1 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2013-12-10 14:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML, Paul E. McKenney,
	Peter Zijlstra

On 12/09/2013 02:09 AM, Mel Gorman wrote:

After reading the locking thread that Paul McKenney started,
I wonder if I got the barriers wrong in these functions...

> +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
> +/*
> + * Memory barriers to keep this state in sync are graciously provided by
> + * the page table locks, outside of which no page table modifications happen.
> + * The barriers below prevent the compiler from re-ordering the instructions
> + * around the memory barriers that are already present in the code.
> + */
> +static inline bool tlb_flush_pending(struct mm_struct *mm)
> +{
> +	barrier();

Should this be smp_mb__after_unlock_lock(); ?

> +	return mm->tlb_flush_pending;
> +}
> +static inline void set_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	mm->tlb_flush_pending = true;
> +	barrier();
> +}
> +/* Clearing is done after a TLB flush, which also provides a barrier. */
> +static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	barrier();
> +	mm->tlb_flush_pending = false;
> +}

And these smp_mb__before_spinlock() ?

Paul? Peter?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-10 14:25     ` Rik van Riel
@ 2013-12-10 17:19       ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-10 17:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML, Paul E. McKenney,
	Peter Zijlstra

On Tue, Dec 10, 2013 at 09:25:39AM -0500, Rik van Riel wrote:
> On 12/09/2013 02:09 AM, Mel Gorman wrote:
> 
> After reading the locking thread that Paul McKenney started,
> I wonder if I got the barriers wrong in these functions...
> 

If Documentation/memory-barriers.txt could not be used to frighten small
children before, it certainly can now.

> > +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
> > +/*
> > + * Memory barriers to keep this state in sync are graciously provided by
> > + * the page table locks, outside of which no page table modifications happen.
> > + * The barriers below prevent the compiler from re-ordering the instructions
> > + * around the memory barriers that are already present in the code.
> > + */
> > +static inline bool tlb_flush_pending(struct mm_struct *mm)
> > +{
> > +	barrier();
> 
> Should this be smp_mb__after_unlock_lock(); ?
> 

I think this is still ok. Minimally, it's missing the unlock/lock pair that
would cause smp_mb__after_unlock_lock() to be treated as a full barrier
on architectures that care. The CPU executing this code has already seen
the pmd_numa update if it's in the fault handler so it just needs to be
sure to not reorder the check with respect to the page copy.

> > +	return mm->tlb_flush_pending;
> > +}
> > +static inline void set_tlb_flush_pending(struct mm_struct *mm)
> > +{
> > +	mm->tlb_flush_pending = true;
> > +	barrier();
> > +}

That now needs an smp_mb__before_spinlock() to guarantee that the store to
mm->tlb_flush_pending does not leak into the section updating the page
tables and get re-ordered. The result would pair with tlb_flush_pending
to guarantee that a pagetable update that starts in parallel will be
visible to flush the TLB before the copy starts.

> > +/* Clearing is done after a TLB flush, which also provides a barrier. */
> > +static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> > +{
> > +	barrier();
> > +	mm->tlb_flush_pending = false;
> > +}
> 

This should be ok. Stores updating the page tables complete before the ptl
unlock, and the TLB flush itself is a barrier that guarantees that this
update takes place afterwards.

Peter/Paul?

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c122bb1..33e5519 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm)
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
 	mm->tlb_flush_pending = true;
-	barrier();
+
+	/*
+	 * Guarantee that the tlb_flush_pending store does not leak into the
+	 * critical section updating the page tables
+	 */
+	smp_mb__before_spinlock();
 }
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void clear_tlb_flush_pending(struct mm_struct *mm)
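
To make the intended pairing of these helpers easier to follow, here is a
schematic sketch. The function and variable names below are made up and the
locking is simplified; the real call sites are the mprotect path and the
NUMA hinting fault path, and this is not the actual code:

/* Writer side, loosely modelled on change_protection_range() */
static void numa_protect_range_sketch(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;

	/* Must be visible before any pte is rewritten under the ptl */
	set_tlb_flush_pending(mm);

	/* ... take the ptl, rewrite ptes to pte_numa, drop the ptl ... */

	flush_tlb_range(vma, start, end);

	/* Only after the flush; stale TLB entries can no longer allow writes */
	clear_tlb_flush_pending(mm);
}

/* Reader side, loosely modelled on the NUMA hinting fault before migration */
static bool numa_migration_safe_sketch(struct mm_struct *mm, spinlock_t *ptl)
{
	bool safe;

	spin_lock(ptl);
	/*
	 * If a protection change is still in flight, another CPU may be
	 * writing the page through a stale TLB entry, so copying it for
	 * migration could lose data; back off (or flush) instead.
	 */
	safe = !tlb_flush_pending(mm);
	spin_unlock(ptl);

	return safe;
}

The point of smp_mb__before_spinlock() above is that the writer's store to
tlb_flush_pending must be visible before any pte_numa pte it sets, otherwise
a CPU that faults on the new pte could still see the flag clear and start
the copy while stale TLB entries allow writes to the page.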

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-10 17:19       ` Mel Gorman
@ 2013-12-10 18:02         ` Paul E. McKenney
  -1 siblings, 0 replies; 91+ messages in thread
From: Paul E. McKenney @ 2013-12-10 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Andrew Morton, Alex Thorlton, Linux-MM, LKML,
	Peter Zijlstra

On Tue, Dec 10, 2013 at 05:19:36PM +0000, Mel Gorman wrote:
> On Tue, Dec 10, 2013 at 09:25:39AM -0500, Rik van Riel wrote:
> > On 12/09/2013 02:09 AM, Mel Gorman wrote:
> > 
> > After reading the locking thread that Paul McKenney started,
> > I wonder if I got the barriers wrong in these functions...
> 
> If Documentation/memory-barriers.txt could not be used to frighten small
> children before, it certainly can now.

Depends on the children.  Some might find it quite attractive, sort of
like running while carrying a knife.

> > > +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
> > > +/*
> > > + * Memory barriers to keep this state in sync are graciously provided by
> > > + * the page table locks, outside of which no page table modifications happen.
> > > + * The barriers below prevent the compiler from re-ordering the instructions
> > > + * around the memory barriers that are already present in the code.
> > > + */
> > > +static inline bool tlb_flush_pending(struct mm_struct *mm)
> > > +{
> > > +	barrier();
> > 
> > Should this be smp_mb__after_unlock_lock(); ?
> 
> I think this is still ok. Minimally, it's missing the unlock/lock pair that
> would cause smp_mb__after_unlock_lock() to be treated as a full barrier
> on architectures that care. The CPU executing this code has already seen
> the pmd_numa update if it's in the fault handler so it just needs to be
> sure to not reorder the check with respect to the page copy.

You really do need a lock operation somewhere shortly before the
smp_mb__after_unlock_lock().

> > > +	return mm->tlb_flush_pending;
> > > +}
> > > +static inline void set_tlb_flush_pending(struct mm_struct *mm)
> > > +{
> > > +	mm->tlb_flush_pending = true;
> > > +	barrier();
> > > +}
> 
> That now needs an smp_mb__before_spinlock() to guarantee that the store to
> mm->tlb_flush_pending does not leak into the section updating the page
> tables and get re-ordered. The result would pair with tlb_flush_pending
> to guarantee that a pagetable update that starts in parallel will be
> visible to flush the TLB before the copy starts.

That would be required even if UNLOCK+LOCK continued being a full barrier.
A lock acquisition by itself never was guaranteed to be a full barrier.

							Thanx, Paul

> > > +/* Clearing is done after a TLB flush, which also provides a barrier. */
> > > +static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> > > +{
> > > +	barrier();
> > > +	mm->tlb_flush_pending = false;
> > > +}
> > 
> 
> This should be ok. Stores updating page tables complete before the ptl
> unlock in addition to the TLB flush itself being a barrier that
> guarantees the this update takes place afterwards.
> 
> Peter/Paul?
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c122bb1..33e5519 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm)
>  static inline void set_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	mm->tlb_flush_pending = true;
> -	barrier();
> +
> +	/*
> +	 * Guarantee that the tlb_flush_pending store does not leak into the
> +	 * critical section updating the page tables
> +	 */
> +	smp_mb_before_spinlock();
>  }
>  /* Clearing is done after a TLB flush, which also provides a barrier. */
>  static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-10 18:02         ` Paul E. McKenney
@ 2013-12-11 11:21           ` Mel Gorman
  -1 siblings, 0 replies; 91+ messages in thread
From: Mel Gorman @ 2013-12-11 11:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rik van Riel, Andrew Morton, Alex Thorlton, Linux-MM, LKML,
	Peter Zijlstra

On Tue, Dec 10, 2013 at 10:02:08AM -0800, Paul E. McKenney wrote:
> > > Should this be smp_mb__after_unlock_lock(); ?
> > 
> > I think this is still ok. Minimally, it's missing the unlock/lock pair that
> > would cause smp_mb__after_unlock_lock() to be treated as a full barrier
> > on architectures that care. The CPU executing this code has already seen
> > the pmd_numa update if it's in the fault handler so it just needs to be
> > sure to not reorder the check with respect to the page copy.
> 
> You really do need a lock operation somewhere shortly before the
> smp_mb__after_unlock_lock().
> 

My badly phrased point was that there was no unlock/lock operation nearby
that needs to be ordered with respect to the tlb_flush_pending check. I
do not see a need for smp_mb__after_unlock_lock() here and just this
hunk is required.

> > index c122bb1..33e5519 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm)
> >  static inline void set_tlb_flush_pending(struct mm_struct *mm)
> >  {
> >  	mm->tlb_flush_pending = true;
> > -	barrier();
> > +
> > +	/*
> > +	 * Guarantee that the tlb_flush_pending store does not leak into the
> > +	 * critical section updating the page tables
> > +	 */
> > +	smp_mb__before_spinlock();
> >  }
> >  /* Clearing is done after a TLB flush, which also provides a barrier. */
> >  static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> > 

A double check would be nice please.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2013-12-11 11:21 UTC | newest]

Thread overview: 91+ messages
2013-12-09  7:08 [PATCH 00/18] NUMA balancing segmentation fault fixes and misc followups v3 Mel Gorman
2013-12-09  7:08 ` [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
2013-12-09 14:08   ` Rik van Riel
2013-12-09  7:08 ` [PATCH 02/18] mm: numa: Call MMU notifiers on " Mel Gorman
2013-12-09 14:09   ` Rik van Riel
2013-12-09  7:08 ` [PATCH 03/18] mm: Clear pmd_numa before invalidating Mel Gorman
2013-12-09 14:14   ` Rik van Riel
2013-12-09  7:08 ` [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
2013-12-09 14:22   ` Rik van Riel
2013-12-09  7:08 ` [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
2013-12-09 14:31   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
2013-12-09 14:34   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
2013-12-09 14:42   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 08/18] sched: numa: Skip inaccessible VMAs Mel Gorman
2013-12-09 14:50   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Mel Gorman
2013-12-09 15:57   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
2013-12-09 16:10   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Mel Gorman
2013-12-10 14:25   ` Rik van Riel
2013-12-10 17:19     ` Mel Gorman
2013-12-10 18:02       ` Paul E. McKenney
2013-12-11 11:21         ` Mel Gorman
2013-12-09  7:09 ` [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Mel Gorman
2013-12-09 16:13   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static Mel Gorman
2013-12-09  7:20   ` Wanpeng Li
     [not found]   ` <20131209072010.GA3716@hacker.(null)>
2013-12-09  8:46     ` Mel Gorman
2013-12-09  8:57       ` Wanpeng Li
     [not found]       ` <20131209085720.GA16251@hacker.(null)>
2013-12-09  9:08         ` Mel Gorman
2013-12-09  9:13           ` Wanpeng Li
2013-12-09 16:14   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting Mel Gorman
2013-12-09 16:47   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 15/18] mm: numa: Trace tasks that fail migration due to " Mel Gorman
2013-12-09 16:57   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages Mel Gorman
2013-12-09 16:57   ` Rik van Riel
2013-12-09  7:09 ` [PATCH 17/18] sched: Tracepoint task movement Mel Gorman
2013-12-09 18:54   ` Rik van Riel
2013-12-10  8:42     ` Mel Gorman
2013-12-10  9:06     ` Andrew Jones
2013-12-09  7:09 ` [PATCH 18/18] sched: Add tracepoints related to NUMA task migration Mel Gorman
2013-12-09 19:06   ` Rik van Riel
