* [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
@ 2012-11-13 11:12 Mel Gorman
  2012-11-13 11:12 ` [PATCH 01/31] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
                   ` (31 more replies)
  0 siblings, 32 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

(Since I wrote this changelog there has been another release of schednuma.
I had delayed releasing this series long enough and decided not to delay
further. Of course, I plan to dig into that new revision and see what
has changed.)

This is V2 of the series which attempts to layer parts of autonuma's
placement policy on top of the balancenuma foundation. Unfortunately a few
bugs were discovered very late in the foundation. This forced me to discard
all test results and a number of patches which I could no longer depend on
as a result of the bugs. I'll have to redo and resend later but decided to
send this series as-is as it had been delayed enough already.  This series
is still very much a WIP but I wanted to show where things currently stand
in terms of pulling material from both schednuma and autonuma.

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two competing approaches to implement support for
automatically migrating pages to optimise NUMA locality.  Performance results
are available for both but review highlighted different problems in each.
They are not compatible with each other even though some fundamental
mechanics should have been the same.

This series addresses part of the integration and sharing problem
by implementing a foundation that either the policy for schednuma or
autonuma can be rebased on. The initial policy it implements is a very
basic greedy policy called "Migrate On Reference Of pte_numa Node (MORON)",
which is later in the series replaced by a variation of the home-node policy.
I expect to build upon this revised policy and eventually rename it to
something more sensible that reflects what it does.
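
As an aside, and purely for illustration, the greedy decision MORON makes
on a NUMA hinting fault can be sketched in a few lines of C. The helper
below is hypothetical and does not correspond to a function in the series:

/*
 * Illustrative sketch only, not kernel code: on a NUMA hinting fault,
 * migrate the page towards the node of the CPU that referenced it
 * unless it is already there.
 */
static int moron_target_node(int page_nid, int referencing_cpu_nid)
{
	if (page_nid == referencing_cpu_nid)
		return -1;		/* already local, do nothing */
	return referencing_cpu_nid;	/* greedy: follow the reference */
}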

Patches 1-3 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ.

Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonnuma and
	friends. They implement them for x86, handle GUP and preserve
	the _PAGE_NUMA bit across THP splits.

Patch 8 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patches 9-11 add a migrate-on-fault mode that applications can specifically
	ask for. Applications can take advantage of this if they wish. It
	also means that if automatic balancing were broken for some workload,
	the application could disable the automatic stuff but still
	get some advantage.

Patch 12 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patch 13 migrates the page on fault if mpol_misplaced() says to do so.

Patch 14 adds an MPOL_MF_LAZY mempolicy flag that an interested application
	can use. On the next reference, the memory should be migrated to the
	node that references it.

Patch 15 sets pte_numa within the context of the scheduler.

Patch 16 avoids calling task_numa_placement if the page is not misplaced
	because, later in the series, it becomes a very heavy function.

Patch 17 tries to avoid double faulting after migrating a page.

Patches 18-19 note that marking the whole address space pte_numa at once has
	a number of disadvantages and instead incrementally update a limited
	range of the address space each tick.

Patch 20 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 21 implements the MORON policy.

Patches 22-24 bring in some TLB flush reduction patches. It was pointed
	out that try_to_unmap_one still incurs a TLB flush and this is true.
	An initial patch to cover this looked promising but was suspected
	of a stability issue. It was likely triggered by another corruption
	bug that has since been fixed and needs to be revisited.

Patches 25-28 introduce the concept of a home-node that the scheduler tries
	to keep processes on. It's advisory only and not particularly strict.
	There may be a problem with this whereby the load balancer is not
	pushing processes back to their home node because there are no
	idle CPUs available. It might need to be more aggressive about
	swapping two tasks that are both running off their home node.

Patch 29 implements a CPU-follows-memory policy. It builds statistics
	on faults on a per-task and per-mm basis and uses them to decide
	if a task's home node should be updated.

Patches 30-31 introduce last_nid and use it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.

Some notes.

The MPOL_MF_LAZY interface is still exposed to userspace. It has been asked
that this be dropped until the series has solidified. I'm happy to do this but
kept it in this release. If I hear no objections I'll drop it in the next
release.

The series is still missing a mechanism for disabling the feature from the
command line.

Documentation is sorely missing at this point.

Although the results the observation is based on are unusable, I noticed
one interesting thing in the profiles: how the time spent in
mutex_spin_on_owner() changes, which is ordinarily a sensible heuristic. On
autonumabench NUMA01_THREADLOCAL, the patches spend more time spinning in
there and more time in intel_idle, implying that other users are waiting for
the pte_numa updates to complete. In the autonumabench cases, the other
contender could be khugepaged. In the specjbb case there is also a lot of
spinning and it could be due to the JVM calling mprotect(). One way or the
other, it needs to be pinned down whether the pte_numa updates are the
problem and, if so, how we might work around the requirement to hold
mmap_sem while the pte_numa update takes place.

 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/include/asm/pgtable.h       |   65 ++-
 arch/x86/include/asm/pgtable_types.h |   20 +
 arch/x86/mm/gup.c                    |   13 +-
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |   12 +
 include/linux/huge_mm.h              |   10 +
 include/linux/init_task.h            |    8 +
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   21 +-
 include/linux/mm.h                   |   33 ++
 include/linux/mm_types.h             |   44 ++
 include/linux/sched.h                |   52 +++
 include/linux/vm_event_item.h        |   12 +-
 include/trace/events/migrate.h       |   51 +++
 include/uapi/linux/mempolicy.h       |   24 +-
 init/Kconfig                         |   14 +
 kernel/fork.c                        |   18 +
 kernel/sched/core.c                  |   60 ++-
 kernel/sched/debug.c                 |    3 +
 kernel/sched/fair.c                  |  743 ++++++++++++++++++++++++++++++++--
 kernel/sched/features.h              |   25 ++
 kernel/sched/sched.h                 |   36 ++
 kernel/sysctl.c                      |   38 +-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   53 +++
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  167 +++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  360 ++++++++++++++--
 mm/migrate.c                         |  130 +++++-
 mm/page_alloc.c                      |    5 +-
 mm/pgtable-generic.c                 |    6 +-
 mm/vmstat.c                          |   16 +-
 34 files changed, 1985 insertions(+), 92 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* [PATCH 01/31] mm: compaction: Move migration fail/success stats to migrate.c
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 02/31] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but it's at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    4 +++-
 mm/compaction.c               |    4 ----
 mm/migrate.c                  |    6 ++++++
 mm/vmstat.c                   |    7 ++++---
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..8aa7cb9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,8 +38,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MIGRATION
+		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
+#endif
 #ifdef CONFIG_COMPACTION
-		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..00ad883 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,10 +994,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
-		count_vm_event(COMPACTBLOCKS);
-		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
-		if (nr_remaining)
-			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
 		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
 						nr_remaining);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..04687f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -962,6 +962,7 @@ int migrate_pages(struct list_head *from,
 {
 	int retry = 1;
 	int nr_failed = 0;
+	int nr_succeeded = 0;
 	int pass = 0;
 	struct page *page;
 	struct page *page2;
@@ -988,6 +989,7 @@ int migrate_pages(struct list_head *from,
 				retry++;
 				break;
 			case 0:
+				nr_succeeded++;
 				break;
 			default:
 				/* Permanent failure */
@@ -998,6 +1000,10 @@ int migrate_pages(struct list_head *from,
 	}
 	rc = 0;
 out:
+	if (nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (nr_failed)
+		count_vm_events(PGMIGRATE_FAIL, nr_failed);
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..89a7fd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,10 +774,11 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_MIGRATION
+	"pgmigrate_success",
+	"pgmigrate_fail",
+#endif
 #ifdef CONFIG_COMPACTION
-	"compact_blocks_moved",
-	"compact_pages_moved",
-	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 02/31] mm: migrate: Add a tracepoint for migrate_pages
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
  2012-11-13 11:12 ` [PATCH 01/31] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 03/31] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.
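
Not part of the patch, but as a usage sketch: once this series is applied,
the event can be consumed from userspace through the tracing files. The
debugfs mount point below is an assumption and may differ on a given system.

#include <stdio.h>

/*
 * Minimal sketch: enable the mm_migrate_pages tracepoint and stream the
 * events. Assumes debugfs is mounted at /sys/kernel/debug.
 */
int main(void)
{
	const char *enable =
		"/sys/kernel/debug/tracing/events/migrate/mm_migrate_pages/enable";
	char line[512];
	FILE *f = fopen(enable, "w");

	if (!f) {
		perror("enable tracepoint");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);

	/* Each line reports nr_succeeded, nr_failed, mode and reason */
	f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!f) {
		perror("trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}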

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h        |   13 ++++++++--
 include/trace/events/migrate.h |   51 ++++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                |    3 ++-
 mm/memory-failure.c            |    3 ++-
 mm/memory_hotplug.c            |    3 ++-
 mm/mempolicy.c                 |    6 +++--
 mm/migrate.c                   |   10 ++++++--
 mm/page_alloc.c                |    3 ++-
 8 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/migrate.h

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9d1c159 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,15 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+enum migrate_reason {
+	MR_COMPACTION,
+	MR_MEMORY_FAILURE,
+	MR_MEMORY_HOTPLUG,
+	MR_SYSCALL,		/* also applies to cpusets */
+	MR_MEMPOLICY_MBIND,
+	MR_CMA
+};
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
@@ -14,7 +23,7 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			enum migrate_mode mode);
+			enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
@@ -35,7 +44,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
new file mode 100644
index 0000000..ec2a6cc
--- /dev/null
+++ b/include/trace/events/migrate.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM migrate
+
+#if !defined(_TRACE_MIGRATE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MIGRATE_H
+
+#define MIGRATE_MODE						\
+	{MIGRATE_ASYNC,		"MIGRATE_ASYNC"},		\
+	{MIGRATE_SYNC_LIGHT,	"MIGRATE_SYNC_LIGHT"},		\
+	{MIGRATE_SYNC,		"MIGRATE_SYNC"}		
+
+#define MIGRATE_REASON						\
+	{MR_COMPACTION,		"compaction"},			\
+	{MR_MEMORY_FAILURE,	"memory_failure"},		\
+	{MR_MEMORY_HOTPLUG,	"memory_hotplug"},		\
+	{MR_SYSCALL,		"syscall_or_cpuset"},		\
+	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
+	{MR_CMA,		"cma"}
+
+TRACE_EVENT(mm_migrate_pages,
+
+	TP_PROTO(unsigned long succeeded, unsigned long failed,
+		 enum migrate_mode mode, int reason),
+
+	TP_ARGS(succeeded, failed, mode, reason),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,		succeeded)
+		__field(	unsigned long,		failed)
+		__field(	enum migrate_mode,	mode)
+		__field(	int,			reason)
+	),
+
+	TP_fast_assign(
+		__entry->succeeded	= succeeded;
+		__entry->failed		= failed;
+		__entry->mode		= mode;
+		__entry->reason		= reason;
+	),
+
+	TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s",
+		__entry->succeeded,
+		__entry->failed,
+		__print_symbolic(__entry->mode, MIGRATE_MODE),
+		__print_symbolic(__entry->reason, MIGRATE_REASON))
+);
+
+#endif /* _TRACE_MIGRATE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 00ad883..2c077a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
+				MR_COMPACTION);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..ddb68a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1558,7 +1558,8 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_MEMORY_FAILURE);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 56b758a..af60ce7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -819,7 +819,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * migrate_pages returns # of failed pages.
 		 */
 		ret = migrate_pages(&source, alloc_migrate_target, 0,
-							true, MIGRATE_SYNC);
+							true, MIGRATE_SYNC,
+							MR_MEMORY_HOTPLUG);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..66e90ec 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -961,7 +961,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1202,7 +1203,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						false, MIGRATE_SYNC,
+						MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 04687f6..27be9c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,6 +38,9 @@
 
 #include <asm/tlbflush.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/migrate.h>
+
 #include "internal.h"
 
 /*
@@ -958,7 +961,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+		enum migrate_mode mode, int reason)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1004,6 +1007,8 @@ out:
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
 	if (nr_failed)
 		count_vm_events(PGMIGRATE_FAIL, nr_failed);
+	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
@@ -1145,7 +1150,8 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, MIGRATE_SYNC);
+				(unsigned long)pm, 0, MIGRATE_SYNC,
+				MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..4681fc4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5707,7 +5707,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages,
 				    alloc_migrate_target,
-				    0, false, MIGRATE_SYNC);
+				    0, false, MIGRATE_SYNC,
+				    MR_CMA);
 	}
 
 	putback_lru_pages(&cc->migratepages);
-- 
1.7.9.2



* [PATCH 03/31] mm: compaction: Add scanned and isolated counters for compaction
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
  2012-11-13 11:12 ` [PATCH 01/31] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
  2012-11-13 11:12 ` [PATCH 02/31] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 04/31] mm: numa: define _PAGE_NUMA Mel Gorman
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Compaction already has tracepoints to count scanned and isolated pages
but they require that ftrace be enabled and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates how much work compaction is doing and can be
compared with an oprofile showing TLB misses to see whether, for example,
the cost of compaction is being offset by THP. Minimally, a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this:

Fundamental unit u:	a word	sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
	where Wi is a constant that should reflect the approximate
	cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free    scanning = Ca

Overall cost =	(Csm * compact_migrate_scanned) +
	      	(Csf * compact_free_scanned)    +
	      	(Ci  * compact_isolated)	+
		(Cmc * pgmigrate_success)	+
		(Cmf * pgmigrate_fail)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.
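
As an illustration only, the model can be evaluated from userspace with
something like the sketch below. The assumed sizeof(struct page) and the
Wi constant are placeholders that need to be adjusted for the kernel being
measured; the counter names are the ones added by this and the preceding
patches.

#include <stdio.h>
#include <string.h>

#define U		((double)sizeof(void *))
#define PAGE_SZ		4096.0
#define STRUCT_PAGE_SZ	64.0			/* assumed sizeof(struct page) */
#define CA		(STRUCT_PAGE_SZ / U)	/* struct page access */
#define WI		1.0			/* assumed locking constant */
#define CMC		((CA + PAGE_SZ / U) * 2)	/* migrate page copy */
#define CMF		(CA * 2)		/* migrate failure */
#define CI		(CA + WI)		/* page isolation */
#define CSM		CA			/* migrate scanning */
#define CSF		CA			/* free scanning */

/* Return the value of a named counter from /proc/vmstat, or 0 if absent */
static double vmstat(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	double val = 0, found = 0;

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lf", key, &val) == 2) {
		if (!strcmp(key, name)) {
			found = val;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(void)
{
	double cost = CSM * vmstat("compact_migrate_scanned") +
		      CSF * vmstat("compact_free_scanned") +
		      CI  * vmstat("compact_isolated") +
		      CMC * vmstat("pgmigrate_success") +
		      CMF * vmstat("pgmigrate_fail");

	printf("approximate compaction cost: %.0f word accesses\n", cost);
	return 0;
}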

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/compaction.c               |    8 ++++++++
 mm/vmstat.c                   |    3 +++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
+		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	if (blockpfn == end_pfn)
 		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
+	count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+	if (total_isolated)
+		count_vm_events(COMPACTISOLATED, total_isolated);
+
 	return total_isolated;
 }
 
@@ -646,6 +650,10 @@ next_pageblock:
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+	count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+	if (nr_isolated)
+		count_vm_events(COMPACTISOLATED, nr_isolated);
+
 	return low_pfn;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
 	"pgmigrate_fail",
 #endif
 #ifdef CONFIG_COMPACTION
+	"compact_migrate_scanned",
+	"compact_free_scanned",
+	"compact_isolated",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 04/31] mm: numa: define _PAGE_NUMA
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (2 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 03/31] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 05/31] mm: numa: pte_numa() and pmd_numa() Mel Gorman
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number of _PAGE_PROTNONE (but it
could also use a different bitflag, it's up to the architecture to
decide).

It would be confusing to refer to the "NUMA hinting page faults" as
"do_prot_none faults". They're different events and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive to the code paths executed
by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE
stepping on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte and
pmd are present, so the bitflag picked for _PAGE_NUMA usage must not
be used by the swap entry format.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
-- 
1.7.9.2



* [PATCH 05/31] mm: numa: pte_numa() and pmd_numa()
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (3 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 04/31] mm: numa: define _PAGE_NUMA Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 06/31] mm: numa: teach gup_fast about pmd_numa Mel Gorman
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or be migrated to a different node.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   65 ++++++++++++++++++++++++++++++++++++++--
 include/asm-generic/pgtable.h  |   12 ++++++++
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..e075d57 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
+}
+
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_BALANCE_NUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..896667e 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -554,6 +554,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_BALANCE_NUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
-- 
1.7.9.2



* [PATCH 06/31] mm: numa: teach gup_fast about pmd_numa
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (4 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 05/31] mm: numa: pte_numa() and pmd_numa() Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

When scanning pmds, the pmd may be of numa type (_PAGE_PRESENT not set),
however the pte might be present. Therefore, gup_pmd_range() must return
0 in this case to avoid losing a NUMA hinting page fault during gup_fast.

Note: gup_fast will skip over non present ptes (like numa types), so
no explicit check is needed for the pte_numa case. gup_fast will also
skip over THP when the trans huge pmd is non present. So, the pmd_numa
case will also be correctly skipped with no additional code changes
required.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/gup.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..02c5ec5 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -163,8 +163,19 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * can't because it has irq disabled and
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
+		 *
+		 * The pmd_numa() check is needed because the code
+		 * doesn't check the _PAGE_PRESENT bit of the pmd if
+		 * the gup_pte_range() path is taken. NOTE: not all
+		 * gup_fast users will access the page contents
+		 * using the CPU through the NUMA memory channels like
+		 * KVM does. So we're forced to trigger NUMA hinting
+		 * page faults unconditionally for all gup_fast users
+		 * even though NUMA hinting page faults aren't useful
+		 * to I/O drivers that will access the page with DMA
+		 * and not with the CPU.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
-- 
1.7.9.2



* [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (5 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 06/31] mm: numa: teach gup_fast about pmd_numa Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 17:13   ` Rik van Riel
  2012-11-13 11:12 ` [PATCH 08/31] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
                   ` (24 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3aaf242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,6 +1363,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(*pmd))
+				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.7.9.2



* [PATCH 08/31] mm: numa: Create basic numa page hinting infrastructure
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (6 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 09/31] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: This patch started as "mm/mpol: Create special PROT_NONE
	infrastructure" and preserves the basic idea but steals *very*
	heavily from "autonuma: numa hinting page faults entry points" for
	the actual fault handlers without the migration parts.	The end
	result is barely recognisable as either patch so all Signed-off
	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
	this version, I will re-add the signed-offs-by to reflect the history.

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant; we can then use the 'spurious'
protection faults to drive our migrations.

The meaning of PAGE_NUMA depends on the architecture but on x86 it is
effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
NUMA faults for the reason that the page fault code checks the permission on
the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
before it ever calls handle_mm_fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/huge_mm.h |   10 +++++
 mm/huge_memory.c        |   21 ++++++++++
 mm/memory.c             |   98 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 126 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..a13ebb1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,10 @@ static inline struct page *compound_trans_head(struct page *page)
 	}
 	return page;
 }
+
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				  pmd_t pmd, pmd_t *pmdp);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +199,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 {
 	return 0;
 }
+
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp);
+{
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3aaf242..92a64d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1017,6 +1017,27 @@ out:
 	return page;
 }
 
+/* NUMA hinting page fault entry point for trans huge pmds */
+int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				pmd_t pmd, pmd_t *pmdp)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
+	page = pmd_page(pmd);
+	pmd = pmd_mknonnuma(pmd);
+	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+	VM_BUG_ON(pmd_numa(*pmdp));
+	update_mmu_cache_pmd(vma, addr, ptep);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return 0;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..0d9d539 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3433,6 +3433,89 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	/*
+	* The "pte" at this point cannot be used safely without
+	* validation through pte_unmap_same(). It's of NUMA type but
+	* the pfn may be screwed if the read is non atomic.
+	*
+	* ptep_modify_prot_start is not called as this is clearing
+	* the _PAGE_NUMA bit and it is not really expected that there
+	* would be concurrent hardware modifications to the PTE.
+	*/
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto out_unlock;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	update_mmu_cache(vma, addr, ptep);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte, *orig_pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return 0;
+
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page *page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	return 0;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3471,6 +3554,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(entry))
+		return do_numa_page(mm, vma, address, entry, pte, pmd);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3539,9 +3625,11 @@ retry:
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(*pmd))
+				return do_huge_pmd_numa_page(mm, address,
+							     orig_pmd, pmd);
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3553,10 +3641,14 @@ retry:
 					goto retry;
 				return ret;
 			}
+
 			return 0;
 		}
 	}
 
+	if (pmd_numa(*pmd))
+		return do_pmd_numa_page(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
-- 
1.7.9.2



* [PATCH 09/31] mm: mempolicy: Make MPOL_LOCAL a real policy
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (7 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 08/31] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 10/31] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
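
As a usage sketch (not part of the patch): on a kernel with this change,
an application built against numaif.h could request the policy explicitly
as below. The fallback definition is only needed if the installed headers
predate this patch, and the program links with -lnuma.

#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_LOCAL
#define MPOL_LOCAL 4	/* value from the enum in this patch */
#endif

int main(void)
{
	/* Request local allocation explicitly; the nodemask must be empty */
	if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0) {
		perror("set_mempolicy(MPOL_LOCAL)");
		return 1;
	}
	/* Subsequent allocations by this task prefer the local node */
	return 0;
}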

Requested-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66e90ec..54bd3e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2399,7 +2403,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2452,12 +2455,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 10/31] mm: mempolicy: Add MPOL_MF_NOOP
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (8 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 09/31] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 11/31] mm: mempolicy: Check for misplaced page Mel Gorman
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: I have not yet addressed my own review feedback of this patch. At
	this point I'm trying to construct a baseline tree and will apply
	my own review feedback later and then fold it in.

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind().  When the NOOP policy is used with the MOVE and LAZY
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
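
As a usage sketch (not part of the patch): on a kernel with the full series
applied, an application could request lazy migration for a range without
changing its policy roughly as below. MPOL_NOOP's value follows the enum
added here; MPOL_MF_LAZY is only introduced later in the series, so its
value below is an assumption. Link with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_NOOP
#define MPOL_NOOP 5		/* value from the enum in this patch */
#endif
#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY (1 << 3)	/* assumed value, defined later in the series */
#endif

int main(void)
{
	size_t len = 16UL << 20;
	void *buf = aligned_alloc(4096, len);

	if (!buf)
		return 1;
	/* Retain the current policy; mark the pages so they migrate lazily
	 * on the next touch. */
	if (mbind(buf, len, MPOL_NOOP, NULL, 0,
		  MPOL_MF_MOVE | MPOL_MF_LAZY) != 0) {
		perror("mbind(MPOL_NOOP, MOVE|LAZY)");
		return 1;
	}
	return 0;
}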

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 54bd3e5..c21e914 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1147,7 +1147,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2409,7 +2409,8 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
+	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2460,7 +2461,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX)
+	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 11/31] mm: mempolicy: Check for misplaced page
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (9 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 10/31] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 12/31] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy.  So, I just mimic the alloc_page_vma() node computation
logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any page in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h      |    8 +++++
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   76 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c21e914..df1466d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2181,6 +2181,82 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 12/31] mm: migrate: Introduce migrate_misplaced_page()
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (10 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 11/31] mm: mempolicy: Check for misplaced page Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 13/31] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: This was originally based on Peter's patch "mm/migrate: Introduce
	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
	collection". The end result is barely recognisable so signed-offs
	had to be dropped. If original authors are ok with it, I'll
	re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.
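
Sketch of the intended calling convention (per the comment above
migrate_misplaced_page() in the diff below): the caller takes an extra page
reference which this function is responsible for dropping, and a non-zero
return value means the page was isolated and migrated:

	get_page(page);			/* reference handed to the callee */
	if (migrate_misplaced_page(page, target_nid)) {
		/* page was isolated and migrated; next access refaults */
	} else {
		/* page was left in place (migration declined or failed) */
	}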

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |    8 ++++
 mm/migrate.c            |  104 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9d1c159..69f60b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,6 +13,7 @@ enum migrate_reason {
 	MR_MEMORY_HOTPLUG,
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
+	MR_NUMA_MISPLACED,
 	MR_CMA
 };
 
@@ -39,6 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -72,5 +74,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 27be9c9..4a92808 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -282,7 +282,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
@@ -1415,4 +1415,104 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  	}
  	return err;
 }
-#endif
+
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
+	}
+
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC,
+				MR_NUMA_MISPLACED);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+
+#endif /* CONFIG_NUMA */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 13/31] mm: mempolicy: Use _PAGE_NUMA to migrate pages
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (11 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 12/31] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 14/31] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
	sufficiently different that the signed-off-bys were dropped

Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.
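
Condensed, the base-page fault path added below does roughly the following
(a sketch; PTE locking, refcounting and the pte_same rechecks are elided):

	page = vm_normal_page(vma, addr, pte);
	target_nid = mpol_misplaced(page, vma, addr);
	if (target_nid == -1 || !migrate_misplaced_page(page, target_nid)) {
		/* keep the page where it is: clear _PAGE_NUMA and continue */
		set_pte_at(mm, addr, ptep, pte_mknonnuma(pte));
	}
	/* after a successful migration the stale PTE forces a refault */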

Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    8 ++++----
 mm/huge_memory.c        |   32 +++++++++++++++++++++++++++++---
 mm/memory.c             |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a13ebb1..406f81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,8 +160,8 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				  pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
@@ -200,8 +200,8 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	return 0;
 }
 
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-					pmd_t pmd, pmd_t *pmdp);
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92a64d2..1453c30 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1018,16 +1019,39 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page;
+	struct page *page = NULL;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target_nid;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1)
+		goto clear_pmdnuma;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return 0;
+
+clear_pmdnuma:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
@@ -1035,6 +1059,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 0d9d539..b41f89c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3436,8 +3437,9 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page;
+	struct page *page = NULL;
 	spinlock_t *ptl;
+	int current_nid, target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3452,14 +3454,48 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*ptep, pte)))
 		goto out_unlock;
-	pte = pte_mknonnuma(pte);
-	set_pte_at(mm, addr, ptep, pte);
+
 	page = vm_normal_page(vma, addr, pte);
 	BUG_ON(!page);
+
+	get_page(page);
+	current_nid = page_to_nid(page);
+	target_nid = mpol_misplaced(page, vma, addr);
+	if (target_nid == -1) {
+		/*
+		 * Account for the fault against the current node if it is not
+		 * being replaced regardless of where the page is located.
+		 */
+		current_nid = numa_node_id();
+		goto clear_pmdnuma;
+	}
+	pte_unmap_unlock(ptep, ptl);
+
+	/* Migrate to the requested node */
+	if (migrate_misplaced_page(page, target_nid)) {
+		/*
+		 * If the page was migrated then the pte_same check below is
+		 * guaranteed to fail so just retry the entire fault.
+		 */
+		current_nid = target_nid;
+		goto out;
+	}
+	page = NULL;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!pte_same(*ptep, pte))
+		goto out_unlock;
+
+clear_pmdnuma:
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
+	if (page)
+		put_page(page);
+out:
 	return 0;
 }
 
@@ -3626,7 +3662,7 @@ retry:
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (pmd_numa(*pmd))
-				return do_huge_pmd_numa_page(mm, address,
+				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
 			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 14/31] mm: mempolicy: Add MPOL_MF_LAZY
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (12 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 13/31] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 15/31] mm: numa: Add fault driven placement and migration Mel Gorman
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: Once again there is a lot of patch stealing and the end result
	is sufficiently different that I had to drop the signed-offs.
	Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After PROT_NONE, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
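
For example, a process could request lazy migration of a region roughly as
follows (an illustrative userspace sketch; region, len, nodemask and maxnode
are the usual mbind() arguments, and MPOL_MF_LAZY is new in this series so
it would need updated UAPI headers or a local definition):

	#include <stdio.h>
	#include <numaif.h>	/* mbind(), MPOL_BIND, MPOL_MF_MOVE */

	#ifndef MPOL_MF_LAZY
	#define MPOL_MF_LAZY	(1<<3)	/* matches the value added below */
	#endif

	/* Rebind [region, region+len) to 'nodemask' but defer the actual
	 * page movement until each page is next touched. */
	if (mbind(region, len, MPOL_BIND, &nodemask, maxnode,
		  MPOL_MF_MOVE | MPOL_MF_LAZY) != 0)
		perror("mbind");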

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h             |    3 +
 include/uapi/linux/mempolicy.h |   13 ++-
 mm/mempolicy.c                 |  177 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 175 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..eed70f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1548,6 +1548,9 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+void change_prot_numa(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies MPOL_MF_MOVE*: lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..11052ea 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -566,6 +567,137 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 }
 
 /*
+ * Here we search for non-shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		int page_nid;
+		ret = HPAGE_PMD_NR;
+
+		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page = pmd_page(*pmd);
+
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page_nid = page_to_nid(page);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		ret += HPAGE_PMD_NR;
+		/* defer TLB flush to lower the overhead */
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		goto out;
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+/* Assumes mmap_sem is held */
+void
+change_prot_numa(struct vm_area_struct *vma,
+			unsigned long address, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int progress = 0;
+
+	while (address < end) {
+		VM_BUG_ON(address < vma->vm_start ||
+			  address + PAGE_SIZE > vma->vm_end);
+
+		progress += change_prot_numa_range(mm, vma, address);
+		address = (address + PMD_SIZE) & PMD_MASK;
+	}
+
+	/*
+	 * Flush the TLB for the mm to start the NUMA hinting
+	 * page faults after we finish scanning this vma part
+	 * if there were any PTE updates
+	 */
+	if (progress) {
+		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
+		flush_tlb_range(vma, address, end);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+	}
+}
+
+/*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
  * put them on the pagelist.
@@ -583,22 +715,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_numa(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -606,6 +748,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1138,8 +1281,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1162,6 +1304,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1198,13 +1343,15 @@ static long do_mbind(unsigned long start, unsigned long len,
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
 						false, MIGRATE_SYNC,
@@ -1213,7 +1360,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 15/31] mm: numa: Add fault driven placement and migration
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (13 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 14/31] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages Mel Gorman
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

NOTE: This patch is based on "sched, numa, mm: Add fault driven
	placement and migration policy" but as it throws away all the policy
	to just leave a basic foundation I had to drop the signed-offs-by.

This patch creates a bare-bones method for setting PTEs pte_numa in the
context of the scheduler so that, when they are faulted later, the pages
can be pulled towards the node the faulting CPU is running on.  In itself
this does nothing useful but any
placement policy will fundamentally depend on receiving hints on placement
from fault context and doing something intelligent about it.
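
The mechanism splits into two parts, sketched here (a simplification of the
code in the diff below): a cheap check from the scheduler tick that arms a
task_work item once enough runtime has passed, and the task_work handler
that walks the mm marking migratable VMAs pte_numa:

	/* from task_tick_numa(): runtime-driven trigger */
	if (curr->se.sum_exec_runtime - curr->node_stamp > period) {
		curr->node_stamp = curr->se.sum_exec_runtime;
		task_work_add(curr, &curr->numa_work, true);
	}

	/* from task_numa_work(): rate-limited scan in task context */
	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma_migratable(vma))
			continue;
		change_prot_numa(vma, vma->vm_start, vma->vm_end);
	}
	up_read(&mm->mmap_sem);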

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/sh/mm/Kconfig       |    1 +
 include/linux/mm_types.h |   11 ++++
 include/linux/sched.h    |   20 ++++++++
 init/Kconfig             |   14 ++++++
 kernel/sched/core.c      |   13 +++++
 kernel/sched/fair.c      |  125 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |    7 +++
 kernel/sched/sched.h     |    6 +++
 kernel/sysctl.c          |   24 ++++++++-
 mm/huge_memory.c         |    5 +-
 mm/memory.c              |   15 +++++-
 11 files changed, 237 insertions(+), 4 deletions(-)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..ddbcfe7 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select NUMA_VARIABLE_LOCALITY
 	default n
 	help
 	  Some SH systems have many various memories scattered around
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..d82accb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,6 +398,17 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * numa_next_scan is the next time when the PTEs will be marked
+	 * pte_numa to gather statistics and migrate pages to new nodes
+	 * if necessary
+	 */
+	unsigned long numa_next_scan;
+
+	/* numa_scan_seq prevents two threads setting pte_numa */
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..ac71181 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,14 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
+	struct callback_head numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1553,6 +1561,14 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_BALANCE_NUMA
+extern void task_numa_fault(int node, int pages);
+#else
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_period_min;
+extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..aaba45d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config NUMA_VARIABLE_LOCALITY
+	bool
+
+config BALANCE_NUMA
+	bool "Memory placement aware NUMA scheduler"
+	default n
+	depends on SMP && NUMA && MIGRATION && !NUMA_VARIABLE_LOCALITY
+	help
+	  This option adds support for automatic NUMA aware memory/task placement.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..81fa185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..e8bdaef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,8 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -776,6 +778,126 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_balance_numa_scan_period_min = 5000;
+unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+
+static void task_numa_placement(struct task_struct *p)
+{
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+	if (p->numa_scan_seq == seq)
+		return;
+	p->numa_scan_seq = seq;
+
+	/* FIXME: Scheduling placement policy hints go here */
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+	struct task_struct *p = current;
+
+	/* FIXME: Allocate task-specific structure for placement policy here */
+
+	task_numa_placement(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	if (WARN_ON_ONCE(p->numa_scan_period == 0))
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+
+	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_prot_numa(vma, vma->vm_start, vma->vm_end);
+		}
+		up_read(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#else
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4954,6 +5076,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..7cfd289 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+/*
+ * Apply the automatic NUMA scheduling policy
+ */
+#ifdef CONFIG_BALANCE_NUMA
+SCHED_FEAT(NUMA,	true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..9a43241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..1359f51 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_BALANCE_NUMA
+	{
+		.procname	= "balance_numa_scan_period_min_ms",
+		.data		= &sysctl_balance_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "balance_numa_scan_period_max_ms",
+		.data		= &sysctl_balance_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1453c30..ccff412 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1045,6 +1045,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	split_huge_page(page);
 	put_page(page);
+
 	return 0;
 
 clear_pmdnuma:
@@ -1059,8 +1060,10 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+	if (page) {
 		put_page(page);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index b41f89c..cd348fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,7 +3439,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid, target_nid;
+	int current_nid = -1;
+	int target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3469,6 +3470,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = numa_node_id();
 		goto clear_pmdnuma;
 	}
+
 	pte_unmap_unlock(ptep, ptl);
 
 	/* Migrate to the requested node */
@@ -3496,6 +3498,7 @@ out_unlock:
 	if (page)
 		put_page(page);
 out:
+	task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3531,6 +3534,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
+		int curr_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (addr >= vma->vm_end) {
@@ -3546,6 +3550,15 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
+		/* only check non-shared pages */
+		if (unlikely(page_mapcount(page) != 1))
+			continue;
+		pte_unmap_unlock(pte, ptl);
+
+		curr_nid = page_to_nid(page);
+		task_numa_fault(curr_nid, 1);
+
+		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (14 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 15/31] mm: numa: Add fault driven placement and migration Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 17:58   ` Rik van Riel
  2012-11-13 11:12 ` [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

task_numa_placement is potentially very expensive so limit it to being
called when a page is misplaced. How necessary this is depends on
the placement policy.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |    4 ++--
 kernel/sched/fair.c   |    9 +++++++--
 mm/huge_memory.c      |    2 +-
 mm/memory.c           |    6 ++++--
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac71181..241e4f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1562,9 +1562,9 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_BALANCE_NUMA
-extern void task_numa_fault(int node, int pages);
+extern void task_numa_fault(int node, int pages, bool was_misplaced);
 #else
-static inline void task_numa_fault(int node, int pages)
+static inline void task_numa_fault(int node, int pages, bool was_misplaced)
 {
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8bdaef..9ea13e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -799,13 +799,18 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages)
+void task_numa_fault(int node, int pages, bool misplaced)
 {
 	struct task_struct *p = current;
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
-	task_numa_placement(p);
+	/*
+	 * task_numa_placement can be expensive so only call it if pages were
+	 * misplaced
+	 */
+	if (misplaced)
+		task_numa_placement(p);
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ccff412..833a601 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1062,7 +1062,7 @@ out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (page) {
 		put_page(page);
-		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR, false);
 	}
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index cd348fd..ab9fbcf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3441,6 +3441,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
+	bool misplaced = false;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3470,6 +3471,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = numa_node_id();
 		goto clear_pmdnuma;
 	}
+	misplaced = true;
 
 	pte_unmap_unlock(ptep, ptl);
 
@@ -3498,7 +3500,7 @@ out_unlock:
 	if (page)
 		put_page(page);
 out:
-	task_numa_fault(current_nid, 1);
+	task_numa_fault(current_nid, 1, misplaced);
 	return 0;
 }
 
@@ -3556,7 +3558,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 
 		curr_nid = page_to_nid(page);
-		task_numa_fault(curr_nid, 1);
+		task_numa_fault(curr_nid, 1, false);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (15 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 18:00   ` Rik van Riel
  2012-11-13 11:12 ` [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
                   ` (14 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The pte_same check after a misplaced page is successfully migrated can
never pass, forcing a double fault to fix things up, as pointed out by Rik
van Riel. This was the "safe" option but it is expensive.

This patch uses the migration allocation callback to record the location
of the newly migrated page. If the page is the same when the PTE lock is
reacquired it is assumed that it is safe to complete the pte_numa fault
without incurring a double fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |    4 ++--
 mm/memory.c             |   28 +++++++++++++++++-----------
 mm/migrate.c            |   27 ++++++++++++++++++---------
 3 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 69f60b5..e5ab5db 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -40,7 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern struct page *migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -75,7 +75,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define fail_migrate_page NULL
 
 static inline
-int migrate_misplaced_page(struct page *page, int node)
+struct page *migrate_misplaced_page(struct page *page, int node)
 {
-	return -EAGAIN; /* can't migrate now */
+	return NULL; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index ab9fbcf..73fa203 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3437,7 +3437,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page = NULL;
+	struct page *page = NULL, *newpage = NULL;
 	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
@@ -3476,19 +3476,26 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_unmap_unlock(ptep, ptl);
 
 	/* Migrate to the requested node */
-	if (migrate_misplaced_page(page, target_nid)) {
-		/*
-		 * If the page was migrated then the pte_same check below is
-		 * guaranteed to fail so just retry the entire fault.
-		 */
+	newpage = migrate_misplaced_page(page, target_nid);
+	if (newpage)
 		current_nid = target_nid;
-		goto out;
-	}
 	page = NULL;
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
-	if (!pte_same(*ptep, pte))
-		goto out_unlock;
+
+	/*
+	 * If we failed to migrate, we have to check the PTE has not changed during
+	 * the migration attempt. If it has, retry the fault. If it has migrated,
+	 * relookup the ptep and confirm it's the same page to avoid double faulting.
+	 */
+	if (!newpage) {
+		if (!pte_same(*ptep, pte))
+			goto out_unlock;
+	} else {
+		pte = *ptep;
+		if (!pte_numa(pte) || vm_normal_page(vma, addr, pte) != newpage)
+			goto out_unlock;
+	}
 
 clear_pmdnuma:
 	pte = pte_mknonnuma(pte);
@@ -3499,7 +3506,6 @@ out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 	if (page)
 		put_page(page);
-out:
 	task_numa_fault(current_nid, 1, misplaced);
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 4a92808..631b2c5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,19 +1444,23 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 	return false;
 }
 
+struct misplaced_request
+{
+	int nid;		/* Node to migrate to */
+	struct page *newpage;	/* New location of page */
+};
+
 static struct page *alloc_misplaced_dst_page(struct page *page,
 					   unsigned long data,
 					   int **result)
 {
-	int nid = (int) data;
-	struct page *newpage;
-
-	newpage = alloc_pages_exact_node(nid,
+	struct misplaced_request *req = (struct misplaced_request *)data;
+	req->newpage = alloc_pages_exact_node(req->nid,
 					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
 					  __GFP_NOMEMALLOC | __GFP_NORETRY |
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
-	return newpage;
+	return req->newpage;
 }
 
 /*
@@ -1464,8 +1468,12 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+struct page *migrate_misplaced_page(struct page *page, int node)
 {
+	struct misplaced_request req = {
+		.nid = node,
+		.newpage = NULL,
+	};
 	int isolated = 0;
 	LIST_HEAD(migratepages);
 
@@ -1503,16 +1511,17 @@ int migrate_misplaced_page(struct page *page, int node)
 
 		nr_remaining = migrate_pages(&migratepages,
 				alloc_misplaced_dst_page,
-				node, false, MIGRATE_ASYNC,
+				(unsigned long)&req,
+				false, MIGRATE_ASYNC,
 				MR_NUMA_MISPLACED);
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
-			isolated = 0;
+			req.newpage = NULL;
 		}
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
-	return isolated;
+	return req.newpage;
 }
 
 #endif /* CONFIG_NUMA */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (16 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 19/31] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: The scan period is much larger than it was in the original patch.
	The reason is that system CPU usage went through the roof with a
	sample period of 500ms, but it was unacceptable to have a large
	process stall for excessively long while updating pte_numa. This
	may need to be tuned again if a placement
	policy converges too slowly.

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm then it starts over at the first
address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 2 seconds up to just once per 32 seconds.  The current
sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 32
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
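
In terms of raw rate, the defaults introduced below work out as follows
(illustrative arithmetic only, mirroring the task_numa_work() change):

	/* how much address space one pass covers */
	length = sysctl_balance_numa_scan_size;	/* in MB, default 256 */
	length <<= 20;				/* bytes marked pte_numa per pass */
	/*
	 * With scan periods between 2s and 32s of runtime this bounds the
	 * marking rate at roughly 8..128 MB per second of CPU time consumed.
	 */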

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   61 ++++++++++++++++++++++++++++++++++++----------
 kernel/sysctl.c          |    7 ++++++
 4 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* Restart point for scanning and setting pte_numa */
+	unsigned long numa_scan_offset;
+
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 241e4f7..6b8a14f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ea13e9..6df5620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_BALANCE_NUMA
 /*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
  */
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 2000;
+unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
 
 static void task_numa_placement(struct task_struct *p)
 {
@@ -822,6 +825,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -851,18 +857,47 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_balance_numa_scan_size;
+	length <<= 20;
 
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
+	}
+
+	/*
+	 * It is possible to reach the end of the VMA list but the last few VMAs are
+	 * not guaranteed to be vma_migratable. If they are not, we would find the
+	 * !migratable VMA on the next scan but not reset the scanner to the start
+	 * so check it now.
+	 */
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
 	}
+	mm->numa_scan_offset = offset;
+	up_read(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "balance_numa_scan_size_mb",
+		.data		= &sysctl_balance_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_BALANCE_NUMA */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 19/31] mm: sched: numa: Implement slow start for working set sampling
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (17 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 20/31] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ Note that before the constant per-task WSS sampling rate patch
  the initial scan would happen much later still; in effect that
  patch caused this regression. ]

The theory is that short-lived tasks benefit very little from NUMA
placement: they come and go, and are better off sticking to the node
they were started on. As tasks mature and rebalance to other CPUs and
nodes, their NUMA placement has to change and starts to matter more
and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )

   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms
tunable knob.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |    5 +++++
 kernel/sysctl.c       |    7 +++++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b8a14f..51e2944 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,6 +2006,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81fa185..047e3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,7 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
-	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6df5620..a97498e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -788,6 +788,9 @@ unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
 
+/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
+unsigned int sysctl_balance_numa_scan_delay = 1000;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -924,6 +927,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
+		if (!curr->node_stamp)
+			curr->numa_scan_period = sysctl_balance_numa_scan_period_min;
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d191203..5ee587d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_BALANCE_NUMA
 	{
+		.procname	= "balance_numa_scan_delay_ms",
+		.data		= &sysctl_balance_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_min_ms",
 		.data		= &sysctl_balance_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 20/31] mm: numa: Add pte updates, hinting and migration stats
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (18 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 19/31] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 21/31] mm: numa: Migrate on reference policy Mel Gorman
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u    = basic unit = sizeof(void *)
Ca   = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
	where Cpte is incurred twice for a read and a write and Wlock
	is a constant representing the cost of taking or releasing a
	lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
	where Wi is a constant that should reflect the approximate cost
	of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
	would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
		Cnumahint * numa_hint_faults +
		Ci * numa_pages_migrated +
		Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the
expectation is that the number of remote numa hinting faults reduces to 0.

	convergence = numa_hint_faults_local / numa_hint_faults
		where this is measured for the last N number of
		numa hints recorded. When the workload is fully
		converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.
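
For illustration only, the counters added below could be fed into this
model from userspace. The following sketch is not part of the patch;
the vmstat_read() helper, the 64-byte struct page size, the 4096-byte
PAGE_SIZE and the Wi/Wnuma/Cnumahint values are assumptions picked
purely for the example:

#include <stdio.h>
#include <string.h>

static double vmstat_read(const char *name)
{
	char key[64], line[128];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%63s %lu", key, &val) == 2 &&
		    !strcmp(key, name)) {
			fclose(f);
			return (double)val;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	double u = sizeof(void *);		/* basic unit */
	double Ca = 64.0 / u;			/* assumes sizeof(struct page) == 64 */
	double Cpte = Ca;
	double Cnumahint = 1000.0;		/* "some high constant" */
	double Cpagerw = Ca + 4096.0 / u;	/* assumes PAGE_SIZE == 4096 */
	double Wi = 10.0, Wnuma = 1.2;		/* assumed lock cost and NUMA factor */
	double Ci = Ca + Wi;
	double Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma);

	double updates  = vmstat_read("numa_pte_updates");
	double hints    = vmstat_read("numa_hint_faults");
	double local    = vmstat_read("numa_hint_faults_local");
	double migrated = vmstat_read("numa_pages_migrated");

	double cost = Cpte * updates + Cnumahint * hints +
		      Ci * migrated + Cpagecopy * migrated;

	printf("balancing cost (in units of u): %.0f\n", cost);
	if (hints > 0.0)
		printf("convergence: %.3f\n", local / hints);
	return 0;
}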

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    6 ++++++
 mm/huge_memory.c              |    1 +
 mm/memory.c                   |   12 ++++++++++++
 mm/mempolicy.c                |    5 +++++
 mm/migrate.c                  |    3 ++-
 mm/vmstat.c                   |    6 ++++++
 6 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a1f750b..dded0af 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_BALANCE_NUMA
+		NUMA_PTE_UPDATES,
+		NUMA_HINT_FAULTS,
+		NUMA_HINT_FAULTS_LOCAL,
+		NUMA_PAGE_MIGRATE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 833a601..f45f25b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1033,6 +1033,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
+	count_vm_event(NUMA_HINT_FAULTS);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1)
diff --git a/mm/memory.c b/mm/memory.c
index 73fa203..95c9abb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3457,11 +3457,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pte_same(*ptep, pte)))
 		goto out_unlock;
 
+	count_vm_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	BUG_ON(!page);
 
 	get_page(page);
 	current_nid = page_to_nid(page);
+	if (current_nid == numa_node_id())
+		count_vm_event(NUMA_HINT_FAULTS_LOCAL);
 	target_nid = mpol_misplaced(page, vma, addr);
 	if (target_nid == -1) {
 		/*
@@ -3520,6 +3523,9 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int local_nid = numa_node_id();
+	unsigned long nr_faults = 0;
+	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3566,10 +3572,16 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		curr_nid = page_to_nid(page);
 		task_numa_fault(curr_nid, 1, false);
 
+		nr_faults++;
+		if (curr_nid == local_nid)
+			nr_faults_local++;
+
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
+	count_vm_events(NUMA_HINT_FAULTS, nr_faults);
+	count_vm_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 11052ea..860341e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,6 +583,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long _address, end;
 	spinlock_t *ptl;
 	int ret = 0;
+	int nr_pte_updates = 0;
 
 	VM_BUG_ON(address & ~PAGE_MASK);
 
@@ -626,6 +627,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
 		ret += HPAGE_PMD_NR;
+		nr_pte_updates++;
 		/* defer TLB flush to lower the overhead */
 		spin_unlock(&mm->page_table_lock);
 		goto out;
@@ -652,6 +654,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 
 		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+		nr_pte_updates++;
 
 		/* defer TLB flush to lower the overhead */
 		ret++;
@@ -666,6 +669,8 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 out:
+	if (nr_pte_updates)
+		count_vm_events(NUMA_PTE_UPDATES, nr_pte_updates);
 	return ret;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 631b2c5..a890429 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1517,7 +1517,8 @@ struct page *migrate_misplaced_page(struct page *page, int node)
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
 			req.newpage = NULL;
-		}
+		} else
+			count_vm_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a067fa..cfa386da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,6 +774,12 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_BALANCE_NUMA
+	"numa_pte_updates",
+	"numa_hint_faults",
+	"numa_hint_faults_local",
+	"numa_pages_migrated",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 21/31] mm: numa: Migrate on reference policy
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (19 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 20/31] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 22/31] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

This is the dumbest possible policy that still does something of note.
When a pte_numa fault is taken, the page is migrated immediately to the
node that referenced it. Any replacement policy must at least do better
than this, and in all likelihood this policy regresses normal workloads.
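
As a rough sketch, the decision this policy boils down to is the
following (illustrative only; the real check is the MPOL_F_MORON branch
added to mpol_misplaced() in the diff below, and moron_target_node() is
invented here for the example):

/* Illustrative only -- see the MPOL_F_MORON handling in mpol_misplaced() */
static int moron_target_node(struct page *page)
{
	int curnid = page_to_nid(page);	/* node the page currently lives on */
	int polnid = numa_node_id();	/* node taking the hinting fault */

	/* migrate only when the faulting CPU is on a different node */
	return (curnid != polnid) ? polnid : -1;
}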

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   41 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..b25064f 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 860341e..f2111b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,6 +118,26 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = numa_node_id();
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+
+		/* preferred_node_policy is not initialised early in boot */
+		if (!pol->mode)
+			pol = NULL;
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1704,7 +1724,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -2127,7 +2147,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2401,6 +2421,14 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	default:
 		BUG();
 	}
+
+	/*
+	 * Moronic node selection policy. Migrate the page to the node that is
+	 * currently referencing it
+	 */
+	if (pol->flags & MPOL_F_MORON)
+		polnid = numa_node_id();
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2589,6 +2617,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 22/31] x86: mm: only do a local tlb flush in ptep_set_access_flags()
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (20 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 21/31] mm: numa: Migrate on reference policy Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 23/31] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags() is only ever invoked to set access
flags or add write permission on a PTE.  The write bit is only ever set
together with the dirty bit.

Because we only ever upgrade a PTE, it is safe to skip flushing entries on
remote TLBs. The worst that can happen is a spurious page fault on other
CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally is
(much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 23/31] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (21 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 22/31] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 24/31] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/mm/pgtable.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 24/31] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (22 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 23/31] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 25/31] sched: numa: Introduce tsk_home_node() Mel Gorman
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU occasionally incur a spurious page fault is faster
than always paying the cost of a remote TLB flush, so replace the
remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.
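
For reference, architectures that do not provide their own version fall
back to the asm-generic definition, which at this point is roughly the
following, i.e. their behaviour is unchanged by this patch:

/* include/asm-generic/pgtable.h (abridged) */
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif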

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 25/31] sched: numa: Introduce tsk_home_node()
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (23 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 24/31] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 26/31] sched: numa: Make mempolicy home-node aware Mel Gorman
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Introduce the home-node concept for tasks. In order to keep memory
locality we need something to stay local to; we define the home-node
of a task as the node we prefer to allocate memory from and prefer to
execute on.

These are not hard guarantees, merely soft preferences. This allows
for optimal resource usage: we can run a task away from its home-node
because the remote memory hit -- while expensive -- is less expensive
than not running at all, or running very little, due to severe CPU
overload.

Similarly, we can allocate memory from another node if our home-node
is depleted; again, some memory is better than no memory.

This patch merely introduces the basic infrastructure, all policy
comes later.
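
As a hypothetical example of how a later placement policy might consume
these hooks (decide_home_node() does not exist in this series and is
made up purely for illustration):

/* Hypothetical consumer; decide_home_node() is not part of this series */
static void rehome_task(struct task_struct *p)
{
	int node = decide_home_node(p);		/* policy comes later */

	if (node != -1 && node != tsk_home_node(p))
		sched_setnode(p, node);		/* soft preference only */
}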

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/sched/core.c       |   36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..fdf0692 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_BALANCE_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.home_node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 51e2944..2677f22 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,7 @@ struct task_struct {
 	short pref_node_fork;
 #endif
 #ifdef CONFIG_BALANCE_NUMA
+	int home_node;
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
@@ -1569,6 +1570,15 @@ static inline void task_numa_fault(int node, int pages, bool was_misplaced)
 }
 #endif
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_BALANCE_NUMA
+	return p->home_node;
+#else
+	return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 047e3c7..55dcf53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5972,6 +5972,42 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_BALANCE_NUMA
+
+/*
+ * Requeues a task ensuring its on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- its all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->home_node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_BALANCE_NUMA */
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 26/31] sched: numa: Make mempolicy home-node aware
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (24 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 25/31] sched: numa: Introduce tsk_home_node() Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 18:22   ` Rik van Riel
  2012-11-13 11:12 ` [PATCH 27/31] sched: numa: Make find_busiest_queue() a method Mel Gorman
                   ` (5 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add another layer of fallback policy to make the home node concept
useful from a memory allocation PoV.

This changes the mpol order to:

 - vma->vm_ops->get_policy	[if applicable]
 - vma->vm_policy		[if applicable]
 - task->mempolicy
 - tsk_home_node() preferred	[NEW]
 - default_policy

Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
facilitate efficient on-demand memory migration.
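
Spelled out as a sketch (simplified from the get_task_policy() and
get_vma_policy() changes below; effective_policy() is invented for the
example, and shared-policy details, refcounting and the early-boot
check are omitted):

static struct mempolicy *effective_policy(struct task_struct *tsk,
					  struct vm_area_struct *vma,
					  unsigned long addr)
{
	struct mempolicy *pol = NULL;

	if (vma && vma->vm_ops && vma->vm_ops->get_policy)
		pol = vma->vm_ops->get_policy(vma, addr);
	else if (vma)
		pol = vma->vm_policy;

	if (!pol)
		pol = tsk->mempolicy;

	if (!pol) {
		int node = tsk_home_node(tsk);		/* NEW fallback */

		if (node == -1)
			node = numa_node_id();
		pol = &preferred_node_policy[node];	/* MOF enabled */
	}

	return pol ? pol : &default_policy;
}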

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    9 ++++++++-
 mm/mempolicy.c                 |   30 ++++++++++++++++++++----------
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index b25064f..bc7b611 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -69,7 +69,14 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
-#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
+#define MPOL_F_HOME	(1 << 4) /*
+				  * Migrate towards referencing node.
+				  * By building up stats on faults, the
+				  * scheduler will reinforce the choice
+				  * by identifying a home node and
+				  * queueing the task on that node
+				  * where possible.
+				  */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f2111b7..076f8f8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -126,9 +126,10 @@ static struct mempolicy *get_task_policy(struct task_struct *p)
 	int node;
 
 	if (!pol) {
-		node = numa_node_id();
-		if (node != -1)
-			pol = &preferred_node_policy[node];
+		node = tsk_home_node(p);
+		if (node == -1)
+			node = numa_node_id();
+		pol = &preferred_node_policy[node];
 
 		/* preferred_node_policy is not initialised early in boot */
 		if (!pol->mode)
@@ -2422,12 +2423,21 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		BUG();
 	}
 
-	/*
-	 * Moronic node selection policy. Migrate the page to the node that is
-	 * currently referencing it
-	 */
-	if (pol->flags & MPOL_F_MORON)
-		polnid = numa_node_id();
+	/* Migrate pages towards their home node or the referencing node */
+	if (pol->flags & MPOL_F_HOME) {
+		/*
+		 * Make a placement decision based on the home node.
+		 * NOTE: Potentially this can result in a remote->remote
+		 * copy but it's not migrated now the numa_fault will
+		 * be lost or accounted for incorrectly making it a rock
+		 * and a hard place.
+		 */
+		polnid = tsk_home_node(current);
+		if (polnid == -1) {
+			/* No home node, migrate to the referencing node */
+			polnid = numa_node_id();
+		}
+	}
 
 	if (curnid != polnid)
 		ret = polnid;
@@ -2621,7 +2631,7 @@ void __init numa_policy_init(void)
 		preferred_node_policy[nid] = (struct mempolicy) {
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
-			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.flags = MPOL_F_MOF | MPOL_F_HOME,
 			.v = { .preferred_node = nid, },
 		};
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 27/31] sched: numa: Make find_busiest_queue() a method
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (25 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 26/31] sched: numa: Make mempolicy home-node aware Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 18:25   ` Rik van Riel
  2012-11-13 11:12 ` [PATCH 28/31] sched: numa: Implement home-node awareness Mel Gorman
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

It's a bit awkward, but it was the least painful means of modifying the
queue selection. The next patch uses it to conditionally select a
different queue.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a97498e..9c242e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3230,6 +3230,9 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
 };
 
 /*
@@ -4403,13 +4406,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
 
 	struct lb_env env = {
-		.sd		= sd,
-		.dst_cpu	= this_cpu,
-		.dst_rq		= this_rq,
-		.dst_grpmask    = sched_group_cpus(sd->groups),
-		.idle		= idle,
-		.loop_break	= sched_nr_migrate_break,
-		.cpus		= cpus,
+		.sd		    = sd,
+		.dst_cpu	    = this_cpu,
+		.dst_rq		    = this_rq,
+		.dst_grpmask        = sched_group_cpus(sd->groups),
+		.idle		    = idle,
+		.loop_break	    = sched_nr_migrate_break,
+		.cpus		    = cpus,
+		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4428,7 +4432,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(&env, group);
+	busiest = env.find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 28/31] sched: numa: Implement home-node awareness
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (26 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 27/31] sched: numa: Make find_busiest_queue() a method Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 29/31] sched: numa: CPU follows memory Mel Gorman
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: Entirely based on "sched, numa, mm: Implement home-node awareness"
	but only a subset of it. There was material in there that was
	disabled by default and generally did slightly more than what I
	felt was necessary at this stage. In particular, the random queue
	selection logic is gone because it looks broken, but its removal
	does mean that the last CPU in a node may see increased scheduling
	pressure, which is almost certainly the wrong thing to do and
	needs re-examination. Signed-offs were removed as a result but
	will be re-added if the authors are ok with it.

Implement home node preference in the scheduler's load-balancer.

- task_numa_hot(); make it harder to migrate tasks away from their
  home-node, controlled using the NUMA_HOMENODE_PREFERRED feature flag.

- load_balance(); during the regular pull load-balance pass, try
  pulling tasks that are on the wrong node first with a preference of
  moving them nearer to their home-node through task_numa_hot(), controlled
  through the NUMA_PULL feature flag.

- load_balance(); when the balancer finds no imbalance, introduce
  some such that it still prefers to move tasks towards their home-node,
  using active load-balance if needed, controlled through the NUMA_PULL_BIAS
  feature flag.

  In particular, only introduce this BIAS if the system is otherwise properly
  (weight) balanced and we either have an offnode or !numa task to trade
  for it.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
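
Conceptually, the split amounts to the following (sketch only; the real
accounting lives in account_numa_enqueue()/account_numa_dequeue() in
the diff below):

/* Sketch: which per-rq list a task is queued on after this patch */
static struct list_head *pick_task_list(struct rq *rq, struct task_struct *p)
{
	int home = tsk_home_node(p);

	/* tasks running away from their home node go on a separate list */
	if (home != -1 && home != cpu_to_node(task_cpu(p)))
		return &rq->offnode_tasks;

	return &rq->cfs_tasks;
}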

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h   |    3 +
 kernel/sched/core.c     |   14 ++-
 kernel/sched/debug.c    |    3 +
 kernel/sched/fair.c     |  298 +++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |   18 +++
 kernel/sched/sched.h    |   16 +++
 6 files changed, 324 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2677f22..7ebf32e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1481,6 +1482,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_BALANCE_NUMA
 	int home_node;
+	unsigned long numa_contrib;
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
@@ -2104,6 +2106,7 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern void sched_setnode(struct task_struct *p, int node);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55dcf53..3d9fc26 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5978,9 +5978,9 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
  * Requeues a task ensuring its on the right load-balance list so
  * that it might get migrated to its new home.
  *
- * Note that we cannot actively migrate ourselves since our callers
- * can be from atomic context. We rely on the regular load-balance
- * mechanisms to move us around -- its all preference anyway.
+ * Since home-node is pure preference there's no hard migrate to force
+ * us anywhere, this also allows us to call this from atomic context if
+ * required.
  */
 void sched_setnode(struct task_struct *p, int node)
 {
@@ -6053,6 +6053,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
@@ -6914,7 +6915,12 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
-
+#ifdef CONFIG_BALANCE_NUMA
+		INIT_LIST_HEAD(&rq->offnode_tasks);
+		rq->onnode_running = 0;
+		rq->offnode_running = 0;
+		rq->offnode_weight = 0;
+#endif
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_flags = 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..2474a02 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	SEQ_printf(m, " %d/%d", p->home_node, cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c242e8..a816bbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -775,6 +775,51 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 /**************************************************
+ * Scheduling class numa methods.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_BALANCE_NUMA
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		p->numa_contrib = task_h_load(p);
+		rq->offnode_weight += p->numa_contrib;
+		rq->offnode_running++;
+		tasks = &rq->offnode_tasks;
+	} else
+		rq->onnode_running++;
+
+	return tasks;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		rq->offnode_weight -= p->numa_contrib;
+		rq->offnode_running--;
+	} else
+		rq->onnode_running--;
+}
+#else
+#ifdef CONFIG_SMP
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	return NULL;
+}
+#endif
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
+/**************************************************
  * Scheduling class queueing methods:
  */
 
@@ -950,9 +995,17 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct task_struct *p = task_of(se);
+		struct list_head *tasks = &rq->cfs_tasks;
+
+		if (tsk_home_node(p) != -1)
+			tasks = account_numa_enqueue(rq, p);
+
+		list_add(&se->group_node, tasks);
+	}
+#endif /* CONFIG_SMP */
 	cfs_rq->nr_running++;
 }
 
@@ -962,8 +1015,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+
 		list_del_init(&se->group_node);
+
+		if (tsk_home_node(p) != -1)
+			account_numa_dequeue(rq_of(cfs_rq), p);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -3227,6 +3286,8 @@ struct lb_env {
 
 	unsigned int		flags;
 
+	struct list_head	*tasks;
+
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
@@ -3248,10 +3309,32 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 }
 
 /*
+ * Returns true if task should stay on the current node. The intent is that
+ * a task that is running on a node identified as the "home node" should
+ * stay there if possible
+ */
+static bool task_numa_hot(struct task_struct *p, struct lb_env *env)
+{
+	int from_dist, to_dist;
+	int node = tsk_home_node(p);
+
+	if (!sched_feat_numa(NUMA_HOMENODE_PREFERRED) || node == -1)
+		return false; /* no node preference */
+
+	from_dist = node_distance(cpu_to_node(env->src_cpu), node);
+	to_dist = node_distance(cpu_to_node(env->dst_cpu), node);
+
+	if (to_dist < from_dist)
+		return false; /* getting closer is ok */
+
+	return true; /* stick to where we are */
+}
+
+/*
  * Is this task likely cache-hot:
  */
 static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
@@ -3274,7 +3357,7 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
-	delta = now - p->se.exec_start;
+	delta = env->src_rq->clock_task - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
 }
@@ -3331,7 +3414,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) too many balance attempts have failed.
 	 */
 
-	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	tsk_cache_hot = task_hot(p, env);
+	if (env->idle == CPU_NOT_IDLE)
+		tsk_cache_hot |= task_numa_hot(p, env);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
@@ -3353,15 +3438,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
- * Returns 1 if successful and 0 otherwise.
+ * Returns true if successful and false otherwise.
  *
  * Called with both runqueues locked.
  */
-static int move_one_task(struct lb_env *env)
+static bool __move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
-	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3375,12 +3460,25 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
+static bool move_one_task(struct lb_env *env)
+{
+	if (sched_feat_numa(NUMA_HOMENODE_PULL)) {
+		env->tasks = offnode_tasks(env->src_rq);
+		if (__move_one_task(env))
+			return true;
+	}
+
+	env->tasks = &env->src_rq->cfs_tasks;
+	if (__move_one_task(env))
+		return true;
+
+	return false;
+}
 
 static const unsigned int sched_nr_migrate_break = 32;
 
@@ -3393,7 +3491,6 @@ static const unsigned int sched_nr_migrate_break = 32;
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3401,8 +3498,9 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+	while (!list_empty(env->tasks)) {
+		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3413,7 +3511,7 @@ static int move_tasks(struct lb_env *env)
 		if (env->loop > env->loop_break) {
 			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3441,7 +3539,7 @@ static int move_tasks(struct lb_env *env)
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3449,13 +3547,20 @@ static int move_tasks(struct lb_env *env)
 		 * weighted load.
 		 */
 		if (env->imbalance <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, env->tasks);
 	}
 
+	if (env->tasks == offnode_tasks(env->src_rq)) {
+		env->tasks = &env->src_rq->cfs_tasks;
+		env->loop = 0;
+		goto again;
+	}
+
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3574,12 +3679,13 @@ static inline void update_shares(int cpu)
 static inline void update_h_load(long cpu)
 {
 }
-
+#ifdef CONFIG_SMP
 static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.load.weight;
 }
 #endif
+#endif
 
 /********** Helpers for find_busiest_group ************************/
 /*
@@ -3610,6 +3716,14 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+#ifdef CONFIG_BALANCE_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+
+	unsigned long this_offnode_running;
+	unsigned long this_onnode_running;
+#endif
 };
 
 /*
@@ -3625,6 +3739,11 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_BALANCE_NUMA
+	unsigned long numa_offnode_weight;
+	unsigned long numa_offnode_running;
+	unsigned long numa_onnode_running;
+#endif
 };
 
 /**
@@ -3653,6 +3772,121 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->numa_offnode_weight += rq->offnode_weight;
+	sgs->numa_offnode_running += rq->offnode_running;
+	sgs->numa_onnode_running += rq->onnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ *
+ * Select a random group that has offnode tasks as sds->numa_group
+ */
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group) {
+		sds->this_offnode_running = sgs->numa_offnode_running;
+		sds->this_onnode_running  = sgs->numa_onnode_running;
+		return;
+	}
+
+	if (!sgs->numa_offnode_running)
+		return;
+
+	if (!sds->numa_group) {
+		sds->numa_group = group;
+		sds->numa_group_weight = sgs->numa_offnode_weight;
+		sds->numa_group_running = sgs->numa_offnode_running;
+	}
+}
+
+/*
+ * Pick a random queue from the group that has offnode tasks.
+ */
+static struct rq *find_busiest_numa_queue(struct lb_env *env,
+					  struct sched_group *group)
+{
+	struct rq *busiest = NULL, *rq;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(group), env->cpus) {
+		rq = cpu_rq(cpu);
+		if (!rq->offnode_running)
+			continue;
+		if (!busiest)
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+/*
+ * Called in case of no other imbalance. Returns true if there is a queue
+ * running offnode tasks which pretends we are imbalanced anyway to nudge these
+ * tasks towards their home node.
+ */
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	if (!sched_feat(NUMA_HOMENODE_PULL_BIAS))
+		return false;
+
+	if (!sds->numa_group)
+		return false;
+
+	/*
+	 * Only pull an offnode task home if we've got offnode or !numa tasks to trade for it.
+	 */
+	if (!sds->this_offnode_running &&
+	    !(sds->this_nr_running - sds->this_onnode_running - sds->this_offnode_running))
+		return false;
+
+	env->imbalance = sds->numa_group_weight / sds->numa_group_running;
+	sds->busiest = sds->numa_group;
+	env->find_busiest_queue = find_busiest_numa_queue;
+	return true;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return env->find_busiest_queue == find_busiest_numa_queue &&
+			env->src_rq->offnode_running == 1 &&
+			env->src_rq->nr_running == 1;
+}
+
+#else /* CONFIG_BALANCE_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline bool check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	return false;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return false;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -3868,6 +4102,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		update_sg_numa_stats(sgs, rq);
 	}
 
 	/*
@@ -4021,6 +4257,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4251,7 +4489,7 @@ find_busiest_group(struct lb_env *env, int *balance)
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || sds.busiest_nr_running == 0)
-		goto out_balanced;
+		goto ret;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
@@ -4273,14 +4511,14 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 * don't try and pull any tasks.
 	 */
 	if (sds.this_load >= sds.max_load)
-		goto out_balanced;
+		goto ret;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
 	if (sds.this_load >= sds.avg_load)
-		goto out_balanced;
+		goto ret;
 
 	if (env->idle == CPU_IDLE) {
 		/*
@@ -4307,6 +4545,9 @@ force_balance:
 	return sds.busiest;
 
 out_balanced:
+	if (check_numa_busiest_group(env, &sds))
+		return sds.busiest;
+
 ret:
 	env->imbalance = 0;
 	return NULL;
@@ -4385,6 +4626,9 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (need_active_numa_balance(env))
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -4437,6 +4681,8 @@ redo:
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
 	}
+	env.src_rq  = busiest;
+	env.src_cpu = busiest->cpu;
 
 	BUG_ON(busiest == env.dst_rq);
 
@@ -4455,6 +4701,10 @@ redo:
 		env.src_cpu   = busiest->cpu;
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		if (sched_feat_numa(NUMA_HOMENODE_PULL))
+			env.tasks = offnode_tasks(busiest);
+		else
+			env.tasks = &busiest->cfs_tasks;
 
 		update_h_load(env.src_cpu);
 more_balance:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7cfd289..4ae02cb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,22 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_BALANCE_NUMA
 SCHED_FEAT(NUMA,	true)
+
+/* Keep tasks running on their home node if possible */
+SCHED_FEAT(NUMA_HOMENODE_PREFERRED, true)
+
+/*
+ * During the regular pull load-balance pass, try pulling tasks that are
+ * running off their home node first, with a preference for moving them
+ * nearer their home node through task_numa_hot.
+ */
+SCHED_FEAT(NUMA_HOMENODE_PULL, true)
+
+/*
+ * When the balancer finds no imbalance, introduce some such that it
+ * still prefers to move tasks towards their home node, using active
+ * load-balance if needed.
+ */
+SCHED_FEAT(NUMA_HOMENODE_PULL_BIAS, true)
+
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a43241..3f0e5a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -418,6 +418,13 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_BALANCE_NUMA
+	unsigned long    onnode_running;
+	unsigned long    offnode_running;
+	unsigned long	 offnode_weight;
+	struct list_head offnode_tasks;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -469,6 +476,15 @@ struct rq {
 #endif
 };
 
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+#ifdef CONFIG_BALANCE_NUMA
+	return &rq->offnode_tasks;
+#else
+	return NULL;
+#endif
+}
+
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 29/31] sched: numa: CPU follows memory
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (27 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 28/31] sched: numa: Implement home-node awareness Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-14 11:20   ` Mel Gorman
  2012-11-13 11:12 ` [PATCH 30/31] mm: numa: Introduce last_nid to the page frame Mel Gorman
                   ` (2 subsequent siblings)
  31 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: This is heavily based on "autonuma: CPU follows memory algorithm"
	and "autonuma: mm_autonuma and task_autonuma data structures"
	with bits taken from both but reworked to fit within the scheduler
	hooks and home node mechanism as defined by schednuma.

This patch adds per-mm and per-task data structures to track the number
of faults in total and on a per-nid basis. On each NUMA fault it
checks if the system would benefit if the current task was migrated
to another node. If the task should be migrated, its home node is
updated and the task is requeued.
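
To make the decision concrete, here is a condensed userspace sketch of the
weight comparison performed by task_numa_find_placement() in the diff below.
The numbers and variable names are purely illustrative and the sketch is not
part of the patch:

#include <stdio.h>

#define BALANCENUMA_SCALE 1000

/* Share of a task's NUMA hinting faults that hit a given node, in 1/1000ths */
static long weight(unsigned long nid_faults, unsigned long total_faults)
{
	if (!total_faults)
		return 0;
	if (nid_faults > total_faults)
		nid_faults = total_faults;
	return nid_faults * BALANCENUMA_SCALE / total_faults;
}

int main(void)
{
	/* Hypothetical counts: task p took 70 of its 100 faults on a remote
	 * node and 20 on its current node; the task currently running on
	 * that remote node took only 30 of its 100 faults there. */
	long this_weight  = weight(70, 100);	/* p on the remote node */
	long p_weight     = weight(20, 100);	/* p on its current node */
	long other_weight = weight(30, 100);	/* incumbent on the remote node */

	long this_diff  = this_weight - p_weight;	/* p prefers the remote node... */
	long other_diff = this_weight - other_weight;	/* ...more than the incumbent does */

	if (this_diff > 0 && other_diff > 0)
		printf("swap home nodes, weight_diff=%ld\n", this_diff + other_diff);
	return 0;
}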

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |   26 +++++
 include/linux/sched.h    |   19 +++-
 kernel/fork.c            |   18 +++
 kernel/sched/core.c      |    3 +
 kernel/sched/fair.c      |  275 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |   14 +++
 6 files changed, 344 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b40f4ef..66172d6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -308,6 +308,29 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-mm structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults.
+ */
+struct mm_balancenuma {
+	/*
+	 * Number of pages that will trigger NUMA faults for this mm. The
+	 * total decays each time the home node choice is re-evaluated so
+	 * that only recent events are tracked
+	 */
+	unsigned long mm_numa_fault_tot;
+
+	/*
+	 * Number of pages that will trigger NUMA faults for each [nid].
+	 * Also decays.
+	 */
+	unsigned long mm_numa_fault[0];
+
+	/* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -411,6 +434,9 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
+
+	/* this is used by the scheduler and the page allocator */
+	struct mm_balancenuma *mm_balancenuma;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7ebf32e..336ec68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1188,6 +1188,23 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-task structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults. This structure is dynamically allocated
+ * when the first pte_numa fault is handled.
+ */
+struct task_balancenuma {
+	/* Total number of eligible pages that triggered NUMA faults */
+	unsigned long task_numa_fault_tot;
+
+	/* Number of pages that triggered NUMA faults for each [nid] */
+	unsigned long task_numa_fault[0];
+
+	/* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1488,6 +1505,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+	struct task_balancenuma *task_balancenuma;
 #endif /* CONFIG_BALANCE_NUMA */
 
 	struct rcu_head rcu;
@@ -2022,7 +2040,6 @@ extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
-extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..c8752f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -525,6 +525,20 @@ static void mm_init_aio(struct mm_struct *mm)
 #endif
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+	if (mm->mm_balancenuma)
+		kfree(mm->mm_balancenuma);
+
+	mm->mm_balancenuma = NULL;
+}
+#else
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+}
+#endif
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
 	atomic_set(&mm->mm_users, 1);
@@ -539,6 +553,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
+	mm->mm_balancenuma = NULL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 
@@ -548,6 +563,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_balancenuma(mm);
 	free_mm(mm);
 	return NULL;
 }
@@ -597,6 +613,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_balancenuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -854,6 +871,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_balancenuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d9fc26..9472d5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,6 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->task_balancenuma = NULL;
 	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
@@ -1787,6 +1788,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		free_task_balancenuma(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a816bbe..abcf7f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -836,15 +836,234 @@ unsigned int sysctl_balance_numa_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_balance_numa_scan_delay = 1000;
 
+#define BALANCENUMA_SCALE 1000
+static inline unsigned long balancenuma_weight(unsigned long nid_faults,
+					       unsigned long total_faults)
+{
+	if (nid_faults > total_faults)
+		nid_faults = total_faults;
+
+	return nid_faults * BALANCENUMA_SCALE / total_faults;
+}
+
+static inline unsigned long balancenuma_task_weight(struct task_struct *p,
+							int nid)
+{
+	struct task_balancenuma *task_balancenuma = p->task_balancenuma;
+	unsigned long nid_faults, total_faults;
+
+	nid_faults = task_balancenuma->task_numa_fault[nid];
+	total_faults = task_balancenuma->task_numa_fault_tot;
+	return balancenuma_weight(nid_faults, total_faults);
+}
+
+static inline unsigned long balancenuma_mm_weight(struct task_struct *p,
+							int nid)
+{
+	struct mm_balancenuma *mm_balancenuma = p->mm->mm_balancenuma;
+	unsigned long nid_faults, total_faults;
+
+	nid_faults = mm_balancenuma->mm_numa_fault[nid];
+	total_faults = mm_balancenuma->mm_numa_fault_tot;
+
+	/* It's possible for total_faults to decay to 0 in parallel so check */
+	return total_faults ? balancenuma_weight(nid_faults, total_faults) : 0;
+}
+
+/*
+ * Examine tasks running on other nodes to see if there would be fewer
+ * remote NUMA faults if this task swapped home nodes with one of them
+ */
+static void task_numa_find_placement(struct task_struct *p)
+{
+	struct cpumask *allowed = tsk_cpus_allowed(p);
+	int this_cpu = smp_processor_id();
+	int this_nid = numa_node_id();
+	long p_task_weight, p_mm_weight;
+	long weight_diff_max = 0;
+	struct task_struct *selected_task = NULL;
+	int selected_nid = -1;
+	int nid;
+
+	p_task_weight = balancenuma_task_weight(p, this_nid);
+	p_mm_weight = balancenuma_mm_weight(p, this_nid);
+
+	/* Examine a task on every other node */
+	for_each_online_node(nid) {
+		int cpu;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			struct rq *rq;
+			struct mm_struct *other_mm;
+			struct task_struct *other_task;
+			long this_weight, other_weight, p_weight;
+			long other_diff, this_diff;
+
+			if (!cpu_online(cpu) || idle_cpu(cpu))
+				continue;
+
+			/* Racy check if a task is running on the other rq */
+			rq = cpu_rq(cpu);
+			other_mm = rq->curr->mm;
+			if (!other_mm || !other_mm->mm_balancenuma)
+				continue;
+
+			/* Effectively pin the other task to get fault stats */
+			raw_spin_lock_irq(&rq->lock);
+			other_task = rq->curr;
+			other_mm = other_task->mm;
+
+			/* Ensure the other task has usable stats */
+			if (!other_task->task_balancenuma ||
+			    !other_task->task_balancenuma->task_numa_fault_tot ||
+			    !other_mm ||
+			    !other_mm->mm_balancenuma ||
+			    !other_mm->mm_balancenuma->mm_numa_fault_tot) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			/* Ensure the other task can be swapped */
+			if (!cpumask_test_cpu(this_cpu,
+					      tsk_cpus_allowed(other_task))) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			/*
+			 * Read the fault statistics. If the remote task is a
+			 * thread in the same process then use the task statistics.
+			 * Otherwise use the per-mm statistics.
+			 */
+			if (other_mm == p->mm) {
+				this_weight = balancenuma_task_weight(p, nid);
+				other_weight = balancenuma_task_weight(other_task, nid);
+				p_weight = p_task_weight;
+			} else {
+				this_weight = balancenuma_mm_weight(p, nid);
+				other_weight = balancenuma_mm_weight(other_task, nid);
+				p_weight = p_mm_weight;
+			}
+
+			raw_spin_unlock_irq(&rq->lock);
+
+			/*
+			 * other_diff: How much more does the current task prefer
+			 * to run on the remote node than the task that is
+			 * currently running there?
+			 */
+			other_diff = this_weight - other_weight;
+
+			/*
+			 * this_diff: How much does the current task prefer to
+			 * run on the remote NUMA node compared to the current
+			 * node?
+			 */
+			this_diff = this_weight - p_weight;
+
+			/*
+			 * Would swapping the tasks reduce the overall
+			 * cross-node NUMA faults?
+			 */
+			if (other_diff > 0 && this_diff > 0) {
+				long weight_diff = other_diff + this_diff;
+
+				/* Remember the best candidate. */
+				if (weight_diff > weight_diff_max) {
+					weight_diff_max = weight_diff;
+					selected_nid = nid;
+					selected_task = other_task;
+				}
+			}
+		}
+	}
+
+	/* Swap the task on the selected target node */
+	if (selected_nid != -1) {
+		sched_setnode(p, selected_nid);
+		sched_setnode(selected_task, this_nid);
+	}
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
+	unsigned long task_total, mm_total;
+	struct mm_balancenuma *mm_balancenuma;
+	struct task_balancenuma *task_balancenuma;
+	unsigned long mm_max_weight, task_max_weight;
+	int this_nid, nid, mm_selected_nid, task_selected_nid;
+
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	this_nid = numa_node_id();
+	mm_balancenuma = p->mm->mm_balancenuma;
+	task_balancenuma = p->task_balancenuma;
+
+	/* If the task has no NUMA hinting page faults, use current nid */
+	mm_total = ACCESS_ONCE(mm_balancenuma->mm_numa_fault_tot);
+	if (!mm_total)
+		return;
+	task_total = task_balancenuma->task_numa_fault_tot;
+	if (!task_total)
+		return;
+
+	/*
+	 * Identify the NUMA node where this thread (task_struct), and
+	 * the process (mm_struct) as a whole, has the largest number
+	 * of NUMA faults
+	 */
+	mm_selected_nid = task_selected_nid = -1;
+	mm_max_weight = task_max_weight = 0;
+	for_each_online_node(nid) {
+		unsigned long mm_nid_fault, task_nid_fault;
+		unsigned long mm_numa_weight, task_numa_weight;
+
+		/* Read the number of task and mm faults on node */
+		mm_nid_fault = ACCESS_ONCE(mm_balancenuma->mm_numa_fault[nid]);
+		task_nid_fault = task_balancenuma->task_numa_fault[nid];
+
+		/*
+		 * The weights are the relative number of pte_numa faults that
+		 * were handled on this node in comparison to all pte_numa faults
+		 * overall
+		 */
+		mm_numa_weight = balancenuma_weight(mm_nid_fault, mm_total);
+		task_numa_weight = balancenuma_weight(task_nid_fault, task_total);
+		if (mm_numa_weight > mm_max_weight) {
+			mm_max_weight = mm_numa_weight;
+			mm_selected_nid = nid;
+		}
+		if (task_numa_weight > task_max_weight) {
+			task_max_weight = task_numa_weight;
+			task_selected_nid = nid;
+		}
+
+		/* Decay the stats by a factor of 2 */
+		p->mm->mm_balancenuma->mm_numa_fault[nid] >>= 1;
+	}
+
+	/* Recheck for a usable task_numa_fault_tot after decaying */
+	if (!task_balancenuma->task_numa_fault_tot ||
+	    !mm_balancenuma->mm_numa_fault_tot)
+		return;
+
+	/*
+	 * If this NUMA node is the selected one based on process
+	 * memory and task NUMA faults then set the home node.
+	 * There should be no need to requeue the task.
+	 */
+	if (task_selected_nid == this_nid && mm_selected_nid == this_nid) {
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+					  p->numa_scan_period * 2);
+		p->home_node = this_nid;
+		return;
+	}
+
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	task_numa_find_placement(p);
 }
 
 /*
@@ -854,7 +1073,30 @@ void task_numa_fault(int node, int pages, bool misplaced)
 {
 	struct task_struct *p = current;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	if (!p->task_balancenuma) {
+		int size = sizeof(struct task_balancenuma) +
+				(sizeof(unsigned long) * nr_node_ids);
+		p->task_balancenuma = kzalloc(size, GFP_KERNEL);
+		if (!p->task_balancenuma)
+			return;
+	}
+
+	if (!p->mm->mm_balancenuma) {
+		int size = sizeof(struct mm_balancenuma) +
+				(sizeof(unsigned long) * nr_node_ids);
+		p->mm->mm_balancenuma = kzalloc(size, GFP_KERNEL);
+		if (!p->mm->mm_balancenuma) {
+			kfree(p->task_balancenuma);
+			p->task_balancenuma = NULL;
+			return;
+		}
+	}
+
+	/* Record fault statistics */
+	p->task_balancenuma->task_numa_fault_tot++;
+	p->task_balancenuma->task_numa_fault[node]++;
+	p->mm->mm_balancenuma->mm_numa_fault_tot++;
+	p->mm->mm_balancenuma->mm_numa_fault[node]++;
 
 	/*
 	 * task_numa_placement can be expensive so only call it if pages were
@@ -864,6 +1106,21 @@ void task_numa_fault(int node, int pages, bool misplaced)
 		task_numa_placement(p);
 }
 
+static void reset_ptenuma_scan(struct task_struct *p)
+{
+	ACCESS_ONCE(p->mm->numa_scan_seq)++;
+	
+	if (p->mm && p->mm->mm_balancenuma)
+		p->mm->mm_balancenuma->mm_numa_fault_tot >>= 1;
+	if (p->task_balancenuma) {
+		int nid;
+		p->task_balancenuma->task_numa_fault_tot >>= 1;
+		for_each_online_node(nid) {
+			p->task_balancenuma->task_numa_fault[nid] >>= 1;
+		}
+	}
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -912,7 +1169,7 @@ void task_numa_work(struct callback_head *work)
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, offset);
 	if (!vma) {
-		ACCESS_ONCE(mm->numa_scan_seq)++;
+		reset_ptenuma_scan(p);
 		offset = 0;
 		vma = mm->mmap;
 	}
@@ -937,14 +1194,12 @@ void task_numa_work(struct callback_head *work)
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to the vma_migratable. If they are not, we would find the
 	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * so we must check it now.
 	 */
-	if (!vma) {
-		ACCESS_ONCE(mm->numa_scan_seq)++;
-		offset = 0;
-		vma = mm->mmap;
-	}
-	mm->numa_scan_offset = offset;
+	if (vma)
+		mm->numa_scan_offset = offset;
+	else
+		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3f0e5a1..92df3d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,6 +502,20 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+
+#ifdef CONFIG_BALANCE_NUMA
+static inline void free_task_balancenuma(struct task_struct *p)
+{
+	if (p->task_balancenuma)
+		kfree(p->task_balancenuma);
+	p->task_balancenuma = NULL;
+}
+#else
+static inline void free_task_balancenuma(struct task_struct *p)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 30/31] mm: numa: Introduce last_nid to the page frame
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (28 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 29/31] sched: numa: CPU follows memory Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  2012-11-13 11:13 ` [PATCH 31/31] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
  2012-11-13 15:14 ` [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Ingo Molnar
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.
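
As a rough standalone illustration of why an exchange is a convenient
primitive for that filter -- it records the new node and reports the
previous one in a single step -- consider this userspace sketch (names and
the atomic type are stand-ins, not kernel code):

#include <stdatomic.h>
#include <stdio.h>

static _Atomic int last_nid = -1;	/* stands in for page->_last_nid */

/* Migrate only if the same node faulted on this page twice in a row */
static int filter_allows_migrate(int polnid)
{
	int prev = atomic_exchange(&last_nid, polnid);
	return prev == polnid;
}

int main(void)
{
	int a = filter_allows_migrate(1);	/* 0: first reference from node 1 */
	int b = filter_allows_migrate(1);	/* 1: second consecutive reference */
	int c = filter_allows_migrate(2);	/* 0: different node, filter resets */

	printf("%d %d %d\n", a, b, c);
	return 0;
}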

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h       |   30 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |    4 ++++
 mm/page_alloc.c          |    2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index eed70f8..d8ef261 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,36 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page->_last_nid;
+}
+static inline void reset_page_last_nid(struct page *page)
+{
+	page->_last_nid = -1;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}
+
+static inline void reset_page_last_nid(struct page *page)
+{
+}
+#endif
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 66172d6..9588a91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -175,6 +175,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4681fc4..7e337df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,6 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	reset_page_last_nid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3826,6 +3827,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
+		reset_page_last_nid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 31/31] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (29 preceding siblings ...)
  2012-11-13 11:12 ` [PATCH 30/31] mm: numa: Introduce last_nid to the page frame Mel Gorman
@ 2012-11-13 11:13 ` Mel Gorman
  2012-11-13 15:14 ` [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Ingo Molnar
  31 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 11:13 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than exist within the node or the node might over-subscribed
with unrelated processes.

This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
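
To put rough, purely illustrative numbers on the filter: if a task accounts
for 25% of the hinting faults on a page then P(p) = 0.25, and the chance
that two consecutive independent samples both come from that task's node is

	P(p)^2 = 0.25 * 0.25 = 0.0625

so a short or unlikely task<->page relationship rarely passes the filter,
while a node that generates nearly all of the faults (P(p) close to 1)
still migrates the page almost immediately.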

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 076f8f8..89696d7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2425,6 +2425,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate pages towards their home node or the referencing node */
 	if (pol->flags & MPOL_F_HOME) {
+		int last_nid;
+
 		/*
 		 * Make a placement decision based on the home node.
 		 * NOTE: Potentially this can result in a remote->remote
@@ -2437,6 +2439,31 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 			/* No home node, migrate to the referencing node */
 			polnid = numa_node_id();
 		}
+
+		/*
+		 * Multi-stage node selection is used in conjunction
+		 * with a periodic migration fault to build a temporal
+		 * task<->page relation. By using a two-stage filter we
+		 * remove short/unlikely relations.
+		 *
+		 * Using P(p) ~ n_p / n_t as per frequentist
+		 * probability, we can equate a task's usage of a
+		 * particular page (n_p) per total usage of this
+		 * page (n_t) (in a given time-span) to a probability.
+		 *
+		 * Our periodic faults will sample this probability and
+		 * getting the same result twice in a row, given these
+		 * samples are fully independent, is then given by
+		 * P(p)^2, provided our sample period is sufficiently
+		 * short compared to the usage pattern.
+		 *
+		 * This quadratic squishes small probabilities, making
+		 * it less likely we act on an unlikely task<->page
+		 * relation.
+		 */
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
 	}
 
 	if (curnid != polnid)
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
                   ` (30 preceding siblings ...)
  2012-11-13 11:13 ` [PATCH 31/31] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
@ 2012-11-13 15:14 ` Ingo Molnar
  2012-11-13 15:42   ` Mel Gorman
  31 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-11-13 15:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> (Since I wrote this changelog there has been another release 
> of schednuma. I had delayed releasing this series long enough 
> and decided not to delay further. Of course, I plan to dig 
> into that new revision and see what has changed.)

Thanks, I've picked up a number of cleanups from your series and 
propagated them into tip:numa/core tree.

FYI, in addition to the specific patches to which I replied 
earlier today, I've also propagated all your:

   CONFIG_SCHED_NUMA -> CONFIG_BALANCE_NUMA

renames throughout the patches - I fundamentally agree that 
CONFIG_BALANCE_NUMA is a better, more generic name.

My structural criticism of the architecture specific bits of 
your patch-queue still applies to this version as well. That 
change forced much of the churn you had to do to Peter's 
patches. It blew up the size of your tree and forked the 
code into per-architecture variants for no good reason.

Had you not done that and had you kept the code generic you'd 
essentially end up close to where tip:numa/core is today.

So if we can clear that core issue up we'll have quite a bit of 
agreement.

I'd also like to add another, structural side note: you mixed 
new vm-stats bits into the whole queue, needlessly blowing up 
the size and the mm/ specific portions of the tree. I'd suggest 
to post and keep those bits separately, preferably on top of 
what we have already once it has settled down. I'm keeping the 
'perf bench numa' bits separate as well.

Anyway, I've applied all applicable cleanups from you and picked 
up Peter's latest code with the modifications I've indicated in 
that thread, to the latest tip:numa/core tree, which I'll send 
out for review in the next hour or so.

This version is supposed to address all review feedback received 
so far: it refines the MM specific split-up of the patches, 
fixes regressions - see the changelogs for more details.

I'll (re-)send the full series of the latest patches and any 
additional feedback will be welcome.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
  2012-11-13 15:14 ` [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Ingo Molnar
@ 2012-11-13 15:42   ` Mel Gorman
  2012-11-13 17:27     ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2012-11-13 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Tue, Nov 13, 2012 at 04:14:16PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > (Since I wrote this changelog there has been another release 
> > of schednuma. I had delayed releasing this series long enough 
> > and decided not to delay further. Of course, I plan to dig 
> > into that new revision and see what has changed.)
> 
> Thanks, I've picked up a number of cleanups from your series and 
> propagated them into tip:numa/core tree.
> 

Cool.

> FYI, in addition to the specific patches to which I replied 
> earlier today, I've also propagated all your:
> 
>    CONFIG_SCHED_NUMA -> CONFIG_BALANCE_NUMA
> 
> renames throughout the patches - I fundamentally agree that 
> CONFIG_BALANCE_NUMA is a better, more generic name.
> 
> My structural criticism of the architecture specific bits of 
> your patch-queue still applies to this version as well. That 
> change forced much of the churn you had to do to Peter's 
> patches. It blew up the size of your tree and forked the 
> code into per-architecture variants for no good reason.
> 

Should be fairly easy to do what you described -- move to generic and
make weak functions. PAGE_NUMA still has to be defined per architecture
because each will need to update its pte_present, pmd_present and pmd_bad
to match, but I do not necessarily consider this to be a bad thing.
Initially, enabling automatic NUMA support should be a careful choice until
we can be 100% sure that PROT_NONE is equivalent in all cases. A prototype
that moves the definitions to mm/pgtable-generic.c is below.

There is still the task of converting change_prot_numa() to reuse
change_protection if PAGE_NUMA == PROT_NONE but that should be
straightforward.

> Had you not done that and had you kept the code generic you'd 
> essentially end up close to where tip:numa/core is today.
> 
> So if we can clear that core issue up we'll have quite a bit of 
> agreement.
> 
> I'd also like to add another, structural side note: you mixed 
> new vm-stats bits into the whole queue, needlessly blowing up 
> the size and the mm/ specific portions of the tree. I'd suggest 
> to post and keep those bits separately, preferably on top of 
> what we have already once it has settled down. I'm keeping the 
> 'perf bench numa' bits separate as well.

The stats patches are fairly late in the queue. I noticed they break the
build for !CONFIG_BALANCE_NUMA but it was trivially resolved. I feel they
are important because history shows the cost of all the balancing
implementations to be fairly high. One can use profiles to see where
some of the cost is but I also find the vmstats helpful in figuring out
how much work the balancer is doing. They can be dropped again if they
are not considered generally useful.

> 
> Anyway, I've applied all applicable cleanups from you and picked 
> up Peter's latest code with the modifications I've indicated in 
> that thread, to the latest tip:numa/core tree, which I'll send 
> out for review in the next hour or so.
> 

Ok.

> This version is supposed to address all review feedback received 
> so far: it refines the MM specific split-up of the patches, 
> fixes regressions - see the changelogs for more details.
> 
> I'll (re-)send the full series of the latest patches and any 
> additional feedback will be welcome.
> 

Thanks

---8<---
mm: numa: Make pte_numa() and pmd_numa() a generic implementation

It was pointed out by Ingo Molnar that the per-architecture definition of
the NUMA PTE helper functions means that each supporting architecture
will have to cut and paste them, which is unfortunate. He suggested instead
that the helpers should be weak functions that can be overridden by the
architecture.

This patch moves the helpers to mm/pgtable-generic.c and makes them weak
functions. Architectures wishing to use this will still be required to
define _PAGE_NUMA and potentially update their p[te|md]_present and
pmd_bad helpers if they choose to make PAGE_NUMA similar to PROT_NONE.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   56 +---------------------------------------
 include/asm-generic/pgtable.h  |   17 +++++-------
 mm/pgtable-generic.c           |   53 +++++++++++++++++++++++++++++++++++++
 3 files changed, 60 insertions(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e075d57..4a4c11c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -425,61 +425,6 @@ static inline int pmd_present(pmd_t pmd)
 				 _PAGE_NUMA);
 }
 
-#ifdef CONFIG_BALANCE_NUMA
-/*
- * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
- * same bit too). It's set only when _PAGE_PRESET is not set and it's
- * never set if _PAGE_PRESENT is set.
- *
- * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
- * fault triggers on those regions if pte/pmd_numa returns true
- * (because _PAGE_PRESENT is not set).
- */
-static inline int pte_numa(pte_t pte)
-{
-	return (pte_flags(pte) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-static inline int pmd_numa(pmd_t pmd)
-{
-	return (pmd_flags(pmd) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-#endif
-
-/*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
- */
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	pte = pte_clear_flags(pte, _PAGE_NUMA);
-	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
-	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
-}
-
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	pte = pte_set_flags(pte, _PAGE_NUMA);
-	return pte_clear_flags(pte, _PAGE_PRESENT);
-}
-
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
-}
-
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
@@ -534,6 +479,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
 }
 
+extern int pmd_numa(pmd_t pmd);
 static inline int pmd_bad(pmd_t pmd)
 {
 #ifdef CONFIG_BALANCE_NUMA
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 896667e..da3e761 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -554,17 +554,12 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
-#ifndef CONFIG_BALANCE_NUMA
-static inline int pte_numa(pte_t pte)
-{
-	return 0;
-}
-
-static inline int pmd_numa(pmd_t pmd)
-{
-	return 0;
-}
-#endif /* CONFIG_BALANCE_NUMA */
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..6b6507f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -170,3 +170,56 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
+
+/*
+ * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+__weak int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+__weak int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * didn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+__weak pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+__weak pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+__weak pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+__weak pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
  2012-11-13 15:42   ` Mel Gorman
@ 2012-11-13 17:27     ` Ingo Molnar
  2012-11-14  4:09       ` Rik van Riel
  2012-11-14 12:24       ` Mel Gorman
  0 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2012-11-13 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> > I'd also like to add another, structural side note: you 
> > mixed new vm-stats bits into the whole queue, needlessly 
> > blowing up the size and the mm/ specific portions of the 
> > tree. I'd suggest to post and keep those bits separately, 
> > preferably on top of what we have already once it has 
> > settled down. I'm keeping the 'perf bench numa' bits 
> > separate as well.
> 
> The stats part are fairly late in the queue. I noticed they 
> break build for !CONFIG_BALANCE_NUMA but it was trivially 
> resolved. [...]

Ok - the vm-stats bits are the last larger item remaining that 
I've seen - could you please redo any of your changes on top of 
the latest tip:numa/core tree, to make them easier for me to 
pick up?

Your tree is slowly becoming a rebase of tip:numa/core and that 
will certainly cause problems.

I'll backmerge any delta patches and rebase as necessary - but 
please do them as deltas on top of tip:numa/core to make things 
reviewable and easier to merge:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core

Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
  2012-11-13 17:27     ` Ingo Molnar
@ 2012-11-14  4:09       ` Rik van Riel
  2012-11-14 12:24       ` Mel Gorman
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2012-11-14  4:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 12:27 PM, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
>>> I'd also like to add another, structural side note: you
>>> mixed new vm-stats bits into the whole queue, needlessly
>>> blowing up the size and the mm/ specific portions of the
>>> tree. I'd suggest to post and keep those bits separately,
>>> preferably on top of what we have already once it has
>>> settled down. I'm keeping the 'perf bench numa' bits
>>> separate as well.
>>
>> The stats part are fairly late in the queue. I noticed they
>> break build for !CONFIG_BALANCE_NUMA but it was trivially
>> resolved. [...]
>
> Ok - the vm-stats bits are the last larger item remaining that
> I've seen - could you please redo any of your changes on top of
> the latest tip:numa/core tree, to make them easier for me to
> pick up?
 >
> Your tree is slowly becoming a rebase of tip:numa/core and that
> will certainly cause problems.

Mel's tree looks like the easier of the two to merge
from an mm/ point of view.

> I'll backmerge any delta patches and rebase as necessary - but
> please do them as deltas on top of tip:numa/core to make things
> reviewable and easier to merge:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core
>
> Thanks!
>
> 	Ingo
>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 29/31] sched: numa: CPU follows memory
  2012-11-13 11:12 ` [PATCH 29/31] sched: numa: CPU follows memory Mel Gorman
@ 2012-11-14 11:20   ` Mel Gorman
  0 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-14 11:20 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Tue, Nov 13, 2012 at 11:12:58AM +0000, Mel Gorman wrote:
> @@ -864,6 +1106,21 @@ void task_numa_fault(int node, int pages, bool misplaced)
>  		task_numa_placement(p);
>  }
>  
> +static void reset_ptenuma_scan(struct task_struct *p)
> +{
> +	ACCESS_ONCE(p->mm->numa_scan_seq)++;
> +	
> +	if (p->mm && p->mm->mm_balancenuma)
> +		p->mm->mm_balancenuma->mm_numa_fault_tot >>= 1;
> +	if (p->task_balancenuma) {
> +		int nid;
> +		p->task_balancenuma->task_numa_fault_tot >>= 1;
> +		for_each_online_node(nid) {
> +			p->task_balancenuma->task_numa_fault[nid] >>= 1;
> +		}
> +	}
> +}

Overnight tests indicated that cpu-follows is currently broken in this
series but a large part of the problem is a missing

p->mm->numa_scan_offset = 0;

here. Its absence means that all tasks are only considered for convergence
once without proper resetting of the scanner. It effectively becomes the
vanilla kernel with a bunch of system CPU overhead.
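
For clarity, this is the function quoted above with that reset added
(untested, just to show where the assignment belongs):

static void reset_ptenuma_scan(struct task_struct *p)
{
	ACCESS_ONCE(p->mm->numa_scan_seq)++;
	p->mm->numa_scan_offset = 0;	/* restart the scanner from the start */

	if (p->mm && p->mm->mm_balancenuma)
		p->mm->mm_balancenuma->mm_numa_fault_tot >>= 1;
	if (p->task_balancenuma) {
		int nid;
		p->task_balancenuma->task_numa_fault_tot >>= 1;
		for_each_online_node(nid) {
			p->task_balancenuma->task_numa_fault[nid] >>= 1;
		}
	}
}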

Of course it's not the only problem with this patch as the overhead of
finding a proper placement is massive and, due to the slow scanning rate,
it converges very slowly. While it's based on autonuma, autonuma did the
same job outside the context of a process so it's not exactly equivalent.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
  2012-11-13 17:27     ` Ingo Molnar
  2012-11-14  4:09       ` Rik van Riel
@ 2012-11-14 12:24       ` Mel Gorman
  1 sibling, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-14 12:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Tue, Nov 13, 2012 at 06:27:34PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > I'd also like to add another, structural side note: you 
> > > mixed new vm-stats bits into the whole queue, needlessly 
> > > blowing up the size and the mm/ specific portions of the 
> > > tree. I'd suggest to post and keep those bits separately, 
> > > preferably on top of what we have already once it has 
> > > settled down. I'm keeping the 'perf bench numa' bits 
> > > separate as well.
> > 
> > The stats part are fairly late in the queue. I noticed they 
> > break build for !CONFIG_BALANCE_NUMA but it was trivially 
> > resolved. [...]
> 
> Ok - the vm-stats bits are the last larger item remaining that 
> I've seen - could you please redo any of your changes on top of 
> the latest tip:numa/core tree, to make them easier for me to 
> pick up?
> 

I don't think it's that simple. I can rebase the stats patch on top without
too much effort of course but it's hardly a critical element. If the
stats were unavailable it would make no difference at all and no one would
lose any sleep over it. The greater issue for me is that superficially it
appears that a lot of the previous review comments still apply

prot_none still appears to be hard-coded (change_prot_none f.e.)
pick_numa_rand is still not random
THP migration optimisation is before patches, does schednuma depend on
	this optimisation? Dunno
cannot be disabled from command line in case it goes pear shaped
the new numa balancing is a massive monolithic patch with little comment
	(I have not reached the point yet where I'm ready to pick apart
	how and why it works and tests will not start until tonight)
the page-flags splitout is still a monolithic patch (although not a
	major concern in this case)
I think your scanner might not be restarting if the last VMA in the
	process is !vma_migratable. If true, it will not adapt to
	new information.
MIGRATE_FAULT is still there even though it's not clear it's even
	necessary

etc. I didn't go back through the old thread. I know I also have not applied
the same review issues to myself and it sounds like I'm being hypocritical
but I'm also not trying to merge. I also know that I'm currently way behind
in terms of overall performance reflecting the relative age of the tree.

> Your tree is slowly becoming a rebase of tip:numa/core and that 
> will certainly cause problems.
> 

How so? What I'm trying to do is build a tree that shows the logical
progression of getting from the vanilla kernel to a working NUMA
balancer. It's not in linux-next colliding with your tree or causing a
direct problem. I intend to expose a git tree of it shortly but am not
planning on asking it to be pulled because I know it's not ready.

> I'll backmerge any delta patches and rebase as necessary - but 
> please do them as deltas on top of tip:numa/core to make things 
> reviewable and easier to merge:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core
> 

It is a stretch to describe a git tree that requires a significant number
of scheduler patches to even apply and includes a monolithic patch like
"sched, numa, mm: Add adaptive NUMA affinity support" as "reviewable".

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-11-13 11:12 ` [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
@ 2012-11-14 17:13   ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2012-11-14 17:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 06:12 AM, Mel Gorman wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> When we split a transparent hugepage, transfer the NUMA type from the
> pmd to the pte if needed.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages
  2012-11-13 11:12 ` [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages Mel Gorman
@ 2012-11-14 17:58   ` Rik van Riel
  2012-11-14 18:18     ` Mel Gorman
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2012-11-14 17:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 06:12 AM, Mel Gorman wrote:
> task_numa_placement is potentially very expensive so limit it to being
> called when a page is misplaced. How necessary this is depends on
> the placement policy.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

That reads like a premature optimization :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page
  2012-11-13 11:12 ` [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
@ 2012-11-14 18:00   ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2012-11-14 18:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 06:12 AM, Mel Gorman wrote:
> The pte_same check after a misplaced page is successfully migrated will
> never succeed and force a double fault to fix it up as pointed out by Rik
> van Riel. This was the "safe" option but it's expensive.
>
> This patch uses the migration allocation callback to record the location
> of the newly migrated page. If the page is the same when the PTE lock is
> reacquired it is assumed that it is safe to complete the pte_numa fault
> without incurring a double fault.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages
  2012-11-14 17:58   ` Rik van Riel
@ 2012-11-14 18:18     ` Mel Gorman
  0 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-14 18:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Wed, Nov 14, 2012 at 12:58:37PM -0500, Rik van Riel wrote:
> On 11/13/2012 06:12 AM, Mel Gorman wrote:
> >task_numa_placement is potentially very expensive so limit it to being
> >called when a page is misplaced. How necessary this is depends on
> >the placement policy.
> >
> >Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> That reads like a premature optimization :)
> 

Hah, touche.

Will shuffle its location in the series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 26/31] sched: numa: Make mempolicy home-node aware
  2012-11-13 11:12 ` [PATCH 26/31] sched: numa: Make mempolicy home-node aware Mel Gorman
@ 2012-11-14 18:22   ` Rik van Riel
  2012-11-14 18:50     ` Mel Gorman
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2012-11-14 18:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 06:12 AM, Mel Gorman wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> Add another layer of fallback policy to make the home node concept
> useful from a memory allocation PoV.
>
> This changes the mpol order to:
>
>   - vma->vm_ops->get_policy	[if applicable]
>   - vma->vm_policy		[if applicable]
>   - task->mempolicy
>   - tsk_home_node() preferred	[NEW]
>   - default_policy

Why is the home node policy not the default policy?

> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> facilitate efficient on-demand memory migration.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Paul Turner <pjt@google.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>




-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 27/31] sched: numa: Make find_busiest_queue() a method
  2012-11-13 11:12 ` [PATCH 27/31] sched: numa: Make find_busiest_queue() a method Mel Gorman
@ 2012-11-14 18:25   ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2012-11-14 18:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/13/2012 06:12 AM, Mel Gorman wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> It's a bit awkward but it was the least painful means of modifying the
> queue selection. Used in the next patch to conditionally use a queue.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Paul Turner <pjt@google.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 26/31] sched: numa: Make mempolicy home-node aware
  2012-11-14 18:22   ` Rik van Riel
@ 2012-11-14 18:50     ` Mel Gorman
  0 siblings, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2012-11-14 18:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Wed, Nov 14, 2012 at 01:22:33PM -0500, Rik van Riel wrote:
> On 11/13/2012 06:12 AM, Mel Gorman wrote:
> >From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >
> >Add another layer of fallback policy to make the home node concept
> >useful from a memory allocation PoV.
> >
> >This changes the mpol order to:
> >
> >  - vma->vm_ops->get_policy	[if applicable]
> >  - vma->vm_policy		[if applicable]
> >  - task->mempolicy
> >  - tsk_home_node() preferred	[NEW]
> >  - default_policy
> 
> Why is the home node policy not the default policy?
> 

hmm, it effectively is if there is no other policy set. The changelog is
a bit misleading. In V3, this will be dropped entirely. It was not clear
that doing a remote alloc for home nodes was a good idea. Instead, memory
is always allocated locally to the faulting process as normal and
migrated later if necessary.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-11-14 18:50 UTC | newest]

Thread overview: 45+ messages
2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
2012-11-13 11:12 ` [PATCH 01/31] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
2012-11-13 11:12 ` [PATCH 02/31] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
2012-11-13 11:12 ` [PATCH 03/31] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
2012-11-13 11:12 ` [PATCH 04/31] mm: numa: define _PAGE_NUMA Mel Gorman
2012-11-13 11:12 ` [PATCH 05/31] mm: numa: pte_numa() and pmd_numa() Mel Gorman
2012-11-13 11:12 ` [PATCH 06/31] mm: numa: teach gup_fast about pmd_numa Mel Gorman
2012-11-13 11:12 ` [PATCH 07/31] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
2012-11-14 17:13   ` Rik van Riel
2012-11-13 11:12 ` [PATCH 08/31] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
2012-11-13 11:12 ` [PATCH 09/31] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
2012-11-13 11:12 ` [PATCH 10/31] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
2012-11-13 11:12 ` [PATCH 11/31] mm: mempolicy: Check for misplaced page Mel Gorman
2012-11-13 11:12 ` [PATCH 12/31] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
2012-11-13 11:12 ` [PATCH 13/31] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
2012-11-13 11:12 ` [PATCH 14/31] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
2012-11-13 11:12 ` [PATCH 15/31] mm: numa: Add fault driven placement and migration Mel Gorman
2012-11-13 11:12 ` [PATCH 16/31] mm: numa: Only call task_numa_placement for misplaced pages Mel Gorman
2012-11-14 17:58   ` Rik van Riel
2012-11-14 18:18     ` Mel Gorman
2012-11-13 11:12 ` [PATCH 17/31] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
2012-11-14 18:00   ` Rik van Riel
2012-11-13 11:12 ` [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
2012-11-13 11:12 ` [PATCH 19/31] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
2012-11-13 11:12 ` [PATCH 20/31] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
2012-11-13 11:12 ` [PATCH 21/31] mm: numa: Migrate on reference policy Mel Gorman
2012-11-13 11:12 ` [PATCH 22/31] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
2012-11-13 11:12 ` [PATCH 23/31] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
2012-11-13 11:12 ` [PATCH 24/31] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
2012-11-13 11:12 ` [PATCH 25/31] sched: numa: Introduce tsk_home_node() Mel Gorman
2012-11-13 11:12 ` [PATCH 26/31] sched: numa: Make mempolicy home-node aware Mel Gorman
2012-11-14 18:22   ` Rik van Riel
2012-11-14 18:50     ` Mel Gorman
2012-11-13 11:12 ` [PATCH 27/31] sched: numa: Make find_busiest_queue() a method Mel Gorman
2012-11-14 18:25   ` Rik van Riel
2012-11-13 11:12 ` [PATCH 28/31] sched: numa: Implement home-node awareness Mel Gorman
2012-11-13 11:12 ` [PATCH 29/31] sched: numa: CPU follows memory Mel Gorman
2012-11-14 11:20   ` Mel Gorman
2012-11-13 11:12 ` [PATCH 30/31] mm: numa: Introduce last_nid to the page frame Mel Gorman
2012-11-13 11:13 ` [PATCH 31/31] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
2012-11-13 15:14 ` [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Ingo Molnar
2012-11-13 15:42   ` Mel Gorman
2012-11-13 17:27     ` Ingo Molnar
2012-11-14  4:09       ` Rik van Riel
2012-11-14 12:24       ` Mel Gorman
