* [PATCH/RFC 0/8] numa - Migrate-on-Fault
@ 2010-11-11 19:44 Lee Schermerhorn
  2010-11-11 19:44 ` [PATCH/RFC 1/8] numa - Migrate-on-Fault - add Kconfig option Lee Schermerhorn
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:44 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

[RFC] Migrate-on-fault a.k.a Lazy Page Migration

At the Linux Plumbers Conference, Andi Kleen again encouraged me
to resubmit my automatic page migration patches because he thinks
they will be useful for virtualization.  Later, in the Virtualization
mini-conf, the subject came up during a presentation about adding
NUMA awareness to qemu/kvm.  After the presentation, I discussed
these series with Andrea Arcangeli and he also encouraged me to
post them.  My position within HP has changed such that I'm not
sure how much time I'll have to spend on this area, nor whether I'll
have access to the larger NUMA platforms on which to test the
patches thoroughly.  However, here is the second of 4 series that
comprise my shared policy enhancements and lazy/auto-migration
enhancements.

I have rebased the patches against a recent mmotm tree.  This
rebase built cleanly, booted and passed a few ad hoc tests on
x86_64.  I've made a pass over the patch descriptions to update
them.  If there is sufficient interest in merging this, I'll
do what I can to assist in the completion and testing of the
series.

Based atop the previously posted:

1) Shared policy cleanup, fixes, mapped file policy

To follow:

3)  Auto [as in "self"] migration facility
4)  a Migration Cache -- originally written by Marcelo Tosatti

I'll announce this series and the automatic/lazy migration series
to follow on lkml, linux-mm, ...  However, I'll limit the actual
posting to linux-numa to avoid spamming the other lists.

---

This series of patches implements page migration in the fault path.

!!! N.B., Need to consider interaction with KSM and Transparent Huge
!!! Pages.

The basic idea is that when a fault handler such as do_swap_page()
finds a cached page with zero mappings that is otherwise "stable"--
e.g., no I/O in progress--this is a good opportunity to check whether the
page resides on the node indicated by the mempolicy in the current context.

We only attempt to migrate when there are zero mappings because 1) the
page can then be migrated easily--no need to remove all of its mappings
first--and 2) default policy--a common case--can give different
answers from different tasks running on different nodes.  Checking the
policy when there are zero mappings effectively implements a "first touch"
placement policy.
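
To make the intended flow concrete, here is a rough sketch of the
check in the fault path, using the helper names introduced later in
this series [illustrative only--the actual hook is added to
do_swap_page() in patch 5/8]:

	/*
	 * In do_swap_page(), after locking a swap cache page that has
	 * page_mapcount() == 0 and no I/O in progress:
	 */
	if (migrate_on_fault_enabled(current))
		page = check_migrate_misplaced_page(page, vma, address);
	/* map 'page', which may now be the newly migrated copy */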

Note that this mechanism could be used to migrate page cache pages that
were read in earlier, are no longer referenced, but are about to be
used by a new task on a node other than the one where the page resides.
The same mechanism can be used to pull anon pages along with a task when
the load balancer decides to move it to another node.  However, that
will require a bit more mechanism, and is the subject of another
patch series.

The kernel's direct migration facility supports most of the
mechanism that is required to implement this "migration on fault".
Some changes were needed to the migratepage op functions to behave
appropriately when called from the fault path.  Then we need to add
the function[s] to test the current page in the fault path for zero
mapping, no writeback, misplacement, ...; and the
function[s] to actually migrate the page contents to a newly
allocated page using the [modified] migratepage address space
operations of the direct migration mechanism.

This series used to include patches to migrate cached file pages and
shmem pages.  Testing with, e.g., kernel builds showed a great deal
of thrashing of page cache pages, so those patches have been removed.
I think page replication would be a better approach for shared,
read-only pages.  Nick Piggin created such a patch quite a while back
and I had integrated it with the automigration series.  Those patches
have since gone stale.

---
Lee Schermerhorn


* [PATCH/RFC 1/8] numa - Migrate-on-Fault - add Kconfig option
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
@ 2010-11-11 19:44 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 2/8] numa - Migrate-on-Fault - add cpuset control Lee Schermerhorn
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:44 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - add migrate on fault Kconfig option

Add a "MIGRATE_ON_FAULT" Kconfig suboption to "MIGRATION" to
separately enable/disable MIGRATE_ON_FAULT handling.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/Kconfig |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/mm/Kconfig
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/Kconfig
+++ linux-2.6.36-mmotm-101103-1217/mm/Kconfig
@@ -198,6 +198,18 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+# I really want to "select SWAP" here, but that resulted in a circular
+# dependency with "HIBERNATION" back in 28-rc4[-mm*]
+config MIGRATE_ON_FAULT
+	bool "Migrate unmapped page on fault"
+	depends on MIGRATION
+	depends on SWAP
+	help
+	  Allows "misplaced" pages found in a page cache at fault time to be
+	  migrated to the node specified by the applicable policy when the
+	  page is not currently mapped by any tasks.  This allows a task to
+	  pull unmapped pages closer to itself when enabled for that task.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 


* [PATCH/RFC 2/8] numa - Migrate-on-Fault - add cpuset control
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
  2010-11-11 19:44 ` [PATCH/RFC 1/8] numa - Migrate-on-Fault - add Kconfig option Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 3/8] numa - Migrate-on-Fault - check for misplaced page Lee Schermerhorn
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - add migrate on fault per cpuset control

Add a migrate_on_fault control file to cpusets.  Default is
disabled ['0'].  When set to '1', migrate-on-fault is enabled for
tasks in the cpuset once they notice the flag:  for already
running tasks, this means after they next update their cpuset
memory state.

To avoid adding #ifdefs to kernel/cpuset.c [at Paul Jackson's
request], most of the migrate-on-fault support is unconditionally
included in cpuset.c.  However, the control file will only be
created if migrate-on-fault is configured.
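
For example, a task [or its launcher] could enable the feature with
something like the following sketch.  It assumes the cpuset
filesystem is mounted at /dev/cpuset and that "mycpuset" already
exists--adjust paths to taste:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* enable migrate-on-fault for tasks in "mycpuset" */
		int fd = open("/dev/cpuset/mycpuset/migrate_on_fault",
				O_WRONLY);

		if (fd < 0)
			return 1;
		write(fd, "1", 1);	/* '1' enables, '0' disables */
		close(fd);
		return 0;
	}

Already running tasks pick up the new setting when they next update
their cpuset memory state, as noted above.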

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |   16 ++++++++++++++++
 include/linux/sched.h  |   19 +++++++++++++++++++
 kernel/cpuset.c        |   34 +++++++++++++++++++++++++++++++++-
 mm/Kconfig             |    1 +
 4 files changed, 69 insertions(+), 1 deletion(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1457,6 +1457,7 @@ struct task_struct {
 	short il_next;
 	short shared_huge_policy_enabled:1;
 	short shared_file_policy_enabled:1;
+	short migrate_on_fault_enabled:1;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -1905,6 +1906,24 @@ static int shared_file_policy_enabled(st
 {
 	return 0;
 }
+#endif
+
+#ifdef CONFIG_MIGRATE_ON_FAULT
+static inline void set_migrate_on_fault_enabled(struct task_struct *tsk,
+							int val)
+{
+	tsk->migrate_on_fault_enabled = !!val;
+}
+static inline int migrate_on_fault_enabled(struct task_struct *tsk)
+{
+	return tsk->migrate_on_fault_enabled;
+}
+#else
+static inline void set_migrate_on_fault_enabled(struct task_struct *tsk, int val) { }
+static inline int migrate_on_fault_enabled(struct task_struct *tsk)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_HOTPLUG_CPU
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -134,6 +134,7 @@ typedef enum {
 	CS_SPREAD_SLAB,
 	CS_SHARED_HUGE_POLICY,
  	CS_SHARED_FILE_POLICY,
+	CS_MIGRATE_ON_FAULT,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -182,6 +183,11 @@ static inline int is_shared_file_policy(
 	return test_bit(CS_SHARED_FILE_POLICY, &cs->flags);
 }
 
+static inline int is_migrate_on_fault(const struct cpuset *cs)
+{
+	return test_bit(CS_MIGRATE_ON_FAULT, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -341,6 +347,12 @@ static void cpuset_update_task_cpuset_fl
 		set_shared_file_policy_enabled(tsk, 1);
 	else
 		set_shared_file_policy_enabled(tsk, 0);
+
+	if (is_migrate_on_fault(cs))
+		set_migrate_on_fault_enabled(tsk, 1);
+	else
+		set_migrate_on_fault_enabled(tsk, 0);
+
 }
 
 /*
@@ -1281,7 +1293,8 @@ static int update_flag(cpuset_flagbits_t
 	cpuset_flags_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs))
 			|| (is_shared_huge_policy(cs) != is_shared_huge_policy(trialcs))
-			|| (is_shared_file_policy(cs) != is_shared_file_policy(trialcs)));
+			|| (is_shared_file_policy(cs) != is_shared_file_policy(trialcs))
+			|| (is_migrate_on_fault(cs) != is_migrate_on_fault(trialcs)));
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
@@ -1519,6 +1532,7 @@ typedef enum {
 	FILE_SPREAD_SLAB,
 	FILE_SHARED_HUGE_POLICY,
 	FILE_SHARED_FILE_POLICY,
+	FILE_MIGRATE_ON_FAULT,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1564,6 +1578,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_SHARED_FILE_POLICY:
 		retval = update_flag(CS_SHARED_FILE_POLICY, cs, val);
 		break;
+	case FILE_MIGRATE_ON_FAULT:
+		retval = update_flag(CS_MIGRATE_ON_FAULT, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1732,6 +1749,8 @@ static u64 cpuset_read_u64(struct cgroup
 		return is_shared_huge_policy(cs);
 	case FILE_SHARED_FILE_POLICY:
 		return is_shared_file_policy(cs);
+	case FILE_MIGRATE_ON_FAULT:
+		return is_migrate_on_fault(cs);
 	default:
 		BUG();
 	}
@@ -1863,6 +1882,13 @@ static struct cftype cft_shared_file_pol
 	.private = FILE_SHARED_FILE_POLICY,
 };
 
+static struct cftype cft_migrate_on_fault = {
+	.name = "migrate_on_fault",
+	.read_u64 = cpuset_read_u64,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_MIGRATE_ON_FAULT,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1879,6 +1905,10 @@ static int cpuset_populate(struct cgroup
 	err = add_shared_xxx_policy_file(cont, ss, &cft_shared_file_policy);
 	if (err < 0)
 		return err;
+	err = add_migrate_on_fault_file(cont, ss,
+						&cft_migrate_on_fault);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -1957,6 +1987,8 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_SHARED_HUGE_POLICY, &cs->flags);
 	if (is_shared_file_policy(parent))
 		set_bit(CS_SHARED_FILE_POLICY, &cs->flags);
+	if (is_migrate_on_fault(parent))
+		set_bit(CS_MIGRATE_ON_FAULT, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
Index: linux-2.6.36-mmotm-101103-1217/mm/Kconfig
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/Kconfig
+++ linux-2.6.36-mmotm-101103-1217/mm/Kconfig
@@ -204,6 +204,7 @@ config MIGRATE_ON_FAULT
 	bool "Migrate unmapped page on fault"
 	depends on MIGRATION
 	depends on SWAP
+	select CPUSETS
 	help
 	  Allows "misplaced" pages found in a page cache at fault time to be
 	  migrated to the node specified by the applicable policy when the
Index: linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/cpuset.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
@@ -148,6 +148,22 @@ static inline int add_shared_xxx_policy_
 }
 #endif
 
+#ifdef CONFIG_MIGRATE_ON_FAULT
+static inline int add_migrate_on_fault_file(struct cgroup *cg,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return cgroup_add_file(cg, ss, cft);
+}
+#else
+static inline int add_migrate_on_fault_file(struct cgroup *cont,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return 0;
+}
+#endif
+
 extern void __init cpuset_init_shared_huge_policy(int dflt);
 extern void __init cpuset_init_shared_file_policy(int dflt);
 


* [PATCH/RFC 3/8] numa - Migrate-on-Fault - check for misplaced page
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
  2010-11-11 19:44 ` [PATCH/RFC 1/8] numa - Migrate-on-Fault - add Kconfig option Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 2/8] numa - Migrate-on-Fault - add cpuset control Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 4/8] numa - Migrate-on-Fault - migrate misplaced pages Lee Schermerhorn
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - check for misplaced page

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.  Because interleaved and
non-interleaved allocations are accounted differently, the function
also returns whether or not the new node came from an interleaved
policy, if the page is misplaced.

A subsequent patch will call this function from the fault path for
stable pages with zero page_mapcount().  Because of this, I don't
want to go ahead and allocate the page, e.g., via alloc_page_vma(),
only to have to free it if the existing page turns out to be
correctly placed.  So, I just mimic the alloc_page_vma() node
computation logic--sort of.

Note:  we could use this function to implement an MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left in place merely because they reside on some node
in the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    9 ++++
 mm/mempolicy.c            |   84 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -3009,3 +3009,87 @@ struct mpol_range *get_numa_submap(struc
 	spin_unlock(&sp->lock);
 	return ranges;
 }
+
+#ifdef CONFIG_MIGRATE_ON_FAULT
+/**
+ * mpol_misplaced - check whether current page node id valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ * @newnid - [ptr to] node id to which page should be migrated
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ * If page valid in policy, return 0 -- !misplaced:  reuse current page
+ * Else
+ *     return destination nid via newnid, if !NULL
+ *     return MPOL_MIGRATE_NONINTERLEAVED for non-interleaved policy
+ *     return MPOL_MIGRATE_INTERLEAVED for interleaved policy.
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+			 unsigned long addr, int *newnid)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	int polnid = -1;
+	int ret = 0;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+
+	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
+		unsigned long pgoff;
+
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, pgoff);
+
+		if (curnid != polnid)
+			ret = MPOL_MIGRATE_INTERLEAVED;
+		goto out;
+	}
+
+	switch (pol->mode) {
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+
+	if (curnid != polnid)
+		ret = MPOL_MIGRATE_NONINTERLEAVED;
+out:
+	mpol_cond_put(pol);
+	if (ret && newnid)
+		*newnid = polnid;
+	return ret;
+}
+#endif /* CONFIG_MIGRATE_ON_FAULT */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -245,6 +245,15 @@ extern int show_numa_map(struct seq_file
 struct mpol_range;
 extern struct mpol_range *get_numa_submap(struct vm_area_struct *);
 
+#ifdef CONFIG_MIGRATE_ON_FAULT
+#define MPOL_MIGRATE_NONINTERLEAVED 1
+#define MPOL_MIGRATE_INTERLEAVED 2
+#define misplaced_is_interleaved(pol) (MPOL_MIGRATE_INTERLEAVED - 1)
+
+extern int mpol_misplaced(struct page *, struct vm_area_struct *,
+		unsigned long, int *);
+#endif
+
 #else
 
 struct mempolicy {};


* [PATCH/RFC 4/8] numa - Migrate-on-Fault - migrate misplaced pages
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (2 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 3/8] numa - Migrate-on-Fault - check for misplaced page Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 5/8] numa - Migrate-on-Fault - migrate misplaced anon pages Lee Schermerhorn
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - migrate misplaced page

This patch adds a new function migrate_misplaced_page() to mm/migrate.c
[where most of the other page migration functions live] to migrate a
misplaced page to a specified destination node.  This function will be
called from the fault path.  Because we already know the destination
node for the migration, we allocate pages directly rather than rerunning
the policy node computation in alloc_page_vma().

We want to ignore the extra page ref count when replacing the
page in its mapping in the fault path.  To accomplish this I've added
a boolean [int] argument "faulting" to the migratepage op functions.
This arg gets passed down to migrate_page_move_mapping():  0 for direct
migration, !0 for migrate-on-fault.

NOTE:  at one point I had convinced myself that ignoring the ref count
in this path was safe.  Since then, a lot of changes have been made and
I've heard it said that raising the ref count disables migration.  This
might require rework--e.g., to account for the extra ref rather than
ignoring refs.

The patch adds the function check_migrate_misplaced_page() to migrate.c
to check whether a page is "misplaced"--i.e.  on a node different
from what the policy for (vma, address) dictates.  This check
involves accessing the vma policy, so we only do this if:
   * migrate-on-fault is enabled for this task [via cpuset ctrl]
   * page has zero mapcount [no pte references]
   * page is not in writeback
   * page is up to date
   * page's mapping has a migratepage a_op [no fallback!]
If these checks are satisfied, the page will be migrated to the
"correct" node, if possible.  If migration fails for any reason,
we just use the original page.

Note that when MIGRATE_ON_FAULT is not configured, the
check_migrate_misplaced_page() function becomes a static inline
function that just returns its page argument.

Subsequent patches will hook the fault handlers [anon, and possibly
file and/or shmem] to check_migrate_misplaced_page().

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/nfs/write.c            |    2 
 include/linux/fs.h        |    8 -
 include/linux/gfp.h       |    3 
 include/linux/mempolicy.h |   23 +----
 include/linux/migrate.h   |   22 ++++-
 mm/mempolicy.c            |   21 ++++
 mm/migrate.c              |  202 ++++++++++++++++++++++++++++++++++++++++------
 7 files changed, 230 insertions(+), 51 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -67,6 +67,7 @@ enum mpol_rebind_step {
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <linux/shared_policy.h>
+#include <linux/migrate.h>
 
 struct mm_struct;
 
@@ -223,22 +224,7 @@ extern int mpol_to_str(char *buffer, int
 			int no_context);
 #endif
 
-/* Check if a vma is migratable */
-static inline int vma_migratable(struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
-		return 0;
-	/*
-	 * Migration allocates pages in the highest zone. If we cannot
-	 * do so then migration (at least from node to node) is not
-	 * possible.
-	 */
-	if (vma->vm_file &&
-		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
-								< policy_zone)
-			return 0;
-	return 1;
-}
+extern int vma_migratable(struct vm_area_struct *);
 
 struct seq_file;
 extern int show_numa_map(struct seq_file *, void *);
@@ -248,11 +234,12 @@ extern struct mpol_range *get_numa_subma
 #ifdef CONFIG_MIGRATE_ON_FAULT
 #define MPOL_MIGRATE_NONINTERLEAVED 1
 #define MPOL_MIGRATE_INTERLEAVED 2
-#define misplaced_is_interleaved(pol) (MPOL_MIGRATE_INTERLEAVED - 1)
+#define misplaced_is_interleaved(pol) ((pol) == MPOL_MIGRATE_INTERLEAVED)
 
 extern int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
-#endif
+
+#endif /* CONFIG_MIGRATE_ON_FAULT */
 
 #else
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/fs.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
@@ -608,8 +608,8 @@ struct address_space_operations {
 	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
 						void **, unsigned long *);
 	/* migrate the contents of a page to the specified target */
-	int (*migratepage) (struct address_space *,
-			struct page *, struct page *);
+	int (*migratepage) (struct address_space *, struct page *,
+			struct page *, int);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2451,8 +2451,8 @@ extern int generic_file_fsync(struct fil
 extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *);
+extern int buffer_migrate_page(struct address_space *, struct page *,
+				struct page *, int);
 #else
 #define buffer_migrate_page NULL
 #endif
Index: linux-2.6.36-mmotm-101103-1217/include/linux/gfp.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/gfp.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/gfp.h
@@ -329,11 +329,14 @@ extern struct page *alloc_page_vma(gfp_t
 			struct vm_area_struct *vma, unsigned long addr);
 struct mempolicy;
 extern struct page *alloc_page_pol(gfp_t, struct mempolicy *, pgoff_t);
+extern struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+					unsigned nid);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
 #define alloc_page_pol(gfp_mask, pol, off)  alloc_pages(gfp_mask, 0)
+#define alloc_page_interleave(gfp_mask, order, nid) alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -464,6 +464,25 @@ static void gather_stats(struct page *,
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+/*
+ * Check whether a vma is migratable
+ */
+int vma_migratable(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
+		return 0;
+	/*
+	 * Migration allocates pages in the highest zone. If we cannot
+	 * do so then migration (at least from node to node) is not
+	 * possible.
+	 */
+	if (vma->vm_file &&
+		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
+								< policy_zone)
+			return 0;
+	return 1;
+}
+
 /* Scan through pages checking if pages follow certain conditions. */
 static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end,
@@ -1901,7 +1920,7 @@ out:
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 					unsigned nid)
 {
 	struct zonelist *zl;
Index: linux-2.6.36-mmotm-101103-1217/mm/migrate.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/migrate.c
+++ linux-2.6.36-mmotm-101103-1217/mm/migrate.c
@@ -216,16 +216,28 @@ out:
 	pte_unmap_unlock(ptep, ptl);
 }
 
-/*
- * Replace the page in the mapping.
+/**
+ * migrate_page_move_mapping()  Replace the page in the mapping.
+ * @mapping - address_space in which to replace page
+ * @newpage - the replacement page
+ * @page    - page to be replaced -- key to slot in mapping
+ * @faulting - flag:  [lazy] migrating in the fault path
+ *
+ * For direct migration, [!faulting] the number of remaining references
+ * must be:
+ *   1 for anonymous pages without a mapping
+ *   2 for pages with a mapping
+ *   3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  *
- * The number of remaining references must be:
- * 1 for anonymous pages without a mapping
- * 2 for pages with a mapping
- * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ * However, if we're in the fault path, we found the page in a cache,
+ * up-to-date with mapcount == 0.  We hold the page locked.  After we
+ * locked the page, another task could have faulted, found the page in
+ * the cache and thus increased the ref.  We want to allow migrate on
+ * fault to proceed in this case, so we ignore the refs when @faulting.
+//TODO:  is this true?  Can we really ignore ref count in this case?
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, int faulting)
 {
 	int expected_count;
 	void **pslot;
@@ -240,9 +252,17 @@ static int migrate_page_move_mapping(str
 	spin_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
- 					page_index(page));
+					page_index(page));
 
-	expected_count = 2 + page_has_private(page);
+	if (!faulting)
+		expected_count = 2 + page_has_private(page);
+	else
+		expected_count = page_count(page); /* for page_freeze_refs() */
+
+	/*
+	 * there exists a window here, wherein a change in the reference
+	 * count on the page will block migration in the fault path. So be it.
+	 */
 	if (page_count(page) != expected_count ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -397,8 +417,8 @@ void migrate_page_copy(struct page *newp
  ***********************************************************/
 
 /* Always fail migration. Used for mappings that are not movable */
-int fail_migrate_page(struct address_space *mapping,
-			struct page *newpage, struct page *page)
+int fail_migrate_page(struct address_space *mapping, struct page *newpage,
+			struct page *page, int faulting)
 {
 	return -EIO;
 }
@@ -410,14 +430,14 @@ EXPORT_SYMBOL(fail_migrate_page);
  *
  * Pages are locked upon entry and exit.
  */
-int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+int migrate_page(struct address_space *mapping, struct page *newpage,
+			struct page *page, int faulting)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, faulting);
 
 	if (rc)
 		return rc;
@@ -433,18 +453,18 @@ EXPORT_SYMBOL(migrate_page);
  * if the underlying filesystem guarantees that no other references to "page"
  * exist.
  */
-int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+int buffer_migrate_page(struct address_space *mapping, struct page *newpage,
+			struct page *page, int faulting)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page);
+		return migrate_page(mapping, newpage, page, faulting);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, faulting);
 
 	if (rc)
 		return rc;
@@ -545,7 +565,7 @@ static int fallback_migrate_page(struct
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, 0);
 }
 
 /*
@@ -581,7 +601,7 @@ static int move_to_new_page(struct page
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page);
+		rc = migrate_page(mapping, newpage, page, 0);
 	else if (mapping->a_ops->migratepage)
 		/*
 		 * Most pages have a mapping and most filesystems
@@ -591,7 +611,7 @@ static int move_to_new_page(struct page
 		 * path for page migration.
 		 */
 		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page);
+						newpage, page, 0);
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
@@ -762,7 +782,7 @@ unlock:
  		/*
  		 * A page that has been migrated has all references
  		 * removed and will be freed. A page that has not been
- 		 * migrated will have kepts its references and be
+ 		 * migrated will have kept its references and be
  		 * restored.
  		 */
  		list_del(&page->lru);
@@ -1350,4 +1370,140 @@ int migrate_vmas(struct mm_struct *mm, c
  	}
  	return err;
 }
-#endif
+
+#ifdef CONFIG_MIGRATE_ON_FAULT
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.  Page is already unmapped, up to date and locked by caller.
+ * Anon pages are in the swap cache.  Page's mapping has a migratepage aop.
+ *
+ * page refs on entry/exit:  cache + fault path [+ bufs]
+ */
+struct page *migrate_misplaced_page(struct page *page,
+				 struct mm_struct *mm,
+				 int dest, int interleaved)
+{
+	struct page *oldpage = page, *newpage;
+	struct address_space *mapping = page_mapping(page);
+	struct mem_cgroup *mcg;
+	unsigned int gfp;
+	int rc = 0;
+	int charge = -ENOMEM;	/* in case alloc_*() fails */
+
+//TODO:  explicit assertions during debug/testing.  remove later?
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(page_mapcount(page));
+	VM_BUG_ON(PageAnon(page) && !PageSwapCache(page));
+	VM_BUG_ON(!mapping || !mapping->a_ops->migratepage);
+
+	/*
+	 * remove old page from LRU so it can't be found while migrating
+	 * except through the cache by other faulting tasks, which will
+	 * block behind my lock.
+	 */
+	if (isolate_lru_page(page))	/* incrs page count on success */
+		goto out_nolru;	/* we lost */
+
+	/*
+	 * Never wait for allocations just to migrate on fault,
+	 * but don't dip into reserves.
+	 * And, only accept pages from specified node.
+	 * No sense migrating to a different "misplaced" page!
+	 */
+	gfp = (unsigned int)mapping_gfp_mask(mapping) & ~__GFP_WAIT;
+	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+	if (interleaved)
+		newpage = alloc_page_interleave(gfp, 0, dest);
+	else
+		newpage = alloc_pages_node(dest, gfp, 0);
+
+	if (!newpage)
+		goto out;	/* give up */
+
+	/*
+	 * can't just lock_page() -- "might sleep" in atomic context
+	 */
+	if (!trylock_page(newpage))
+		BUG();		/* new page should be unlocked!!! */
+
+// TODO:  are we in correct state to do this?  when called from do_swap_page()
+//        we have pending charge on this page.  Revisit when memcontrol settles
+//        down.
+	charge = mem_cgroup_prepare_migration(page, newpage, &mcg);
+	if (charge == -ENOMEM) {
+		rc = charge;
+		goto out;
+	}
+
+	newpage->index = page->index;
+	newpage->mapping = page->mapping;
+	if (PageSwapBacked(page))		/* like move_to_new_page() */
+		SetPageSwapBacked(newpage);
+
+	/*
+	 * migrate a_op transfers cache [+ buf] refs
+	 */
+	rc = mapping->a_ops->migratepage(mapping, newpage,
+						 page, 1);
+	if (!rc) {
+		get_page(newpage);	/* add isolate_lru_page ref */
+		put_page(page);		/* drop       "          "  */
+
+		unlock_page(page);
+		put_page(page);		/* drop fault path ref & free */
+
+		page = newpage;
+	}
+
+out:
+	if (!charge)
+		mem_cgroup_end_migration(mcg, oldpage, newpage);
+
+	if (rc) {
+		unlock_page(newpage);
+		__free_page(newpage);
+	}
+
+	putback_lru_page(page);		/* ultimately, drops a page ref */
+
+out_nolru:
+	return page;			/* locked, to complete fault */
+
+}
+
+/*
+ * Called in fault path, if migrate_on_fault_enabled(current) for a page
+ * found in the cache, page is locked, and page_mapping(page) != NULL;
+ * We check for page uptodate here because we want to be able to do any
+ * needed migration before grabbing the page table lock.  In the anon fault
+ * path, PageUptodate() isn't checked until after locking the page table.
+ *
+ * For migrate on fault, we only migrate pages whose mapping has a
+ * migratepage op.  The fallback path requires writing out the page and
+ * reading it back in.  That sort of defeats the purpose of
+ * migrate-on-fault [performance].  So, we don't even bother to check
+ * for misplacement unless the op is present.  Of course, this is an extra
+ * check in the fault path for pages we care about :-(
+ */
+struct page *check_migrate_misplaced_page(struct page *page,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	int polnid, misplaced;
+
+	if (page_mapcount(page) || PageWriteback(page) ||
+			unlikely(!PageUptodate(page))  ||
+			!page_mapping(page)->a_ops->migratepage)
+		return page;
+
+	misplaced = mpol_misplaced(page, vma, address, &polnid);
+	if (!misplaced)
+		return page;
+
+	return migrate_misplaced_page(page, vma->vm_mm, polnid,
+			misplaced_is_interleaved(misplaced));
+
+}
+
+#endif /* CONFIG_MIGRATE_ON_FAULT */
+#endif /* CONFIG_NUMA */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/migrate.h
@@ -10,15 +10,15 @@ typedef struct page *new_page_t(struct p
 #define PAGE_MIGRATION 1
 
 extern void putback_lru_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
-			struct page *, struct page *);
+extern int migrate_page(struct address_space *, struct page *,
+				struct page *, int);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, int offlining);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
 			unsigned long private, int offlining);
 
-extern int fail_migrate_page(struct address_space *,
-			struct page *, struct page *);
+extern int fail_migrate_page(struct address_space *, struct page *,
+				struct page *, int);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
@@ -28,6 +28,20 @@ extern int migrate_vmas(struct mm_struct
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+
+#ifdef CONFIG_MIGRATE_ON_FAULT
+extern struct page *check_migrate_misplaced_page(struct page *,
+			struct vm_area_struct *, unsigned long);
+extern struct page *migrate_misplaced_page(struct page *, struct mm_struct *,
+			int, int);
+#else
+static inline struct page *check_migrate_misplaced_page(struct page *page,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	return page;
+}
+
+#endif
 #else
 #define PAGE_MIGRATION 0
 
Index: linux-2.6.36-mmotm-101103-1217/fs/nfs/write.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/nfs/write.c
+++ linux-2.6.36-mmotm-101103-1217/fs/nfs/write.c
@@ -1562,7 +1562,7 @@ int nfs_migrate_page(struct address_spac
 	if (IS_ERR(req))
 		goto out;
 
-	ret = migrate_page(mapping, newpage, page);
+	ret = migrate_page(mapping, newpage, page, 0);
 	if (!req)
 		goto out;
 	if (ret)


* [PATCH/RFC 5/8] numa - Migrate-on-Fault - migrate misplaced anon pages
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (3 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 4/8] numa - Migrate-on-Fault - migrate misplaced pages Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 6/8] numa - Migrate-on-Fault - add mbind() MPOL_MF_LAZY flag Lee Schermerhorn
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault  - handle misplaced anon pages

This patch simply hooks the anon page fault handler [do_swap_page()]
to check for and migrate misplaced pages if enabled and the page won't
be "COWed".  It also teaches shmem_getpage() and try_to_unuse() to
retry the lookup when a page has been replaced by lazy migration
[i.e., is no longer in the swap cache].

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/memory.c   |   27 +++++++++++++++++++++++++++
 mm/shmem.c    |   10 ++++++++++
 mm/swapfile.c |    9 +++++++++
 3 files changed, 46 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/mm/memory.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/memory.c
+++ linux-2.6.36-mmotm-101103-1217/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/mempolicy.h>	/* check_migrate_misplaced_page() */
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2632,6 +2633,7 @@ static int do_swap_page(struct mm_struct
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
+	int can_reuse;
 
 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
 		goto out;
@@ -2716,6 +2718,31 @@ static int do_swap_page(struct mm_struct
 	}
 
 	/*
+	 * No sense in migrating a page that will be "COWed", as the
+	 * new page will be allocated according to effective mempolicy.
+	 */
+	can_reuse = (flags & FAULT_FLAG_WRITE) && reuse_swap_page(page);
+	if (can_reuse && migrate_on_fault_enabled(current)) {
+		if (!PageSwapCache(page)) {
+			/*
+			 *  migrate-on-fault has occurred - retry access
+			 */
+			mem_cgroup_cancel_charge_swapin(ptr);
+			goto out_page;
+		}
+
+		/*
+		 * check for misplacement and migrate, if necessary/possible,
+		 * here and now.  Note that if we're racing with another thread,
+		 * we may end up discarding the migrated page after locking
+		 * the page table and checking the pte below.  However, we
+		 * don't want to hold the page table locked over migration, so
+		 * we'll live with that [unlikely, one hopes] possibility.
+		 */
+		page = check_migrate_misplaced_page(page, vma, address);
+	}
+
+	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
Index: linux-2.6.36-mmotm-101103-1217/mm/swapfile.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/swapfile.c
+++ linux-2.6.36-mmotm-101103-1217/mm/swapfile.c
@@ -1143,6 +1143,7 @@ static int try_to_unuse(unsigned int typ
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
+	again:
 		pol = get_vma_policy(current, NULL, 0);
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
 							 pol, i);
@@ -1179,6 +1180,14 @@ static int try_to_unuse(unsigned int typ
 		wait_on_page_locked(page);
 		wait_on_page_writeback(page);
 		lock_page(page);
+		if (!PageSwapCache(page)) {
+			/*
+			 * Lazy page migration has occurred
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
 		wait_on_page_writeback(page);
 
 		/*
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1304,6 +1304,16 @@ repeat:
 			page_cache_release(swappage);
 			goto repeat;
 		}
+		if (!PageSwapCache(swappage)) {
+			/*
+			 * Lazy page migration has occurred
+			 */
+			shmem_swp_unmap(entry);
+			spin_unlock(&info->lock);
+			unlock_page(swappage);
+			page_cache_release(swappage);
+			goto repeat;
+		}
 		if (PageWriteback(swappage)) {
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);


* [PATCH/RFC 6/8] numa - Migrate-on-Fault - add mbind() MPOL_MF_LAZY flag
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 5/8] numa - Migrate-on-Fault - migrate misplaced anon pages Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 7/8] numa - Migrate-on-Fault - mbind() NOOP policy Lee Schermerhorn
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - add MPOL_MF_LAZY ...

This patch adds another mbind() flag to request "lazy migration".
The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are simply unmapped from the calling task's page table ['_MOVE]
or from all referencing page tables [_MOVE_ALL].  Anon pages will first
be added to the swap [or migration?] cache, if necessary.  The pages
will be migrated in the fault path on "first touch", if the policy
dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After unmap, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
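
For illustration, a hypothetical helper that a worker thread might
use to request lazy migration of its region [a sketch only;
MPOL_MF_LAZY is introduced by this patch and won't be in installed
headers yet, hence the fallback define]:

	#include <numaif.h>	/* mbind(), MPOL_BIND, MPOL_MF_MOVE */

	#ifndef MPOL_MF_LAZY
	#define MPOL_MF_LAZY	(1<<3)	/* from this patch */
	#endif

	/*
	 * Unmap the pages backing [addr, addr+len) now; each page
	 * will be migrated to 'nid' on first touch rather than all
	 * at once.  Assumes nid < BITS_PER_LONG for this sketch.
	 */
	static void rebind_region_lazily(void *addr, unsigned long len,
					 int nid)
	{
		unsigned long nodemask = 1UL << nid;

		mbind(addr, len, MPOL_BIND, &nodemask,
		      8 * sizeof(nodemask), MPOL_MF_MOVE | MPOL_MF_LAZY);
	}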

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |   15 ++++++-
 include/linux/migrate.h   |    6 +++
 include/linux/rmap.h      |    6 ++-
 mm/mempolicy.c            |   19 ++++++---
 mm/migrate.c              |   92 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c                 |    7 ++-
 6 files changed, 129 insertions(+), 16 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -47,9 +47,18 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#ifndef CONFIG_MIGRATE_ON_FAULT
+#define MPOL_MF_VALID (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)
+#else
+#define MPOL_MF_VALID (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL \
+			| MPOL_MF_LAZY)
+#endif
 
 /*
  * Internal flags that share the struct mempolicy flags word with
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1220,8 +1220,7 @@ static long do_mbind(unsigned long start
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1280,20 +1279,26 @@ static long do_mbind(unsigned long start
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new, flags);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+#ifdef CONFIG_MIGRATE_ON_FAULT
+			if (flags & MPOL_MF_LAZY)
+				nr_failed = migrate_pages_unmap_only(&pagelist);
+			else
+#endif
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma, 0);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
Index: linux-2.6.36-mmotm-101103-1217/include/linux/migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/migrate.h
@@ -30,6 +30,8 @@ extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 
 #ifdef CONFIG_MIGRATE_ON_FAULT
+extern int migrate_pages_unmap_only(struct list_head *);
+
 extern struct page *check_migrate_misplaced_page(struct page *,
 			struct vm_area_struct *, unsigned long);
 extern struct page *migrate_misplaced_page(struct page *, struct mm_struct *,
@@ -75,4 +77,8 @@ static inline int migrate_huge_page_move
 #define fail_migrate_page NULL
 
 #endif /* CONFIG_MIGRATION */
+
+#ifndef CONFIG_MIGRATE_ON_FAULT
+#define migrate_pages_unmap_only(L) (0)
+#endif
 #endif /* _LINUX_MIGRATE_H */
Index: linux-2.6.36-mmotm-101103-1217/mm/migrate.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/migrate.c
+++ linux-2.6.36-mmotm-101103-1217/mm/migrate.c
@@ -756,7 +756,8 @@ static int unmap_and_move(new_page_t get
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page,
+	             TTU_MIGRATE_DIRECT|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -894,6 +895,95 @@ out:
 	return rc;
 }
 
+#ifdef CONFIG_MIGRATE_ON_FAULT
+/*
+ * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
+ * necessary.  Migration will occur, if policy dictates, when a task faults
+ * an unmapped page back into its page table--i.e., on "first touch" after
+ * unmapping.  Note that migrate-on-fault only migrates pages whose mapping
+ * [e.g., file system] supplies a migratepage op, so we skip pages that
+ * wouldn't migrate on fault.
+ *
+ * Pages are placed back on the lru whether or not they were successfully
+ * unmapped.  Like migrate_pages().
+ *
+ * Unlike migrate_pages(), this function is only called in the context of
+ * a task that is unmapping its own pages while holding its mmap semaphore
+ * for write.
+ */
+int migrate_pages_unmap_only(struct list_head *pagelist)
+{
+	struct page *page;
+	struct page *page2;
+	int nr_failed = 0;
+	int nr_unmapped = 0;
+
+	list_for_each_entry_safe(page, page2, pagelist, lru) {
+		int ret;
+
+		/*
+		 * TODO:  cond_resched() here like migrate_pages()?
+		 */
+
+		list_del(&page->lru);
+		/*
+		 * Give up easily.  We ARE being lazy.
+		 */
+		if (page_count(page) == 1 || !trylock_page(page))
+			goto done_with_page;
+
+		if (PageKsm(page) || PageWriteback(page))
+			goto unlock_page;
+
+		/*
+		 * see comments in unmap_and_move()
+		 */
+		if (!page->mapping)
+			goto unlock_page;
+
+		if (PageAnon(page)) {
+			if (!PageSwapCache(page) && !add_to_swap(page)) {
+				nr_failed++;
+				goto unlock_page;
+			}
+		} else {
+			struct address_space *mapping = page_mapping(page);
+			BUG_ON(!mapping);
+			if (!mapping->a_ops->migratepage)
+				goto unlock_page;
+		}
+
+		ret = try_to_unmap(page,
+	             TTU_MIGRATE_DEFERRED|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		if (ret != SWAP_SUCCESS || page_mapped(page)) {
+			nr_failed++;
+			goto unlock_page;
+		}
+
+		nr_unmapped++;
+
+	unlock_page:
+		unlock_page(page);
+	done_with_page:
+		putback_lru_page(page);
+
+	}
+
+	/*
+	 * Drain local per cpu pagevecs so the fault path can find the pages
+	 * on the lru.  If we got migrated during the loop above, we may
+	 * have left pages cached on other cpus.  But, we'll live with that
+	 * here to avoid lru_add_drain_all().
+	 * TODO:  mechanism to drain on only those cpus we've been
+	 *        scheduled on between two points--e.g., during the loop.
+	 */
+	if (nr_unmapped)
+		lru_add_drain();
+
+	return nr_failed;
+}
+#endif /* CONFIG_MIGRATE_ON_FAULT */
+
 /*
  * migrate_pages
  *
Index: linux-2.6.36-mmotm-101103-1217/include/linux/rmap.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/rmap.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/rmap.h
@@ -188,8 +188,9 @@ int page_referenced_one(struct page *, s
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
-	TTU_MIGRATION = 1,		/* migration mode */
-	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_MIGRATE_DIRECT = 1,		/* direct migration mode */
+	TTU_MIGRATE_DEFERRED = 2,	/* deferred [lazy] migration mode */
+	TTU_MUNLOCK = 4,		/* munlock mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
@@ -197,6 +198,7 @@ enum ttu_flags {
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+#define TTU_MIGRATION (TTU_MIGRATE_DIRECT | TTU_MIGRATE_DEFERRED)
 
 int try_to_unmap(struct page *, enum ttu_flags flags);
 int try_to_unmap_one(struct page *, struct vm_area_struct *,
Index: linux-2.6.36-mmotm-101103-1217/mm/rmap.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/rmap.c
+++ linux-2.6.36-mmotm-101103-1217/mm/rmap.c
@@ -1043,12 +1043,13 @@ int try_to_unmap_one(struct page *page,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATE_DIRECT);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
+	} else if (PAGE_MIGRATION &&
+		         (TTU_ACTION(flags) == TTU_MIGRATE_DIRECT)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -1254,7 +1255,7 @@ static int try_to_unmap_anon(struct page
 		 * locking requirements of exec(), migration skips
 		 * temporary VMAs until after exec() completes.
 		 */
-		if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+		if (PAGE_MIGRATION && (flags & TTU_MIGRATE_DIRECT) &&
 				is_vma_temporary_stack(vma))
 			continue;
 


* [PATCH/RFC 7/8] numa - Migrate-on-Fault - mbind() NOOP policy
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 6/8] numa - Migrate-on-Fault - add mbind() MPOL_MF_LAZY flag Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-11 19:45 ` [PATCH/RFC 8/8] numa - Migrate-on-Fault - add statistics Lee Schermerhorn
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Migrate-on-fault - add MPOL_MF_NOOP

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
policy to mbind().  When the NOOP policy is used with the 'MOVE
and 'LAZY flags, mbind() [check_range()] will walk the specified
range and unmap eligible pages so that they will be migrated on
next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
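
In other words, something like the following sketch would prepare a
range for "migrate on next touch" under whatever policy already
applies [MPOL_NOOP and MPOL_MF_LAZY come from this series and won't
be in installed headers yet, hence the fallback defines]:

	#include <numaif.h>	/* mbind(), MPOL_MF_MOVE */

	#ifndef MPOL_NOOP
	#define MPOL_NOOP	4	/* from this patch */
	#endif
	#ifndef MPOL_MF_LAZY
	#define MPOL_MF_LAZY	(1<<3)	/* from patch 6/8 */
	#endif

	static void migrate_on_next_touch(void *addr, unsigned long len)
	{
		/* no policy change:  just unmap; migrate on next touch */
		mbind(addr, len, MPOL_NOOP, NULL, 0,
		      MPOL_MF_MOVE | MPOL_MF_LAZY);
	}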

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |    8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -254,10 +254,10 @@ static struct mempolicy *mpol_new(unsign
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1228,7 +1228,7 @@ static long do_mbind(unsigned long start
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1280,7 +1280,7 @@ static long do_mbind(unsigned long start
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma))
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new, flags);
 
 	if (!err) {

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH/RFC 8/8] numa - Migrate-on-Fault - add statistics
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (6 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 7/8] numa - Migrate-on-Fault - mbind() NOOP policy Lee Schermerhorn
@ 2010-11-11 19:45 ` Lee Schermerhorn
  2010-11-14  6:37 ` [PATCH/RFC 0/8] numa - Migrate-on-Fault KOSAKI Motohiro
  2010-11-17 17:10 ` Avi Kivity
  9 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:45 UTC (permalink / raw)
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

PATCH Migrate-on-fault 8/8 - add statistics

Count migrate-on-fault events:

+ pglocchecked -- page location checked for misplacement
+ pgmisplaced  -- page misplaced -- location != policy
+ pgmigrated   -- page successfully migrated

Note:  currently, pgmigrated is only counted when migrate-on-fault
is configured.  However, it will count all successful migrations,
including "direct" migrations.  This should be promoted to depend
only on CONFIG_MIGRATE.
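
A quick way to watch these counters from userspace -- a minimal
sketch; the strings match the vmstat_text[] entries added below and
only appear in /proc/vmstat when CONFIG_MIGRATE_ON_FAULT is enabled:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[128];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "pglocchecked", 12) ||
			    !strncmp(line, "pgmisplaced", 11) ||
			    !strncmp(line, "pgmigrated", 10))
				fputs(line, stdout);	/* "name value" lines */
		fclose(f);
		return 0;
	}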

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/vmstat.h |    3 +++
 mm/migrate.c           |   11 +++++++++++
 mm/vmstat.c            |    5 +++++
 3 files changed, 19 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/vmstat.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/vmstat.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/vmstat.h
@@ -58,6 +58,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		UNEVICTABLE_PGCLEARED,	/* on COW, page truncate */
 		UNEVICTABLE_PGSTRANDED,	/* unable to isolate on unlock */
 		UNEVICTABLE_MLOCKFREED,
+#ifdef CONFIG_MIGRATE_ON_FAULT
+		PGLOCCHECK, PGMISPLACED, PGMIGRATED,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux-2.6.36-mmotm-101103-1217/mm/vmstat.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/vmstat.c
+++ linux-2.6.36-mmotm-101103-1217/mm/vmstat.c
@@ -864,6 +864,11 @@ static const char * const vmstat_text[]
 	"unevictable_pgs_cleared",
 	"unevictable_pgs_stranded",
 	"unevictable_pgs_mlockfreed",
+#ifdef CONFIG_MIGRATE_ON_FAULT
+	"pglocchecked",
+	"pgmisplaced",
+	"pgmigrated",
+#endif
 #endif
 };
 
Index: linux-2.6.36-mmotm-101103-1217/mm/migrate.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/migrate.c
+++ linux-2.6.36-mmotm-101103-1217/mm/migrate.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/hugetlb.h>
 #include <linux/gfp.h>
+#include <linux/vmstat.h>
 
 #include "internal.h"
 
@@ -410,6 +411,14 @@ void migrate_page_copy(struct page *newp
 	 */
 	if (PageWriteback(newpage))
 		end_page_writeback(newpage);
+
+	/*
+	 * all successful migrations come through here.
+	 */
+#ifdef CONFIG_MIGRATE_ON_FAULT
+//TODO:  promote statistics to CONFIG_MIGRATE?
+	count_vm_event(PGMIGRATED);
+#endif
 }
 
 /************************************************************
@@ -1586,10 +1595,12 @@ struct page *check_migrate_misplaced_pag
 			!page_mapping(page)->a_ops->migratepage)
 		return page;
 
+	count_vm_event(PGLOCCHECK);
 	misplaced = mpol_misplaced(page, vma, address, &polnid);
 	if (!misplaced)
 		return page;
 
+	count_vm_event(PGMISPLACED);
 	return migrate_misplaced_page(page, vma->vm_mm, polnid,
 			misplaced_is_interleaved(misplaced));
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (7 preceding siblings ...)
  2010-11-11 19:45 ` [PATCH/RFC 8/8] numa - Migrate-on-Fault - add statistics Lee Schermerhorn
@ 2010-11-14  6:37 ` KOSAKI Motohiro
  2010-11-15 14:13   ` Christoph Lameter
  2010-11-17 17:10 ` Avi Kivity
  9 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  6:37 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-numa, akpm, Mel Gorman, cl, Nick Piggin,
	Hugh Dickins, andi, David Rientjes, Avi Kivity, Andrea Arcangeli

> This series of patches implements page migration in the fault path.
> 
> !!! N.B., Need to consider interaction with KSM and Transparent Huge
> !!! Pages.
> 
> The basic idea is that when a fault handler such as do_swap_page()
> finds a cached page with zero mappings that is otherwise "stable"--
> e.g., no I/O in progress--this is a good opportunity to check whether the
> page resides on the node indicated by the mempolicy in the current context.
> 
> We only attempt to migrate when there are zero mappings because 1) we can
> easily migrate the page--don't have to go through the effort of removing
> all mappings and 2) default policy--a common case--can give different
> answers from different tasks running on different nodes.  Checking the
> policy when there are zero mappings effectively implements a "first touch"
> placement policy.
> 
> Note that this mechanism could be used to migrate page cache pages that
> were read in earlier, are no longer referenced, but are about to be
> used by a new task on another node from where the page resides.  The
> same mechanism can be used to pull anon pages along with a task when
> the load balancer decides to move it to another node.  However, that
> will require a bit more mechanism, and is the subject of another
> patch series.
> 
> The kernel's direct migration facility supports most of the
> mechanism that is required to implement this "migration on fault".
> Some changes were needed to the migratepage op functions to behave
> appropriately when called from the fault path.  Then we need to add
> the function[s] to test the current page in the fault path for zero
> mapping, no writebacks, misplacement, ...; and the
> function[s] to actually migrate the page contents to a newly
> allocated page using the [modified] migratepage address space
> operations of the direct migration mechanism.
> 
> This series used to include patches to migrate cached file pages and
> shmem pages.  Testing with, e.g., kernel builds, showed a great deal
> of thrashing of page cache pages, so those patches have been removed.
> I think page replication would be a better approach for shared,
> read-only pages.  Nick Piggin created such a patch quite a while back
> and I had integrated it with the automigration series.  Those patches have
> since gone stale.

Nice!

This is very useful for the HPC area, and Solaris already has a similar
feature (madvise(MADV_ACCESS_LWP)).  I believe it is useful even though
it has not been integrated with KSM and THP.

Side note: Another similar trial is here.
	http://lwn.net/Articles/332754/

And I easily found some background theory papers with quick googling.
	http://www.compunity.org/events/pastevents/ewomp2004/loef_holmgren_pap_ew04.pdf


Lee, can you please update Documentation/?  Especially MPOL_MF_LAZY and
MPOL_NOOP need to be documented, I think.





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-14  6:37 ` [PATCH/RFC 0/8] numa - Migrate-on-Fault KOSAKI Motohiro
@ 2010-11-15 14:13   ` Christoph Lameter
  2010-11-15 14:21       ` Andi Kleen
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Christoph Lameter @ 2010-11-15 14:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, linux-numa, akpm, Mel Gorman, Nick Piggin,
	Hugh Dickins, andi, David Rientjes, Avi Kivity, Andrea Arcangeli

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> Nice!

Let's not get overenthused. There has been no conclusive proof that the
overhead introduced by automatic migration schemes is consistently less
than the benefit obtained by moving the data. Quite to the contrary. We
have over a decade's worth of research and attempts on this issue and
there was no general improvement to be had that way.

The reason that the manual placement interfaces exist is that there was
no generally beneficial migration scheme available. The manual interfaces
allow the writing of various automatic migration schemes in user space.

If we can come up with something that is an improvement then let's go
this way, but I am skeptical.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-15 14:13   ` Christoph Lameter
@ 2010-11-15 14:21       ` Andi Kleen
  2010-11-15 14:33     ` Andrea Arcangeli
  2010-11-16  4:54       ` KOSAKI Motohiro
  2 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2010-11-15 14:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Lee Schermerhorn, linux-numa, akpm, Mel Gorman,
	Nick Piggin, Hugh Dickins, andi, David Rientjes, Avi Kivity,
	Andrea Arcangeli, linux-mm

[Adding linux-mm where this should have been in the first place]

On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> 
> > Nice!
> 
> Let's not get overenthused. There has been no conclusive proof that the
> overhead introduced by automatic migration schemes is consistently less
> than the benefit obtained by moving the data. Quite to the contrary. We
> have over a decade's worth of research and attempts on this issue and
> there was no general improvement to be had that way.

I agree it's not a good idea to enable this by default because
the cost of doing it wrong is too severe. But I suspect
it's a good idea to have it optionally available for various workloads.

Good candidates so far:

- Virtualization with KVM (I think it's very promising for that).
Basically this allows keeping guests local on nodes with their
own NUMA policy without having to statically bind them.

- Some HPC workloads. There were various older reports that 
it helped there.

So basically I think automatic migration would be good to have as
another option to enable in numactl.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-15 14:13   ` Christoph Lameter
  2010-11-15 14:21       ` Andi Kleen
@ 2010-11-15 14:33     ` Andrea Arcangeli
  2010-11-17 17:03       ` Lee Schermerhorn
  2010-11-16  4:54       ` KOSAKI Motohiro
  2 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 14:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Lee Schermerhorn, linux-numa, akpm, Mel Gorman,
	Nick Piggin, Hugh Dickins, andi, David Rientjes, Avi Kivity

Hi everyone,

On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> 
> > Nice!
> 
> Let's not get overenthused. There has been no conclusive proof that the
> overhead introduced by automatic migration schemes is consistently less
> than the benefit obtained by moving the data. Quite to the contrary. We
> have over a decade's worth of research and attempts on this issue and
> there was no general improvement to be had that way.
> 
> The reason that the manual placement interfaces exist is that there was
> no generally beneficial migration scheme available. The manual interfaces
> allow the writing of various automatic migration schemes in user space.
> 
> If we can come up with something that is an improvement then let's go
> this way, but I am skeptical.

I generally find the patchset very interesting, but I think like
Christoph does.

It's good to give the patchset more visibility as it's quite unique in
this area, but when talking with Lee I also thought the synchronous
migrate-on-fault was probably too aggressive, and I like an algorithm
where memory follows cpus and cpus follow memory in a totally dynamic
way.

I suggested to Lee during our chat (and also to others during
KS+Plumbers) that we need a more dynamic algorithm that works in the
background asynchronously. Specifically, I want the cpu to follow
memory closely whenever idle status allows it (changing cpu at context
switch is cheap; I don't like pinning or a "single" home node concept)
and then memory to slowly follow cpu in tandem, in the background, via
a kernel thread. With both the cpu following memory fast and memory
following cpu slowly, things should converge over time on optimal
behavior. I like the migration done from a kthread like
khugepaged/ksmd, not synchronously adding latency to the page fault
(or having to take down ptes to trigger the migrate-on-fault;
migration need never require the app to exit to the kernel and take a
fault just to migrate, it happens transparently as far as userland is
concerned, well, of course, unless it trips on the migration pte at
just the wrong time :).

So the patchset looks very interesting, and it may actually be optimal
for some slower hardware, but I have the perception that these days
memory being remote isn't as big a deal as not keeping both memory
controllers in action simultaneously (using just one controller is
worse than using both simultaneously from the wrong end; locality is
not as important as not stepping on each other's toes). So in general
synchronous migrate-on-fault seems a bit too aggressive to me and not
ideal for newer hardware. Still, this is one of the most interesting
patchsets in this area I've seen so far.

The homenode logic, ironically, may be optimal for the most important
bench, because the way that bench is set up all VMs are fairly small
and there are plenty of them, so it'll never happen that a VM has more
memory than fits in the ram of a single node. But I like a dynamic
approach that works best in all environments, even if it's clearly not
as simple and maybe not as optimal in the one relevant benchmark we
care about. I'm unsure what the homenode logic is supposed to decide
when the task has two, three, four times the ram that fits in a single
node (and that may not be such an uncommon scenario after all). I
admit not having read enough on this homenode logic, but I never got
any attraction to it personally, as there should never be any single
"home" for any task in my view.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-15 14:21       ` Andi Kleen
  (?)
@ 2010-11-15 14:37       ` Andrea Arcangeli
  -1 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 14:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, KOSAKI Motohiro, Lee Schermerhorn, linux-numa,
	akpm, Mel Gorman, Nick Piggin, Hugh Dickins, David Rientjes,
	Avi Kivity, linux-mm

On Mon, Nov 15, 2010 at 03:21:22PM +0100, Andi Kleen wrote:
> - Virtualization with KVM (I think it's very promising for  that)
> Basically this allows to keep guests local on nodes with their
> own NUMA policy without having to statically bind them.

Confirmed: KVM virtualization needs automatic migration (we need the cpu
to follow memory in a smart way too). Hard bindings are not ok, just as
hugetlbfs is not ok, as VMs are moved across nodes, swapped, and merged
with ksm, and things must work out automatically without admin
intervention.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-15 14:13   ` Christoph Lameter
@ 2010-11-16  4:54       ` KOSAKI Motohiro
  2010-11-15 14:33     ` Andrea Arcangeli
  2010-11-16  4:54       ` KOSAKI Motohiro
  2 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2010-11-16  4:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Lee Schermerhorn, linux-numa, akpm, Mel Gorman,
	Nick Piggin, Hugh Dickins, andi, David Rientjes, Avi Kivity,
	Andrea Arcangeli, linux-mm

> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> 
> > Nice!
> 
> Let's not get overenthused. There has been no conclusive proof that the
> overhead introduced by automatic migration schemes is consistently less
> than the benefit obtained by moving the data. Quite to the contrary. We
> have over a decade's worth of research and attempts on this issue and
> there was no general improvement to be had that way.
> 
> The reason that the manual placement interfaces exist is that there was
> no generally beneficial migration scheme available. The manual interfaces
> allow the writing of various automatic migration schemes in user space.
> 
> If we can come up with something that is an improvement then let's go
> this way, but I am skeptical.

Ah, I thought this series only had manual migration (i.e. MPOL_MF_LAZY),
but it also has automatic migration if a page is not mapped. So my
standpoint is: manual lazy migration certainly has a use case, but I
have no opinion on the automatic one.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-16  4:54       ` KOSAKI Motohiro
  (?)
@ 2010-11-17 14:45       ` Lee Schermerhorn
  -1 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-17 14:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, linux-numa, akpm, Mel Gorman, Nick Piggin,
	Hugh Dickins, andi, David Rientjes, Avi Kivity, Andrea Arcangeli,
	linux-mm

On Tue, 2010-11-16 at 13:54 +0900, KOSAKI Motohiro wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> > 
> > > Nice!
> > 
> > Let's not get overenthused. There has been no conclusive proof that the
> > overhead introduced by automatic migration schemes is consistently less
> > than the benefit obtained by moving the data. Quite to the contrary. We
> > have over a decade's worth of research and attempts on this issue and
> > there was no general improvement to be had that way.
> > 
> > The reason that the manual placement interfaces exist is that there was
> > no generally beneficial migration scheme available. The manual interfaces
> > allow the writing of various automatic migration schemes in user space.
> > 
> > If we can come up with something that is an improvement then let's go
> > this way, but I am skeptical.
> 
> Ah, I thought this series only had manual migration (i.e. MPOL_MF_LAZY),
> but it also has automatic migration if a page is not mapped. So my
> standpoint is: manual lazy migration certainly has a use case, but I
> have no opinion on the automatic one.
> 

Hello, Kosaki-san:

Yes the focus of the series is adding the lazy [on fault] + automatic
[on internode migration] migration.  The kernel has had "manual
migration" via the migrate_pages() and move_pages() sys calls and
inter-cpuset migration.  The idea here is to let the scheduler have its
way, load balancing without much consideration of numa footprint, and
try to restore locality after an internode migration by fetching just
the [anon] pages that the task actually references while executing on a
given node.  I added the per task /proc/<pid>/migrate control to force a
task to simulate internode migration and perform a direct migration or
[default] unmap to allow lazy migration of anonymous pages controlled by
local mempolicy.

You can use/test the "manual" migrate control without enabling the
automigration feature.  You'll need to enable migrate_on_fault in the
cpuset that contains the task[s] to be tested or the task will unmap
[replace ptes with swap/migration cache ptes] and remap [replace cache
ptes with real ptes] without migration.  Or, you could disable
'automigrate_lazy' and use direct migration.  Of course, you can enable
and test automigration as well :).

However, unfortunately, you'll need to use the mainline version plus
the mmotm used in the patches.  I recently rebased to the 1109 mmotm
on 37-rc1 and the very first lazy migration fault hits a null pointer
dereference in swap_cgroup_record().  This is how it has gone with
these patches over
the past couple of years.  Seems to have some bad interaction with
memory cgroups in every other mmotm.  Sometimes I need to adjust my
code, other times I just wait and it gets fixed in mainline.  I'll
probably just wait a couple of mmotms this time.

More in response to Andrea's mail...

Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-15 14:33     ` Andrea Arcangeli
@ 2010-11-17 17:03       ` Lee Schermerhorn
  2010-11-17 21:27         ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-17 17:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, KOSAKI Motohiro, linux-numa, akpm, Mel Gorman,
	Nick Piggin, Hugh Dickins, andi, David Rientjes, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 8445 bytes --]

On Mon, 2010-11-15 at 15:33 +0100, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> > 
> > > Nice!
> > 
> > Let's not get overenthused. There has been no conclusive proof that the
> > overhead introduced by automatic migration schemes is consistently less
> > than the benefit obtained by moving the data. Quite to the contrary. We
> > have over a decade's worth of research and attempts on this issue and
> > there was no general improvement to be had that way.
> > 
> > The reason that the manual placement interfaces exist is that there was
> > no generally beneficial migration scheme available. The manual interfaces
> > allow the writing of various automatic migration schemes in user space.
> > 
> > If we can come up with something that is an improvement then let's go
> > this way, but I am skeptical.
> 
> I generally find the patchset very interesting but I think like
> Christoph.

Christoph is correct that we have no concrete data on modern processors
for these patch sets.  I did present some results back in '07 from a
4-node, 16-processor ia64 server.  The slides from that presentation are
here:
http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf

Slide 18 shows the effects on stream benchmark execution [per pass] of
restoring the locality after a transient job [a parallel kernel build]
causes a flurry of load balancing.  The streams jobs return to 'best
case' performance after the perturbation even tho' they started in a
less than optimal locality configuration.

Slide 29 shows the effects on a stand-alone parallel kernel build of
the patches--disabled and enabled, with and without auto-migration of
page cache pages.  Not much change in real or user time.
Auto-migrating page cache pages chewed up a lot of system time for the
stand alone kernel build because the shared pages of the tool chain
executables and libraries were seriously thrashing.  With auto-migration
of anon pages only, we see a slight [OK, tiny!] but repeatable
improvement in real time for a ~2% increase in system time.

Slide 30 shows, IMO, a more interesting result.  On a heavily loaded
system with the stream benchmark running on all nodes, the interconnect
bandwidth becomes a precious resource, so locality matters more.
Comparing a parallel kernel build in this environment with automigration
[anon pages only] enabled vs disabled, I observed:

	~18% improvement in real time
	~4% improvement in user time
	~21% improvement in system time

Slide 27 gives you an idea of what was happening during a parallel
kernel build.  "Swap faults" on that slide are faults on anon pages that
have been moved to the migration cache by automigration.  Those stats
were taken with ad hoc instrumentation.  I've added some vmstats since
then.

So, if an objective is to pack more jobs [guest vms] on a single system,
one might suppose that we'd have a more heavily loaded system, perhaps
spending a lot of time in the kernel handling various faults.  Something
like this approach might help, even on current generation numa
platforms, altho' I'd expect more benefit on larger socket count
systems.  Should be testable.

> 
> It's good to give the patchset more visibility as it's quite unique in
> this area, but when talking with Lee I also thought the synchronous
> migrate-on-fault was probably too aggressive, and I like an algorithm
> where memory follows cpus and cpus follow memory in a totally dynamic
> way.
> 
> I suggested to Lee during our chat (and also to others during
> KS+Plumbers) that we need a more dynamic algorithm that works in the
> background asynchronously. Specifically, I want the cpu to follow
> memory closely whenever idle status allows it (changing cpu at context
> switch is cheap; I don't like pinning or a "single" home node concept)
> and then memory to slowly follow cpu in tandem, in the background, via
> a kernel thread. With both the cpu following memory fast and memory
> following cpu slowly, things should converge over time on optimal
> behavior. I like the migration done from a kthread like
> khugepaged/ksmd, not synchronously adding latency to the page fault
> (or having to take down ptes to trigger the migrate-on-fault;
> migration need never require the app to exit to the kernel and take a
> fault just to migrate, it happens transparently as far as userland is
> concerned, well, of course, unless it trips on the migration pte at
> just the wrong time :).

I don't know about the background migration thread.  Christoph mentioned
the decades of research and attempts to address this issue.  IMO, most
of these stumbled on the cost of collecting sufficient data to know what
pages to migrate where.  And, if you don't know which pages to migrate,
you can end up doing a lot of work for little gain.   I recall Christoph
or someone at SGI calling it "just too late" migration.

With lazy migration, we KNOW what pages the task is referencing in the
fault path, so we move only the pages actually needed right now,
because at any time the scheduler could decide to move the task to a
different node.  I did add a "migration interval" control to experiment
with different length delays in inter-node migration to give a task time
to amortize the automigration overhead.  Needs more "experimentation".
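
For reference, here is the fault-path check condensed from the patch
8/8 fragment earlier in the thread; the stability tests at the top
paraphrase the cover letter's description and may differ from the
actual patch:

	struct page *check_migrate_misplaced_page(struct page *page,
			struct vm_area_struct *vma, unsigned long address)
	{
		int misplaced, polnid;

		/* only unmapped, stable pages are candidates */
		if (page_mapcount(page) || PageWriteback(page))
			return page;

		count_vm_event(PGLOCCHECK);
		misplaced = mpol_misplaced(page, vma, address, &polnid);
		if (!misplaced)
			return page;	/* already where policy wants it */

		count_vm_event(PGMISPLACED);
		return migrate_misplaced_page(page, vma->vm_mm, polnid,
				misplaced_is_interleaved(misplaced));
	}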

> 
> So the patchset looks very interesting, and it may actually be optimal
> for some slower hardware, but I have the perception that these days
> memory being remote isn't as big a deal as not keeping both memory
> controllers in action simultaneously (using just one controller is
> worse than using both simultaneously from the wrong end; locality is
> not as important as not stepping on each other's toes). So in general
> synchronous migrate-on-fault seems a bit too aggressive to me and not
> ideal for newer hardware. Still, this is one of the most interesting
> patchsets in this area I've seen so far.


As I mentioned above, my results were on older hardware, so it will be
interesting to see the results on modern hardware with lots of guest vms
as the workload.  Maybe I'll get to this eventually.  And I believe
you're correct that on modern systems being remote is not a big deal if
the interconnect and target node are lightly loaded.  But, in my
experience with recent 4 and 8-node servers, locality still matters very
much as load increases.

> 
> The homenode logic, ironically, may be optimal for the most important
> bench, because the way that bench is set up all VMs are fairly small
> and there are plenty of them, so it'll never happen that a VM has more
> memory than fits in the ram of a single node. But I like a dynamic
> approach that works best in all environments, even if it's clearly not
> as simple and maybe not as optimal in the one relevant benchmark we
> care about. I'm unsure what the homenode logic is supposed to decide
> when the task has two, three, four times the ram that fits in a single
> node (and that may not be such an uncommon scenario after all). I
> admit not having read enough on this homenode logic, but I never got
> any attraction to it personally, as there should never be any single
> "home" for any task in my view.

Well, as we discussed,  now we have an implicit "home node" anyway:  the
node where a task's kernel data structures are first allocated.  A task
that spends much time in the kernel will always run faster on the node
where its task struct and thread info/stack live.  So, until we can
migrate these [was easier in unix based kernels], we'll always have an
implicit home node.

I'm attaching some statistics I collected while running a stress load on
the patches before posting them.  The 'vmstress-stats' file includes a
description of the statistics.  I've also attached a simple script to
watch the automigration stats if you decided to try out these patches.

Heads up:  as I mentioned to Kosaki-san in other mail, lazy migration
[migrate-on-fault] incurs a null pointer deref in swap_cgroup_record()
in the most recent mmotm [09nov on 37-rc1].  The patches seem quite
robust [modulo a migration remove/duplicate race, I think, under heavy
load :(] on the mmotm version referenced in the patches:  03nov on
2.6.36.  You may be able to find a copy from Andrew, but I've placed a
copy here:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.36-mmotm-101103-1217/mmotm-101103-1217.tar.gz

Regards,
Lee

[-- Attachment #2: vmstress-stats --]
[-- Type: text/plain, Size: 5136 bytes --]

Final stats for a 58-minute usex VM stress workload.

pgs  loc  pages    pages   |  tasks     pages      pages    pages  | ----------------mig cache------------------
checked  misplacd migrated | migrated  scanned    selected  failed | pgs added  pgs removd duplicates refs freed
46496887 44500155 63046843 |   965933  348568709  159804782    595 |  151431996  151416378  187409069  338825351
46503015 44505977 63052665 |   967033  349001693  159996085    595 |  151616958  151602326  187615344  339217482
46508962 44511591 63123815 |   968090  349431720  160191652    595 |  151806705  151796445  187825762  339622018
46514565 44516837 63129061 |   969031  349818983  160368637    595 |  151984095  151972326  188023077  339994848
46520744 44522719 63200479 |   970066  350273208  160586754    595 |  152191950  152172881  188256406  340429177
46526660 44528377 63206137 |   971096  351226149  161308899    595 |  152427014  152369709  188512466  340882087
46533120 44534533 63212293 |   972074  351664143  161522889    595 |  152662157  152583427  188776833  341360178
46538985 44540194 63217953 |   973051  352104161  161738022    595 |  152891544  152785119  189029203  341814220
46808222 44809022 63748925 |   975330  352692128  161930205    596 |  153503983  153174539  189746717  342921179
47056161 45056479 63996383 |   978160  353338853  162087525    597 |  153632047  153543820  190009418  343553163
47375470 45375057 64380497 |   981041  354513459  162767074    598 |  154280022  154043105  190781347  344824378
47620387 45619074 64624514 |   983463  355193001  162993768    598 |  154468417  154452153  191055045  345507110
47626207 45624603 64630043 |   984479  355619380  163192799    598 |  154656878  154644741  191271746  345915993
47632616 45630721 64636160 |   985631  356053915  163378618    598 |  154844351  154830927  191478638  346309235
47638285 45635994 64641433 |   986591  356529658  163639764    598 |  155030094  155021407  191689679  346710556
47643792 45641209 64646649 |   987526  356924666  163822582    598 |  155219308  155193598  191908717  347092074
47649221 45646312 64651752 |   988476  357318754  164004331    598 |  155411758  155379515  192129976  347489210
47654343 45651155 64656595 |   989351  357724761  164207764    598 |  155636719  155567538  192410941  347926759
47660396 45656848 64924432 |   990406  358188870  164430730    598 |  155869117  155795886  192680512  348447071
47661993 45658349 64925933 |   990663  358289854  164475908    598 |  155921468  155921168  192736716  348657884

Migrate on Fault stats:

pgs loc checked -- pages found in swap/migration cache by do_swap_page()
	with zero page_mapcount() and otherwise "stable".

pages misplacd -- of the "loc checked" pages, the number that were found
	to be misplaced relative to the mempolicy in effect--vma, task
	or system default.

pages migrated -- All pages migrated:  migrate-on-fault, mbind(), ...
	Exceeds the misplaced pages in the stats above because the
	test load included programs that moved memory regions around
	using mbind() with the MPOL_MF_MOVE flag.

Auto-migration Stats:

tasks migrated - number of internode task migrations.  Each of these
	migrations resulted in the task walking its address space
	looking for anon pages in vmas with local allocation policy.
	Includes kicks via /proc/<pid>/migrate.

pages scanned -- total number of pages examined as candidates for
	auto-migration in mm/mempolicy.c:check_range() as a result
	of internode task migration or /proc/<pid>/migrate scans.

pages selected -- Anon pages selected for auto-migration.
	If lazy auto-migration is enabled [default], these pages
	will be unmapped to allow migrate-on-fault to migrate
	them if and when a task faults the page.  If lazy auto-
	migration is disabled, these pages will be directly
	migrated [pulled] to the destination node.

pages failed -- the number of the selected pages that the
	kernel failed to unmap for lazy migration or failed
	to direct migrate.

Migration Cache Statistics:

pgs added -- the number of pages added to the migration cache.
	This occurs when pages are unmapped for lazy migration.

pgs removd -- the number of pages removed from the migration
	cache.  This occurs when the last pte referencing the
	cache entry is replaced with a present page pte.
	The number of pages added less the number removed
	is the number of pages still in the cache.

duplicates -- count of migration_duplicate() calls, usually
	via swap_duplicate(), to add a reference to a migration
	cache entry.  This occurs when a page in the migration
	cache is unmapped in try_to_unmap_one() and when a task
	with anon pages in the migration cache forks and all of
	its anon pages become COW shared with the child in
	copy_one_pte().

refs freed -- count of migration cache entry reference freed,
	usually via one of the swap cache free functions.
	When the reference count on a migration cache entry
	goes zero, the entry is removed from the cache.  Thus,
	the number of pages added plus the number of duplicates
	should equal the number of refs freed plus the number
	of pages still in the cache [adds - removes].
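
	As a sanity check of that identity against the final sample
	above: 155921468 added + 192736716 duplicates = 348658184,
	and 348657884 refs freed + (155921468 - 155921168 = 300
	still in the cache) = 348658184, so the books balance.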


[-- Attachment #3: automig_stats --]
[-- Type: application/x-shellscript, Size: 1383 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
                   ` (8 preceding siblings ...)
  2010-11-14  6:37 ` [PATCH/RFC 0/8] numa - Migrate-on-Fault KOSAKI Motohiro
@ 2010-11-17 17:10 ` Avi Kivity
  2010-11-17 17:34   ` Lee Schermerhorn
  9 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2010-11-17 17:10 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-numa, akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, andi, David Rientjes, Andrea Arcangeli

On 11/11/2010 09:44 PM, Lee Schermerhorn wrote:
> This series of patches implements page migration in the fault path.
>
> !!! N.B., Need to consider interaction with KSM and Transparent Huge
> !!! Pages.
>
> The basic idea is that when a fault handler such as do_swap_page()
> finds a cached page with zero mappings that is otherwise "stable"--
> e.g., no I/O in progress--this is a good opportunity to check whether the
> page resides on the node indicated by the mempolicy in the current context.
>
> We only attempt to migrate when there are zero mappings because 1) we can
> easily migrate the page--don't have to go through the effort of removing
> all mappings and 2) default policy--a common case--can give different
> answers from different tasks running on different nodes.  Checking the
> policy when there are zero mappings effectively implements a "first touch"
> placement policy.

A couple of kvm-related notes:
- kvm page faults are significantly more expensive than ordinary page 
faults; this will affect the cost/benefit tradeoff
- kvm faults go through get_user_pages_fast(), not the ordinary fault 
path.  Will the code handle this?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-17 17:10 ` Avi Kivity
@ 2010-11-17 17:34   ` Lee Schermerhorn
  0 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2010-11-17 17:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-numa, akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, andi, David Rientjes, Andrea Arcangeli

On Wed, 2010-11-17 at 19:10 +0200, Avi Kivity wrote:
> On 11/11/2010 09:44 PM, Lee Schermerhorn wrote:
> > This series of patches implements page migration in the fault path.
> >
> > !!! N.B., Need to consider interaction with KSM and Transparent Huge
> > !!! Pages.
> >
> > The basic idea is that when a fault handler such as do_swap_page()
> > finds a cached page with zero mappings that is otherwise "stable"--
> > e.g., no I/O in progress--this is a good opportunity to check whether the
> > page resides on the node indicated by the mempolicy in the current context.
> >
> > We only attempt to migrate when there are zero mappings because 1) we can
> > easily migrate the page--don't have to go through the effort of removing
> > all mappings and 2) default policy--a common case--can give different
> > answers from different tasks running on different nodes.  Checking the
> > policy when there are zero mappings effectively implements a "first touch"
> > placement policy.
> 
> A couple of kvm-related notes:
> - kvm page faults are significantly more expensive than ordinary page 
> faults; this will affect the cost/benefit tradeoff
> - kvm faults go through get_user_pages_fast(), not the ordinary fault 
> path.  Will the code handle this?

Don't know, Avi.  These patches pre-date all of the kvm, ksm,
transparent huge pages work.  I wasn't running any of these on the tests
I ran last week before posting.  When I rebased to the 9nov mmotm, I
attempted to test on a RHEL6.0 install, so probably transparent
hugepages and maybe ksm daemons were running.  Certainly libvirtd was
running, but I don't think I had any guests running.  But, it occurs to
me that the null pointer deref I saw on this config could have been the
result of interactions with newer virt features in RHEL6.

That said, the 'migrate-on-fault' mechanism which is at the heart of
lazy migration is currently hooked [only] into do_swap_page().  So, when
get_user_pages(_fast)() calls the page fault handler, the pages should
be migrated if they're found in the correct state and are "misplaced".
The interactions I'm more worried about are in the filtering of pages
selected for automigration--probably want to skip THPs and KSM'd pages.
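
A sketch of that filtering -- automigrate_candidate() is a
hypothetical helper, and the page-flag tests assume the KSM and THP
work present in this mmotm:

	/*
	 * Hypothetical filter for the automigration scan: skip KSM
	 * and transparent huge pages until their interactions with
	 * migrate-on-fault are understood.
	 */
	static bool automigrate_candidate(struct page *page)
	{
		if (PageKsm(page) || PageTransHuge(page))
			return false;
		return PageAnon(page);
	}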

Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
  2010-11-17 17:03       ` Lee Schermerhorn
@ 2010-11-17 21:27         ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-17 21:27 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, KOSAKI Motohiro, linux-numa, akpm, Mel Gorman,
	Nick Piggin, Hugh Dickins, andi, David Rientjes, Avi Kivity

On Wed, Nov 17, 2010 at 12:03:56PM -0500, Lee Schermerhorn wrote:
> Slide 30 shows, IMO, a more interesting result.  On a heavily loaded
> system with the stream benchmark running on all nodes, the interconnect
> bandwidth becomes a precious resource, so locality matters more.
> Comparing a parallel kernel build in this environment with automigration
> [anon pages only] enabled vs disabled, I observed:
> 
> 	~18% improvement in real time
> 	~4% improvement in user time
> 	~21% improvement in system time

Nice results indeed.

> With lazy migration, we KNOW what pages the task is referencing in the
> fault path, so we move only the pages actually needed right now,
> because at any time the scheduler could decide to move the task to a
> different node.  I did add a "migration interval" control to experiment
> with different length delays in inter-node migration to give a task time
> to amortize the automigration overhead.  Needs more "experimentation".

OK, my idea is that even if we don't collect any "stat", we can just
clear the young bits instead of creating swap entries. That's quite
fast, especially for hugepages, as there are very few hugepmds. Then,
later, from the kernel thread we scan the hugepmds again to check
whether any young bit is set, and we validate the placement of the
PageTransHuge page against the actual cpu the thread is running on (or
was last running on). That should lead to the exact same results but
with lower overhead and no page fault at all if the placement was
already correct (and no page fault unless userland accesses the page
exactly during the migration copy, via the migration entry).

The scan of the hugepmds to see which ones have the young bit set is
equivalent to your unmapping of the pages as a "timing issue". Before
you unmap the pages, things weren't guaranteed to be running on the
right node with your current migrate-on-fault either. So I don't see
why it should behave any differently from a heuristic perspective.

I just don't like unmapping the pages; I'd prefer to just clear the
young bit instead.

Now, there are architectures that may not have any young bit at all,
and those would have to approximate by emulating the young bit in
software, so ideally the same logic should work for those archs too
(and maybe those archs are obsolete or don't need numa).

> As I mentioned above, my results were on older hardware, so it will be
> interesting to see the results on modern hardware with lots of guest vms
> as the workload.  Maybe I'll get to this eventually.  And I believe

Agreed.

> Well, as we discussed,  now we have an implicit "home node" anyway:  the
> node where a task's kernel data structures are first allocated.  A task
> that spends much time in the kernel will always run faster on the node
> where its task struct and thread info/stack live.  So, until we can
> migrate these [was easier in unix based kernels], we'll always have an
> implicit home node.

Yes, I remember your point. But there is no guarantee the kernel stack
and other kernel data structures were satisfied from the local node
where fork() was executed. So in effect there is no home node. Also, by
the time execve has run a couple of times in a row, the concept of a
home node as far as the kernel is concerned may well be lost forever,
equivalent to having allocated the task struct and kernel stack on the
wrong node during fork...

So as of now there is no real home node; we just try to do the obvious
thing when fork runs, but then it goes random over time, exactly as
happens for userland. You're fixing userland for long-lived tasks; the
kernel has the same issue, but I guess we're not going to address the
kernel side any time soon (no migrate for slab, unless we dramatically
slow down all kernel allocations ;). Migrate, OTOH, is a nice
abstraction that we can take advantage of for plenty of things, like
memory compaction, and here to fix up the memory locality of tasks.

About the pagecache, the problem in slide 26 is that with make -j, 99%
of the mmapped pagecache (that would be migrated-on-fault) is shared
across all tasks, so it thrashes badly and for no good reason. The
exact same regression and thrashing would happen for shared anonymous
memory, I guess... (fork + reads in a loop).

So it'd be better to start by migrating everything (pagecache
included!) if page_mapcount == 1; that should be a safe start...
Migrating shared entities is way more troublesome unless we're able to
verify that all users sit on the same node (only in that case would it
be safe, and it'd also have a chance to reduce the thrashing).

So my suggestion would be to create a patch with a kernel thread that
scans a thousand ptes or hugepmds per second, and for each pte or
hugepmd encountered we check that the mapcount is 1; if it is, we run
test_and_clear_young_notify() (which takes care of shadow ptes too,
and won't actually trigger any Linux page fault, nor gup_fast for a
secondary page fault, if the page is in the right place); if it
returns true, we validate that the page is on the right node, and if
it's not we migrate it right away... That applies to the pagecache too
(only if mapcount == 1). I would remove any hook from do_swap_page,
and the migrate-on-fault concept as a whole, and I'd move the whole
thing over to test_and_clear_young_notify() instead. There's surely
stuff to keep from this patchset (all the validation of the page, the
migration invocation, etc.).

Also, currently migration won't be able to migrate a THP; it'd work,
but it'd split it, so it's not so nice that khugepaged is required to
get the THP performance back after migration. Unfortunately hugetlbfs
has grown even more to mimic the core VM code, and now migrate has
special hooks to migrate hugetlbfs pages, which will make it more
difficult to teach migrate to differentiate between hugetlbfs migrate
and THP migrate to have both working simultaneously... I'm not sure
why hugetlbfs has to grow so much (especially outside of vm_flags &
VM_HUGETLB checks, which in the past kept it more separate and less
intrusive in the core VM).
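
For concreteness, a sketch of the per-pte step of the background scan
suggested above -- task_node() and migrate_async() are hypothetical
helpers, and ptep_clear_flush_young_notify() stands in for the
test_and_clear_young_notify() named here:

	/*
	 * One step of the background scan: only pages mapped exactly
	 * once, and only if referenced since the last scan, get their
	 * placement checked and a migration queued.
	 */
	static void scan_one_pte(struct task_struct *task,
			struct vm_area_struct *vma,
			unsigned long addr, pte_t *ptep)
	{
		struct page *page;

		if (!pte_present(*ptep))
			return;
		page = vm_normal_page(vma, addr, *ptep);
		if (!page || page_mapcount(page) != 1)
			return;		/* shared: skip for now */
		if (!ptep_clear_flush_young_notify(vma, addr, ptep))
			return;		/* not referenced since last scan */
		if (page_to_nid(page) != task_node(task))
			migrate_async(page, task_node(task));
	}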

^ permalink raw reply	[flat|nested] 22+ messages in thread


Thread overview: 22+ messages
2010-11-11 19:44 [PATCH/RFC 0/8] numa - Migrate-on-Fault Lee Schermerhorn
2010-11-11 19:44 ` [PATCH/RFC 1/8] numa - Migrate-on-Fault - add Kconfig option Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 2/8] numa - Migrate-on-Fault - add cpuset control Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 3/8] numa - Migrate-on-Fault - check for misplaced page Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 4/8] numa - Migrate-on-Fault - migrate misplaced pages Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 5/8] numa - Migrate-on-Fault - migrate misplaced anon pages Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 6/8] numa - Migrate-on-Fault - add mbind() MPOL_MF_LAZY flag Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 7/8] numa - Migrate-on-Fault - mbind() NOOP policy Lee Schermerhorn
2010-11-11 19:45 ` [PATCH/RFC 8/8] numa - Migrate-on-Fault - add statistics Lee Schermerhorn
2010-11-14  6:37 ` [PATCH/RFC 0/8] numa - Migrate-on-Fault KOSAKI Motohiro
2010-11-15 14:13   ` Christoph Lameter
2010-11-15 14:21     ` Andi Kleen
2010-11-15 14:37       ` Andrea Arcangeli
2010-11-15 14:33     ` Andrea Arcangeli
2010-11-17 17:03       ` Lee Schermerhorn
2010-11-17 21:27         ` Andrea Arcangeli
2010-11-16  4:54     ` KOSAKI Motohiro
2010-11-17 14:45       ` Lee Schermerhorn
2010-11-17 17:10 ` Avi Kivity
2010-11-17 17:34   ` Lee Schermerhorn
