* [RFC][PATCH 00/26] sched/numa
@ 2012-03-16 14:40 Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
                   ` (28 more replies)
  0 siblings, 29 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm


Hi All,

While the current scheduler has knowledge of the machine topology, including
NUMA (although there's room for improvement there as well [1]), it is
completely insensitive to which nodes a task's memory actually resides on.

Current upstream task memory allocation prefers to use the node the task is
currently running on (unless explicitly told otherwise, see
mbind()/set_mempolicy()), and with the scheduler free to move the task about at
will, the task's memory can end up being spread all over the machine's nodes.
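
For concreteness, the "explicitly told otherwise" case looks something like
the sketch below (plain existing API, nothing from this series; the node
number is just an example, link with -lnuma):

  #include <numaif.h>	/* set_mempolicy(), MPOL_BIND */
  #include <stdio.h>	/* perror() */

  /* All further allocations of the calling task come from 'node' only. */
  static void bind_to_node(int node)
  {
  	unsigned long nodemask = 1UL << node;

  	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) < 0)
  		perror("set_mempolicy");
  }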

While the scheduler does a reasonable job of keeping short running tasks on a
single node (by means of simply not doing the cross-node migration very often),
it completely blows for long-running processes with a large memory footprint.

This patch-set aims at improving this situation. It does so by assigning a
preferred, or home, node to every process/thread_group. Memory allocation is
then directed by this preference instead of the node the task might actually be
running on momentarily. The load-balancer is also modified to prefer running
the task on its home-node, although not at the cost of letting CPUs go idle or
at the cost of execution fairness.

On top of this a new NUMA balancer is introduced, which can change a process'
home-node the hard way. This heavy process migration is driven by two factors:
either tasks are running away from their home-node, or memory is being
allocated away from the home-node. In either case, it tries to move processes
around to make the 'problem' go away.

The home-node migration handles both cpu and memory (anonymous only for now) in
an integrated fashion. The memory migration uses migrate-on-fault to avoid
doing a lot of work from the actual numa balancer kernel thread and only
migrates the active memory.

For processes that have more tasks than would fit on a node and which want to
split their activity in a useful fashion, the patch-set introduces two new
syscalls: sys_numa_tbind()/sys_numa_mbind(). These syscalls can be used to
create {thread}x{vma} groups which are then scheduled as a unit instead of the
entire process.

That said, it's still early days and there are lots of improvements to make.

On to the actual patches...

The first two are generic cleanups:

  [01/26] mm, mpol: Re-implement check_*_range() using walk_page_range()
  [02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT

The second set is a rework of Lee Schermerhorn's Migrate-on-Fault patches [2]:

  [03/26] mm, mpol: add MPOL_MF_LAZY ...
  [04/26] mm, mpol: add MPOL_MF_NOOP
  [05/26] mm, mpol: Check for misplaced page
  [06/26] mm: Migrate misplaced page
  [07/26] mm: Handle misplaced anon pages
  [08/26] mm, mpol: Simplify do_mbind()

The third set implements the basic numa balancing:

  [09/26] sched, mm: Introduce tsk_home_node()
  [10/26] mm, mpol: Make mempolicy home-node aware
  [11/26] mm, mpol: Lazy migrate a process/vma
  [12/26] sched, mm: sched_{fork,exec} node assignment
  [13/26] sched: Implement home-node awareness
  [14/26] sched, numa: Numa balancer
  [15/26] sched, numa: Implement hotplug hooks
  [16/26] sched, numa: Abstract the numa_entity

The next three patches are a band-aid; Lai Jiangshan (and Paul McKenney) are
doing a proper implementation. The reverts are me being lazy about forward
porting my call_srcu() implementation.

  [17/26] srcu: revert1
  [18/26] srcu: revert2
  [19/26] srcu: Implement call_srcu()

The last bits implement the new syscalls:

  [20/26] mm, mpol: Introduce vma_dup_policy()
  [21/26] mm, mpol: Introduce vma_put_policy()
  [22/26] mm, mpol: Split and expose some mempolicy functions
  [23/26] sched, numa: Introduce sys_numa_{t,m}bind()
  [24/26] mm, mpol: Implement numa_group RSS accounting
  [25/26] sched, numa: Only migrate long-running entities
  [26/26] sched, numa: A few debug bits


And a few numbers...

On my WSM-EP (2 nodes, 6 cores/node, 2 threads/core), I ran 48 stream
benchmarks [3] (modified to use ~230MB and run long).

Without these patches it degrades into 50-50 local/remote memory accesses:

 Performance counter stats for 'sleep 2':

       259,668,750 r01b7@500b:u 		[100.00%]
       262,170,142 r01b7@200b:u                                                

       2.010446121 seconds time elapsed

With the patches there's a significant improvement in locality:

 Performance counter stats for 'sleep 2':

       496,860,345 r01b7@500b:u 		[100.00%]
        78,292,565 r01b7@200b:u                                                

       2.010707488 seconds time elapsed

(the perf events are a bit magical and not supported in an actual perf
 release -- but the first one is L3 misses to local dram, the second is
 L3 misses to remote dram)
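
(Back of the envelope, from the above: without the patches roughly
 260M / (260M + 262M) ~= 50% of the sampled L3 misses are serviced from
 local dram; with the patches it's roughly 497M / (497M + 78M) ~= 86%
 local.)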

If you look at those numbers you can also see that the sum is greater in the
second case; this means that we can service L3 misses at a higher rate, which
translates into a performance gain.

These numbers also show that while there's a marked improvement, there's still
some gain to be had. The current numa balancer is still somewhat fickle.

 ~ Peter


[1] - http://marc.info/?l=linux-kernel&m=130218515520540
      now that we have SD_OVERLAP it should be fairly easy to do.

[2] - http://markmail.org/message/mdwbcitql5ka4uws

[3] - https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/memory/stream.tar 



* [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: mempol-pagewalk.patch --]
[-- Type: text/plain, Size: 4905 bytes --]

Fixes-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempolicy.c |  141 ++++++++++++++++++---------------------------------------
 1 file changed, 45 insertions(+), 96 deletions(-)
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -460,105 +460,45 @@ static const struct mempolicy_operations
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
-/* Scan through pages checking if pages follow certain conditions. */
-static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
-
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 * And we cannot move PageKsm pages sensibly or safely yet.
-		 */
-		if (PageReserved(page) || PageKsm(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
-
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
-}
+struct mempol_walk_data {
+	struct vm_area_struct *vma;
+	const nodemask_t *nodes;
+	unsigned long flags;
+	void *private;
+};
 
-static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int check_pte_entry(pte_t *pte, unsigned long addr,
+			   unsigned long end, struct mm_walk *walk)
 {
-	pmd_t *pmd;
-	unsigned long next;
+	struct mempol_walk_data *data = walk->private;
+	struct page *page;
+	int nid;
 
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		if (check_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
+	if (!pte_present(*pte))
+		return 0;
 
-static inline int check_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
+	page = vm_normal_page(data->vma, addr, *pte);
+	if (!page)
+		return 0;
 
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (check_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 * And we cannot move PageKsm pages sensibly or safely yet.
+	 */
+	if (PageReserved(page) || PageKsm(page))
+		return 0;
 
-static inline int check_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *data->nodes) == !!(data->flags & MPOL_MF_INVERT))
+		return 0;
 
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (check_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
-	return 0;
+	if (data->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+		migrate_page_add(page, data->private, data->flags);
+		return 0;
+	}
+
+	return -EIO;
 }
 
 /*
@@ -570,9 +510,18 @@ static struct vm_area_struct *
 check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		const nodemask_t *nodes, unsigned long flags, void *private)
 {
-	int err;
 	struct vm_area_struct *first, *vma, *prev;
-
+	struct mempol_walk_data data = {
+		.nodes = nodes,
+		.flags = flags,
+		.private = private,
+	};
+	struct mm_walk walk = {
+		.pte_entry = check_pte_entry,
+		.mm = mm,
+		.private = &data,
+	};
+	int err;
 
 	first = find_vma(mm, start);
 	if (!first)
@@ -595,8 +544,8 @@ check_range(struct mm_struct *mm, unsign
 				endvma = end;
 			if (vma->vm_start > start)
 				start = vma->vm_start;
-			err = check_pgd_range(vma, start, endvma, nodes,
-						flags, private);
+			data.vma = vma;
+			err = walk_page_range(start, endvma, &walk);
 			if (err) {
 				first = ERR_PTR(err);
 				break;




* [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-07-06 10:32   ` Johannes Weiner
  2012-07-06 14:54   ` Kyungmin Park
  2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-mm-remove_numa_interleave_hit.patch --]
[-- Type: text/plain, Size: 4822 bytes --]

The NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants to be
compared to either a total of interleave allocations or to a miss count.
Remove it.

Fixing it would be possible, but since we've gone years without these
statistics I figure we can continue that way.

This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/base/node.c    |    2 +-
 include/linux/mmzone.h |    1 -
 mm/mempolicy.c         |   66 +++++++++++++++--------------------------------
 3 files changed, 22 insertions(+), 47 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5693ece..942cdbc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -172,7 +172,7 @@ static ssize_t node_read_numastat(struct sys_device * dev,
 		       node_page_state(dev->id, NUMA_HIT),
 		       node_page_state(dev->id, NUMA_MISS),
 		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       0UL,
 		       node_page_state(dev->id, NUMA_LOCAL),
 		       node_page_state(dev->id, NUMA_OTHER));
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3ac040f..3a3be81 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -111,7 +111,6 @@ enum zone_stat_item {
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
 	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c3fdbcb..2c48c45 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1530,11 +1530,29 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next;
+	struct task_struct *me = current;
+
+	nid = me->il_next;
+	next = next_node(nid, policy->v.nodes);
+	if (next >= MAX_NUMNODES)
+		next = first_node(policy->v.nodes);
+	if (next < MAX_NUMNODES)
+		me->il_next = next;
+	return nid;
+}
+
 /* Return a zonelist indicated by gfp for node representing a mempolicy */
 static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	int nd)
 {
 	switch (policy->mode) {
+	case MPOL_INTERLEAVE:
+		nd = interleave_nodes(policy);
+		break;
 	case MPOL_PREFERRED:
 		if (!(policy->flags & MPOL_F_LOCAL))
 			nd = policy->v.preferred_node;
@@ -1556,21 +1574,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	return node_zonelist(nd, gfp);
 }
 
-/* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
-{
-	unsigned nid, next;
-	struct task_struct *me = current;
-
-	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
-	if (next < MAX_NUMNODES)
-		me->il_next = next;
-	return nid;
-}
-
 /*
  * Depending on the memory policy provide a node from which to allocate the
  * next slab entry.
@@ -1801,21 +1804,6 @@ out:
 	return ret;
 }
 
-/* Allocate a page in interleaved policy.
-   Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-					unsigned nid)
-{
-	struct zonelist *zl;
-	struct page *page;
-
-	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
-		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
-	return page;
-}
-
 /**
  * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
@@ -1848,15 +1836,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	struct page *page;
 
 	get_mems_allowed();
-	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
-		put_mems_allowed();
-		return page;
-	}
 	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
 		/*
@@ -1909,12 +1888,9 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
 	 */
-	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
-		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+	page = __alloc_pages_nodemask(gfp, order,
+			policy_zonelist(gfp, pol, numa_node_id()),
+			policy_nodemask(gfp, pol));
 	put_mems_allowed();
 	return page;
 }





* [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-23 11:50   ` Mel Gorman
  2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: migrate-on-fault-06-mbind-lazy-migrate.patch --]
[-- Type: text/plain, Size: 8963 bytes --]

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

This patch adds another mbind() flag to request "lazy migration".
The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are simply unmapped from the calling task's page table ['_MOVE]
or from all referencing page tables [_MOVE_ALL].  Anon pages will first
be added to the swap [or migration?] cache, if necessary.  The pages
will be migrated in the fault path on "first touch", if the policy
dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After unmap, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
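
[ Illustrative userspace sketch of the intended usage; MPOL_MF_LAZY is
  introduced by this patch so it is not in released <numaif.h> headers,
  hence the local define, and the node value is only an example: ]

  #include <numaif.h>	/* mbind(), MPOL_BIND, MPOL_MF_MOVE */
  #include <stdio.h>	/* perror() */

  #ifndef MPOL_MF_LAZY
  #define MPOL_MF_LAZY	(1<<3)	/* as added by this patch */
  #endif

  /*
   * Rebind [addr, addr+len) to 'node', but only unmap the pages now;
   * each page is migrated on first touch, if the policy says it is
   * misplaced at that time.
   */
  static void lazy_rebind(void *addr, unsigned long len, int node)
  {
  	unsigned long nodemask = 1UL << node;

  	if (mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
  		  MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
  		perror("mbind");
  }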

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |   13 ++++--
 include/linux/migrate.h   |    2 
 include/linux/rmap.h      |    5 +-
 mm/mempolicy.c            |   20 +++++----
 mm/migrate.c              |   96 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c                 |    7 +--
 6 files changed, 125 insertions(+), 18 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -47,9 +47,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -31,6 +31,8 @@ extern int migrate_vmas(struct mm_struct
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+
+extern int migrate_pages_unmap_only(struct list_head *);
 #else
 #define PAGE_MIGRATION 0
 
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -164,8 +164,9 @@ int page_referenced_one(struct page *, s
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
-	TTU_MIGRATION = 1,		/* migration mode */
-	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_MIGRATE_DIRECT = 1,		/* direct migration mode */
+	TTU_MIGRATE_DEFERRED = 2,	/* deferred [lazy] migration mode */
+	TTU_MUNLOCK = 4,		/* munlock mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1094,8 +1094,7 @@ static long do_mbind(unsigned long start
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+  	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1154,21 +1153,26 @@ static long do_mbind(unsigned long start
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
-			nr_failed = migrate_pages(&pagelist, new_vma_page,
+			if (flags & MPOL_MF_LAZY)
+				nr_failed = migrate_pages_unmap_only(&pagelist);
+			else {
+				nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
 						false, true);
+			}
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static int __unmap_and_move(struct page 
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page, TTU_MIGRATE_DIRECT|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -920,7 +920,7 @@ static int unmap_and_move_huge_page(new_
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(hpage, TTU_MIGRATE_DIRECT|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
@@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
 }
 
 /*
+ * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
+ * necessary.  Migration will occur, if policy dictates, when a task faults
+ * an unmapped page back into its page table--i.e., on "first touch" after
+ * unmapping.  Note that migrate-on-fault only migrates pages whose mapping
+ * [e.g., file system] supplies a migratepage op, so we skip pages that
+ * wouldn't migrate on fault.
+ *
+ * Pages are placed back on the lru whether or not they were successfully
+ * unmapped.  Like migrate_pages().
+ *
+ * Unlike migrate_pages(), this function is only called in the context of
+ * a task that is unmapping its own pages while holding its map semaphore
+ * for write.
+ */
+int migrate_pages_unmap_only(struct list_head *pagelist)
+{
+	struct page *page;
+	struct page *page2;
+	int nr_failed = 0;
+	int nr_unmapped = 0;
+
+	list_for_each_entry_safe(page, page2, pagelist, lru) {
+		int ret;
+
+		cond_resched();
+
+		/*
+		 * Give up easily.  We ARE being lazy.
+		 */
+		if (page_count(page) == 1)
+			goto next;
+
+		if (unlikely(PageTransHuge(page)))
+			if (unlikely(split_huge_page(page)))
+				goto next;
+
+		if (!trylock_page(page))
+			goto next;
+
+		if (PageKsm(page) || PageWriteback(page))
+			goto unlock;
+
+		/*
+		 * see comments in unmap_and_move()
+		 */
+		if (!page->mapping)
+			goto unlock;
+
+		if (PageAnon(page)) {
+			if (!PageSwapCache(page) && !add_to_swap(page)) {
+				nr_failed++;
+				goto unlock;
+			}
+		} else {
+			struct address_space *mapping = page_mapping(page);
+			BUG_ON(!mapping);
+			if (!mapping->a_ops->migratepage)
+				goto unlock;
+		}
+
+		ret = try_to_unmap(page,
+	             TTU_MIGRATE_DEFERRED|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		if (ret != SWAP_SUCCESS || page_mapped(page))
+			nr_failed++;
+		else
+			nr_unmapped++;
+
+unlock:
+		unlock_page(page);
+next:
+		list_del(&page->lru);
+		dec_zone_page_state(page, NR_ISOLATED_ANON +
+				page_is_file_cache(page));
+		putback_lru_page(page);
+
+	}
+
+	/*
+	 * Drain local per cpu pagevecs so fault path can find the pages
+	 * on the lru.  If we got migrated during the loop above, we may
+	 * have left pages cached on other cpus.  But, we'll live with that
+	 * here to avoid lru_add_drain_all().
+	 * TODO:  mechanism to drain on only those cpus we've been
+	 *        scheduled on between two points--e.g., during the loop.
+	 */
+	if (nr_unmapped)
+		lru_add_drain();
+
+	return nr_failed;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1288,12 +1288,13 @@ int try_to_unmap_one(struct page *page, 
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATE_DIRECT);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
+	} else if (PAGE_MIGRATION &&
+		         (TTU_ACTION(flags) == TTU_MIGRATE_DIRECT)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -1499,7 +1500,7 @@ static int try_to_unmap_anon(struct page
 		 * locking requirements of exec(), migration skips
 		 * temporary VMAs until after exec() completes.
 		 */
-		if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+		if (PAGE_MIGRATION && (flags & TTU_MIGRATE_DIRECT) &&
 				is_vma_temporary_stack(vma))
 			continue;
 




* [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (2 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-07-06 18:40   ` Rik van Riel
  2012-03-16 14:40 ` [RFC][PATCH 05/26] mm, mpol: Check for misplaced page Peter Zijlstra
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: migrate-on-fault-07-mbind-noop-policy.patch --]
[-- Type: text/plain, Size: 2306 bytes --]

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
policy to mbind().  When the NOOP policy is used with the 'MOVE
and 'LAZY flags, mbind() [check_range()] will walk the specified
range and unmap eligible pages so that they will be migrated on
next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
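
[ Illustrative userspace sketch; MPOL_NOOP and MPOL_MF_LAZY both come from
  this series, so the values below are taken from these patches and are not
  in any released <numaif.h>: ]

  #include <numaif.h>	/* mbind(), MPOL_MF_MOVE */
  #include <stdio.h>	/* perror() */

  #ifndef MPOL_NOOP
  #define MPOL_NOOP	4	/* retain existing policy for range */
  #endif
  #ifndef MPOL_MF_LAZY
  #define MPOL_MF_LAZY	(1<<3)
  #endif

  /*
   * Unmap the range so each page is lazily migrated on next touch,
   * following whatever policy already applies; no policy change.
   * NOOP wants an empty/NULL nodemask.
   */
  static void lazy_touch_migrate(void *addr, unsigned long len)
  {
  	if (mbind(addr, len, MPOL_NOOP, NULL, 0,
  		  MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
  		perror("mbind");
  }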

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |    8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsign
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1121,7 +1121,7 @@ static long do_mbind(unsigned long start
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1173,7 +1173,7 @@ static long do_mbind(unsigned long start
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma))
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
 	if (!err) {




* [RFC][PATCH 05/26] mm, mpol: Check for misplaced page
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (3 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: migrate-on-fault-03-mpol_misplaced.patch --]
[-- Type: text/plain, Size: 4570 bytes --]

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocated the page
without consulting the policy again.  Because interleaved and
non-interleaved allocations are accounted differently, the function
also returns whether or not the new node came from an interleaved
policy, if the page is misplaced.

A subsequent patch will call this function from the fault path for
stable pages with zero page_mapcount().  Because of this, I don't
want to go ahead and allocate the page, e.g., via alloc_page_vma()
only to have to free it if it has the correct policy.  So, I just
mimic the alloc_page_vma() node computation logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    3 +
 mm/mempolicy.c            |   79 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -67,6 +67,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 #ifdef __KERNEL__
 
@@ -261,6 +262,8 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1117,6 +1117,9 @@ static long do_mbind(unsigned long start
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -2072,6 +2075,82 @@ mpol_shared_policy_lookup(struct shared_
 	return pol;
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * 	-1	- not misplaced, page is in the right node
+ * 	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);




* [RFC][PATCH 06/26] mm: Migrate misplaced page
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (4 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 05/26] mm, mpol: Check for misplaced page Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-04-03 17:32   ` Dan Smith
  2012-03-16 14:40 ` [RFC][PATCH 07/26] mm: Handle misplaced anon pages Peter Zijlstra
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: migrate-on-fault-04-migrate_misplaced_page.patch --]
[-- Type: text/plain, Size: 9385 bytes --]

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

This patch adds a new function migrate_misplaced_page() to mm/migrate.c
[where most of the other page migration functions live] to migrate a
misplaced page to a specified destination node.  This function will be
called from the fault path.  Because we already know the destination
node for the migration, we allocate pages directly rather than rerunning
the policy node computation in alloc_page_vma().

Since the fault path holds an extra page reference compared to the other
migration paths, introduce a new migration_mode (MIGRATE_FAULT) to
communicate this.

The patch adds the function check_migrate_misplaced_page() to migrate.c
to check whether a page is "misplaced" -- i.e. on a node different
from what the policy for (vma, address) dictates.  This check
involves accessing the vma policy, so we only do this if:
   * page has zero mapcount [no pte references]
   * page is not in writeback
   * page is up to date
   * page's mapping has a migratepage a_op [no fallback!]
If these checks are satisfied, the page will be migrated to the
"correct" node, if possible.  If migration fails for any reason,
we just use the original page.

Subsequent patches will hook the fault handlers [anon, and possibly
file and/or shmem] to check_migrate_misplaced_page().

XXX: hnaz, dansmith saw some bad_page() reports when using memcg, I
could not reproduce -- is there something funny with the mem_cgroup
calls in the below patch?

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
[ removed the weird ignore page count on migrate stuff with the
  new migrate_mode and strict accounting ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h    |   18 ------
 include/linux/migrate.h      |    9 +++
 include/linux/migrate_mode.h |    3 +
 mm/mempolicy.c               |   19 ++++++
 mm/migrate.c                 |  128 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 160 insertions(+), 17 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -77,6 +77,7 @@ enum mpol_rebind_step {
 #include <linux/spinlock.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 
 struct mm_struct;
 
@@ -245,22 +246,7 @@ extern int mpol_parse_str(char *str, str
 extern int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 			int no_context);
 
-/* Check if a vma is migratable */
-static inline int vma_migratable(struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
-		return 0;
-	/*
-	 * Migration allocates pages in the highest zone. If we cannot
-	 * do so then migration (at least from node to node) is not
-	 * possible.
-	 */
-	if (vma->vm_file &&
-		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
-								< policy_zone)
-			return 0;
-	return 1;
-}
+extern int vma_migratable(struct vm_area_struct *);
 
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
 
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,6 +33,10 @@ extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 
 extern int migrate_pages_unmap_only(struct list_head *);
+extern struct page *check_migrate_misplaced_page(struct page *,
+			struct vm_area_struct *, unsigned long);
+extern struct page *migrate_misplaced_page(struct page *,
+			struct mm_struct *, int);
 #else
 #define PAGE_MIGRATION 0
 
@@ -67,5 +71,10 @@ static inline int migrate_huge_page_move
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline struct page *check_migrate_misplaced_page(struct page *page,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	return page;
+}
 #endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ * 	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -460,6 +460,25 @@ static const struct mempolicy_operations
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+/*
+ * Check whether a vma is migratable
+ */
+int vma_migratable(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
+		return 0;
+	/*
+	 * Migration allocates pages in the highest zone. If we cannot
+	 * do so then migration (at least from node to node) is not
+	 * possible.
+	 */
+	if (vma->vm_file &&
+		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
+								< policy_zone)
+			return 0;
+	return 1;
+}
+
 struct mempol_walk_data {
 	struct vm_area_struct *vma;
 	const nodemask_t *nodes;
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,6 +294,10 @@ static int migrate_page_move_mapping(str
  					page_index(page));
 
 	expected_count = 2 + page_has_private(page);
+	if (mode == MIGRATE_FAULT) {
+		expected_count++;
+		mode = MIGRATE_ASYNC; /* don't bother blocking for MoF */
+	}
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -1517,4 +1521,126 @@ int migrate_vmas(struct mm_struct *mm, c
  	}
  	return err;
 }
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.  Page is already unmapped, up to date and locked by caller.
+ * Anon pages are in the swap cache.  Page's mapping has a migratepage aop.
+ *
+ * page refs on entry/exit:  cache + fault path [+ bufs]
+ */
+struct page *
+migrate_misplaced_page(struct page *page, struct mm_struct *mm, int node)
+{
+	struct page *oldpage = page, *newpage;
+	struct address_space *mapping = page_mapping(page);
+	struct mem_cgroup *mcg;
+	unsigned int gfp;
+	int rc = 0;
+	int charge = -ENOMEM;
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(page_mapcount(page));
+	VM_BUG_ON(PageAnon(page) && !PageSwapCache(page));
+	VM_BUG_ON(!mapping || !mapping->a_ops->migratepage);
+
+	/*
+	 * remove old page from LRU so it can't be found while migrating
+	 * except thru' the cache by other faulting tasks who will
+	 * block behind my lock.
+	 */
+	if (isolate_lru_page(page))	/* incrs page count on success */
+		goto out_nolru;	/* we lost */
+
+	/*
+	 * Never wait for allocations just to migrate on fault,
+	 * but don't dip into reserves.
+	 * And, only accept pages from specified node.
+	 * No sense migrating to a different "misplaced" page!
+	 */
+	gfp = (unsigned int)mapping_gfp_mask(mapping) & ~__GFP_WAIT;
+	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE ;
+
+	newpage = alloc_pages_node(node, gfp, 0);
+	if (!newpage)
+		goto out;	/* give up */
+
+	/*
+	 * can't just lock_page() -- "might sleep" in atomic context
+	 */
+	if (!trylock_page(newpage))
+		BUG();		/* new page should be unlocked!!! */
+
+	// XXX hnaz, is this right?
+	charge = mem_cgroup_prepare_migration(page, newpage, &mcg, gfp);
+	if (charge == -ENOMEM) {
+		rc = charge;
+		goto out;
+	}
+
+	newpage->index = page->index;
+	newpage->mapping = page->mapping;
+	if (PageSwapBacked(page))		/* like move_to_new_page() */
+		SetPageSwapBacked(newpage);
+
+	/*
+	 * migrate a_op transfers cache [+ buf] refs
+	 */
+	rc = mapping->a_ops->migratepage(mapping, newpage, page, MIGRATE_FAULT);
+	if (!rc) {
+		get_page(newpage);	/* add isolate_lru_page ref */
+		put_page(page);		/* drop       "          "  */
+
+		unlock_page(page);
+		put_page(page);		/* drop fault path ref & free */
+
+		page = newpage;
+	}
+
+out:
+	if (!charge)
+		mem_cgroup_end_migration(mcg, oldpage, newpage, !rc);
+
+	if (rc) {
+		unlock_page(newpage);
+		__free_page(newpage);
+	}
+
+	putback_lru_page(page);		/* ultimately, drops a page ref */
+
+out_nolru:
+	return page;			/* locked, to complete fault */
+}
+
+/*
+ * Called in fault path, if migrate_on_fault_enabled(current) for a page
+ * found in the cache, page is locked, and page_mapping(page) != NULL;
+ * We check for page uptodate here because we want to be able to do any
+ * needed migration before grabbing the page table lock.  In the anon fault
+ * path, PageUptodate() isn't checked until after locking the page table.
+ *
+ * For migrate on fault, we only migrate pages whose mapping has a
+ * migratepage op.  The fallback path requires writing out the page and
+ * reading it back in.  That sort of defeats the purpose of
+ * migrate-on-fault [performance].  So, we don't even bother to check
+ * for misplacment unless the op is present.  Of course, this is an extra
+ * check in the fault path for pages we care about :-(
+ */
+struct page *check_migrate_misplaced_page(struct page *page,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	int node;
+
+	if (page_mapcount(page) || PageWriteback(page) ||
+			unlikely(!PageUptodate(page))  ||
+			!page_mapping(page)->a_ops->migratepage)
+		return page;
+
+	node = mpol_misplaced(page, vma, address);
+	if (node == -1)
+		return page;
+
+	return migrate_misplaced_page(page, vma->vm_mm, node);
+}
+
+#endif /* CONFIG_NUMA */




* [RFC][PATCH 07/26] mm: Handle misplaced anon pages
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (5 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind() Peter Zijlstra
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: migrate-on-fault-05.1-misplaced-anon-pages.patch --]
[-- Type: text/plain, Size: 3276 bytes --]

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

This patch simply hooks the anon page fault handler [do_swap_page()]
to check for and migrate misplaced pages if enabled and page won't
be "COWed".

This introduces can_reuse_swap_page(), since reuse_swap_page() does
delete_from_swap_cache(), which messes up our migration path (which
assumes it is still a swapcache page).

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
[ removed the retry loops after lock_page on a swapcache which tried
  to fixup the wreckage caused by ignoring the page count on migate;
  added can_reuse_swap_page(); moved the migrate-on-fault enabled
  test into check_migrate_misplaced_page() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/swap.h |    4 +++-
 mm/memory.c          |   17 +++++++++++++++++
 mm/swapfile.c        |   13 +++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -342,6 +342,7 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int reuse_swap_page(struct page *);
+extern int can_reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
@@ -459,7 +460,8 @@ static inline void delete_from_swap_cach
 {
 }
 
-#define reuse_swap_page(page)	(page_mapcount(page) == 1)
+#define reuse_swap_page(page)		(page_mapcount(page) == 1)
+#define can_reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/mempolicy.h>	/* check_migrate_misplaced_page() */
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2962,6 +2963,22 @@ static int do_swap_page(struct mm_struct
 	}
 
 	/*
+	 * No sense in migrating a page that will be "COWed" as the
+	 * new page will be allocated according to effective mempolicy.
+	 */
+	if ((flags & FAULT_FLAG_WRITE) && can_reuse_swap_page(page)) {
+		/*
+		 * check for misplacement and migrate, if necessary/possible,
+		 * here and now.  Note that if we're racing with another thread,
+		 * we may end up discarding the migrated page after locking
+		 * the page table and checking the pte below.  However, we
+		 * don't want to hold the page table locked over migration, so
+		 * we'll live with that [unlikely, one hopes] possibility.
+		 */
+		page = check_migrate_misplaced_page(page, vma, address);
+	}
+
+	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -640,6 +640,19 @@ int reuse_swap_page(struct page *page)
 	return count <= 1;
 }
 
+int can_reuse_swap_page(struct page *page)
+{
+	int count;
+
+	VM_BUG_ON(!PageLocked(page));
+	if (unlikely(PageKsm(page)))
+		return 0;
+	count = page_mapcount(page);
+	if (count <= 1 && PageSwapCache(page))
+		count += page_swapcount(page);
+	return count <= 1;
+}
+
 /*
  * If swap is getting full, or if there are no more mappings of this page,
  * then try_to_free_swap is called to free its swap space.




* [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (6 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 07/26] mm: Handle misplaced anon pages Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node() Peter Zijlstra
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: mempol-simplify-do_mbind.patch --]
[-- Type: text/plain, Size: 3063 bytes --]

Code flow got a little convoluted, try and straighten it some.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempolicy.c |   73 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 38 insertions(+), 35 deletions(-)
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1054,9 +1054,9 @@ static long do_mbind(unsigned long start
 {
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
-	struct mempolicy *new;
+	struct mempolicy *new = NULL;
 	unsigned long end;
-	int err;
+	int err, nr_failed = 0;
 	LIST_HEAD(pagelist);
 
   	if (flags & ~(unsigned long)MPOL_MF_VALID)
@@ -1078,13 +1078,15 @@ static long do_mbind(unsigned long start
 	if (end == start)
 		return 0;
 
-	new = mpol_new(mode, mode_flags, nmask);
-	if (IS_ERR(new))
-		return PTR_ERR(new);
+	if (mode != MPOL_NOOP) {
+		new = mpol_new(mode, mode_flags, nmask);
+		if (IS_ERR(new))
+			return PTR_ERR(new);
 
-	if (flags & MPOL_MF_LAZY)
-		new->flags |= MPOL_F_MOF;
+		if (flags & MPOL_MF_LAZY)
+			new->flags |= MPOL_F_MOF;
 
+	}
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1097,56 +1099,57 @@ static long do_mbind(unsigned long start
 		 nmask ? nodes_addr(*nmask)[0] : -1);
 
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
-
 		err = migrate_prep();
 		if (err)
 			goto mpol_out;
 	}
-	{
+
+	down_write(&mm->mmap_sem);
+
+	if (mode != MPOL_NOOP) {
 		NODEMASK_SCRATCH(scratch);
+		err = -ENOMEM;
 		if (scratch) {
-			down_write(&mm->mmap_sem);
 			task_lock(current);
 			err = mpol_set_nodemask(new, nmask, scratch);
 			task_unlock(current);
-			if (err)
-				up_write(&mm->mmap_sem);
-		} else
-			err = -ENOMEM;
+		}
 		NODEMASK_SCRATCH_FREE(scratch);
+		if (err)
+			goto mpol_out_unlock;
 	}
-	if (err)
-		goto mpol_out;
 
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma) && mode != MPOL_NOOP)
-		err = mbind_range(mm, start, end, new);
+	if (IS_ERR(vma))
+		goto mpol_out_unlock;
 
-	if (!err) {
-		int nr_failed = 0;
+	if (mode != MPOL_NOOP) {
+		err = mbind_range(mm, start, end, new);
+		if (err)
+			goto mpol_out_unlock;
+	}
 
-		if (!list_empty(&pagelist)) {
-			if (flags & MPOL_MF_LAZY)
-				nr_failed = migrate_pages_unmap_only(&pagelist);
-			else {
-				nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma,
-						false, true);
-			}
-			if (nr_failed)
-				putback_lru_pages(&pagelist);
+	if (!list_empty(&pagelist)) {
+		if (flags & MPOL_MF_LAZY)
+			nr_failed = migrate_pages_unmap_only(&pagelist);
+		else {
+			nr_failed = migrate_pages(&pagelist, new_vma_page,
+					(unsigned long)vma,
+					false, true);
 		}
+	}
 
-		if (nr_failed && (flags & MPOL_MF_STRICT))
-			err = -EIO;
-	} else
-		putback_lru_pages(&pagelist);
+	if (nr_failed && (flags & MPOL_MF_STRICT))
+		err = -EIO;
+
+	putback_lru_pages(&pagelist);
 
+mpol_out_unlock:
 	up_write(&mm->mmap_sem);
- mpol_out:
+mpol_out:
 	mpol_put(new);
 	return err;
 }




* [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (7 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-1.patch --]
[-- Type: text/plain, Size: 3192 bytes --]

Introduce the home-node concept for tasks. In order to keep memory
locality we need something to stay local to; we define the home-node of
a task as the node we prefer to allocate memory from and prefer to
execute on.

These are not hard guarantees, merely preferences. This allows for
optimal resource usage: we can run a task away from the home-node -- the
remote memory hit, while expensive, is less expensive than not running
at all, or only very little, due to severe cpu overload.

Similarly, we can allocate memory from another node if our home-node
is depleted, again, some memory is better than no memory.

This patch merely introduces the basic infrastructure; all policy
comes later.
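
For illustration, a minimal sketch (not part of this patch; the helper
name is invented) of how a caller could treat the home-node as a soft
allocation preference, falling back to the local node when none has
been assigned:

	/*
	 * Sketch only: consult tsk_home_node() as a soft preference;
	 * -1 means "no home-node assigned", so fall back to the node
	 * we are currently running on.
	 */
	static inline int preferred_alloc_node(struct task_struct *p)
	{
		int node = tsk_home_node(p);

		if (node == -1)
			node = numa_node_id();

		return node;
	}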

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |    6 ++++++
 kernel/sched/core.c       |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -127,6 +127,13 @@ extern struct cred init_cred;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -192,6 +199,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1541,6 +1541,7 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+	int node;
 #endif
 	struct rcu_head rcu;
 
@@ -1615,6 +1616,11 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+	return p->node;
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5874,6 +5874,38 @@ __setup("isolcpus=", isolated_cpu_setup)
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Requeues a task ensuring it's on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- it's all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
 /**
  * find_next_best_node - find the next node to include in a sched_domain
  * @node: node whose sched_domain we're building



^ permalink raw reply	[flat|nested] 153+ messages in thread

* [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (8 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 18:34   ` Christoph Lameter
  2012-03-16 14:40 ` [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma Peter Zijlstra
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-2.patch --]
[-- Type: text/plain, Size: 2248 bytes --]

Add another layer of fallback policy to make the home node concept
useful from a memory allocation PoV.

This changes the mpol order to:

 - vma->vm_ops->get_policy	[if applicable]
 - vma->vm_policy		[if applicable]
 - task->mempolicy
 - tsk_home_node() preferred	[NEW]
 - default_policy

Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
facilitate efficient on-demand memory migration.
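
For illustration, the effective lookup order can be written out as one
function (a sketch only, paraphrasing get_vma_policy()/get_task_policy()
from within mempolicy.c; effective_policy() is an invented name):

	static struct mempolicy *effective_policy(struct task_struct *task,
			struct vm_area_struct *vma, unsigned long addr)
	{
		struct mempolicy *pol = NULL;
		int node = tsk_home_node(task);

		if (vma && vma->vm_ops && vma->vm_ops->get_policy)
			pol = vma->vm_ops->get_policy(vma, addr);	/* 1 */
		else if (vma && vma->vm_policy)
			pol = vma->vm_policy;				/* 2 */

		if (!pol)
			pol = task->mempolicy;				/* 3 */
		if (!pol && node != -1)
			pol = &preferred_node_policy[node];		/* 4 new */
		if (!pol)
			pol = &default_policy;				/* 5 */

		return pol;
	}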

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempolicy.c |   29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,22 @@ static struct mempolicy default_policy =
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = tsk_home_node(p);
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1478,7 +1494,7 @@ asmlinkage long compat_sys_mbind(compat_
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1856,7 +1872,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
@@ -2302,6 +2318,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or



^ permalink raw reply	[flat|nested] 153+ messages in thread

* [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (9 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-3.patch --]
[-- Type: text/plain, Size: 1804 bytes --]

Provide simple functions to lazily migrate a process (or part thereof).
These will be used to implement memory migration for NUMA process
migration.
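
As a usage sketch (not from this patch; the wrapper name is invented),
changing a process' home-node to @node queues all of its migratable
anonymous memory for migrate-on-fault like so -- the NUMA balancer
later in this series does exactly this from process_mem_migrate():

	/*
	 * Sketch: unmap anonymous pages that are not on @node; they are
	 * physically moved only when next faulted (Migrate-on-Fault).
	 */
	static void rehome_process_memory(struct mm_struct *mm, int node)
	{
		lazy_migrate_process(mm, node);
	}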

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    3 +++
 mm/mempolicy.c            |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -250,6 +250,9 @@ extern int vma_migratable(struct vm_area
 
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
 
+extern void lazy_migrate_vma(struct vm_area_struct *vma, int node);
+extern void lazy_migrate_process(struct mm_struct *mm, int node);
+
 #else
 
 struct mempolicy {};
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1173,6 +1173,46 @@ static long do_mbind(unsigned long start
 	return err;
 }
 
+void lazy_migrate_vma(struct vm_area_struct *vma, int node)
+{
+	nodemask_t nmask = nodemask_of_node(node);
+	LIST_HEAD(pagelist);
+
+	struct mempol_walk_data data = {
+		.nodes = &nmask,
+		.flags = MPOL_MF_MOVE | MPOL_MF_INVERT, /* move all pages not in set */
+		.private = &pagelist,
+		.vma = vma,
+	};
+
+	struct mm_walk walk = {
+		.pte_entry = check_pte_entry,
+		.mm = vma->vm_mm,
+		.private = &data,
+	};
+
+	if (vma->vm_file)
+		return;
+
+	if (!vma_migratable(vma))
+		return;
+
+	if (!walk_page_range(vma->vm_start, vma->vm_end, &walk))
+		migrate_pages_unmap_only(&pagelist);
+
+	putback_lru_pages(&pagelist);
+}
+
+void lazy_migrate_process(struct mm_struct *mm, int node)
+{
+	struct vm_area_struct *vma;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next)
+		lazy_migrate_vma(vma, node);
+	up_read(&mm->mmap_sem);
+}
+
 /*
  * User space interface with variable sized bitmaps for nodelists.
  */



^ permalink raw reply	[flat|nested] 153+ messages in thread

* [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (10 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-06-15 18:16   ` Tony Luck
  2012-03-16 14:40 ` [RFC][PATCH 13/26] sched: Implement home-node awareness Peter Zijlstra
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-4.patch --]
[-- Type: text/plain, Size: 3464 bytes --]

Rework the scheduler fork,exec hooks to allow home-node assignment.

In particular:
  - call sched_fork() after the mm is set up and the thread
    group list is initialized (such that we can iterate the mm_owner
    thread group).
  - call sched_exec() after we've got our fresh mm.
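
The resulting call order, shown as an outline (illustrative and
condensed from the hunks below, not literal code):

	copy_process():
		...
		p->clear_child_tid = ...;
		INIT_LIST_HEAD(&p->thread_group);
		sched_fork(p);		/* mm and thread-group list are ready */

	do_execve_common():
		...
		bprm_mm_init(bprm);	/* fresh mm set up */
		...
		sched_exec(bprm->mm);	/* home-node selection sees the new mm */
		bprm->argc = count(argv, MAX_ARG_STRINGS);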

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c             |    4 ++--
 include/linux/sched.h |    4 ++--
 kernel/fork.c         |    9 +++++----
 kernel/sched/core.c   |    7 +++++--
 kernel/sched/sched.h  |    2 ++
 5 files changed, 16 insertions(+), 10 deletions(-)
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1505,8 +1505,6 @@ static int do_execve_common(const char *
 	if (IS_ERR(file))
 		goto out_unmark;
 
-	sched_exec();
-
 	bprm->file = file;
 	bprm->filename = filename;
 	bprm->interp = filename;
@@ -1515,6 +1513,8 @@ static int do_execve_common(const char *
 	if (retval)
 		goto out_file;
 
+	sched_exec(bprm->mm);
+
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
 	if ((retval = bprm->argc) < 0)
 		goto out;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1999,9 +1999,9 @@ task_sched_runtime(struct task_struct *t
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+extern void sched_exec(struct mm_struct *mm);
 #else
-#define sched_exec()   {}
+#define sched_exec(mm)   {}
 #endif
 
 extern void sched_clock_idle_sleep_event(void);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1229,9 +1229,6 @@ static struct task_struct *copy_process(
 	p->memcg_batch.memcg = NULL;
 #endif
 
-	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
-
 	retval = perf_event_init_task(p);
 	if (retval)
 		goto bad_fork_cleanup_policy;
@@ -1284,6 +1281,11 @@ static struct task_struct *copy_process(
 	 * Clear TID on mm_release()?
 	 */
 	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
+	INIT_LIST_HEAD(&p->thread_group);
+	/* Perform scheduler related setup. Assign this task to a CPU. */
+	sched_fork(p);
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1326,7 +1328,6 @@ static struct task_struct *copy_process(
 	 * We dont wake it up yet.
 	 */
 	p->group_leader = p;
-	INIT_LIST_HEAD(&p->thread_group);
 
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1767,8 +1767,9 @@ void sched_fork(struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 #endif
-
 	put_cpu();
+
+	select_task_node(p, p->mm, SD_BALANCE_FORK);
 }
 
 /*
@@ -2507,12 +2508,14 @@ static void update_cpu_load_active(struc
  * sched_exec - execve() is a valuable balancing opportunity, because at
  * this point the task has the smallest effective memory and cache footprint.
  */
-void sched_exec(void)
+void sched_exec(struct mm_struct *mm)
 {
 	struct task_struct *p = current;
 	unsigned long flags;
 	int dest_cpu;
 
+	select_task_node(p, mm, SD_BALANCE_EXEC);
+
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1153,3 +1153,5 @@ enum rq_nohz_flag_bits {
 
 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
 #endif
+
+static inline void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags) { }



^ permalink raw reply	[flat|nested] 153+ messages in thread

* [RFC][PATCH 13/26] sched: Implement home-node awareness
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (11 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-5.patch --]
[-- Type: text/plain, Size: 21093 bytes --]

Implement home node preference in the load-balancer.

This is done in four pieces:

 - task_numa_hot(); make it harder to migrate tasks away from their
   home-node, controlled using the NUMA_HOT feature flag.

 - select_task_rq_fair(); prefer placing the task on its home-node,
   controlled using the NUMA_BIAS feature flag.

 - load_balance(); during the regular pull load-balance pass, try
   pulling tasks that are on the wrong node first with a preference
   for moving them nearer to their home-node through task_numa_hot(),
   controlled through the NUMA_PULL feature flag.

 - load_balance(); when the balancer finds no imbalance, introduce
   some such that it still prefers to move tasks towards their
   home-node, using active load-balance if needed, controlled through
   the NUMA_PULL_BIAS feature flag.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
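
Conceptually, the enqueue side of that split looks as follows (a
simplified sketch with an invented name -- the real accounting is done
by account_numa_enqueue()/account_numa_dequeue(), stubbed in this patch
and filled in by the numa balancer patch later in the series):

	/*
	 * Sketch: keep tasks that run away from their home-node on a
	 * separate per-rq list so the pull balancer can find candidates
	 * without scanning every runnable task.
	 */
	static void numa_enqueue_sketch(struct rq *rq, struct task_struct *p)
	{
		int home = tsk_home_node(p);

		if (home != -1 && home != cpu_to_node(cpu_of(rq)))
			list_add_tail(&p->se.group_node, &rq->offnode_tasks);
		else
			list_add_tail(&p->se.group_node, &rq->cfs_tasks);
	}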

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |   22 +++
 kernel/sched/debug.c    |    3 
 kernel/sched/fair.c     |  299 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    7 +
 kernel/sched/sched.h    |    9 +
 6 files changed, 299 insertions(+), 42 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -850,6 +850,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5806,7 +5806,9 @@ static void destroy_sched_domains(struct
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_id);
 
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
@@ -5817,6 +5819,17 @@ static void update_top_cache_domain(int
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_id, cpu) = id;
+
+	for_each_domain(cpu, sd) {
+		if (cpumask_equal(sched_domain_span(sd),
+				  cpumask_of_node(cpu_to_node(cpu))))
+			goto got_node;
+	}
+	sd = NULL;
+got_node:
+	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
+	if (sd) for (sd = sd->parent; sd; sd = sd->parent)
+		sd->flags |= SD_NUMA;
 }
 
 /*
@@ -5859,7 +5872,7 @@ cpu_attach_domain(struct sched_domain *s
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
-	update_top_cache_domain(cpu);
+	update_domain_cache(cpu);
 }
 
 /* cpus with isolated domains */
@@ -7012,6 +7025,11 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_NUMA
+		INIT_LIST_HEAD(&rq->offnode_tasks);
+		rq->offnode_running = 0;
+		rq->offnode_weight = 0;
+#endif
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_NUMA
+	SEQ_printf(m, " %d/%d", p->node, cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/random.h>
 
 #include <trace/events/sched.h>
 
@@ -783,8 +784,10 @@ account_entity_enqueue(struct cfs_rq *cf
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		if (!account_numa_enqueue(task_of(se)))
+			list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -795,8 +798,10 @@ account_entity_dequeue(struct cfs_rq *cf
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
 		list_del_init(&se->group_node);
+		account_numa_dequeue(task_of(se));
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -2702,6 +2707,7 @@ select_task_rq_fair(struct task_struct *
 	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
+	int node = tsk_home_node(p);
 
 	if (p->rt.nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -2713,6 +2719,29 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	rcu_read_lock();
+	if (sched_feat(NUMA_BIAS) && node != -1) {
+		int node_cpu;
+
+		node_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpumask_of_node(node));
+		if (node_cpu >= nr_cpu_ids)
+			goto find_sd;
+
+		/*
+		 * For fork,exec find the idlest cpu in the home-node.
+		 */
+		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
+			new_cpu = cpu = node_cpu;
+			sd = per_cpu(sd_node, cpu);
+			goto pick_idlest;
+		}
+
+		/*
+		 * For wake, pretend we were running in the home-node.
+		 */
+		prev_cpu = node_cpu;
+	}
+
+find_sd:
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -2769,6 +2798,7 @@ select_task_rq_fair(struct task_struct *
 		goto unlock;
 	}
 
+pick_idlest:
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -3085,6 +3115,8 @@ struct lb_env {
 	long			load_move;
 	unsigned int		flags;
 
+	struct list_head	*tasks;
+
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
@@ -3102,6 +3134,30 @@ static void move_task(struct task_struct
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
+#ifdef CONFIG_NUMA
+static int task_numa_hot(struct task_struct *p, int from_cpu, int to_cpu)
+{
+	int from_dist, to_dist;
+	int node = tsk_home_node(p);
+
+	if (!sched_feat(NUMA_HOT) || node == -1)
+		return 0; /* no node preference */
+
+	from_dist = node_distance(cpu_to_node(from_cpu), node);
+	to_dist = node_distance(cpu_to_node(to_cpu), node);
+
+	if (to_dist < from_dist)
+		return 0; /* getting closer is ok */
+
+	return 1; /* stick to where we are */
+}
+#else
+static inline int task_numa_hot(struct task_struct *p, int from_cpu, int to_cpu)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 /*
  * Is this task likely cache-hot:
  */
@@ -3165,6 +3221,7 @@ int can_migrate_task(struct task_struct
 	 */
 
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	tsk_cache_hot |= task_numa_hot(p, env->src_cpu, env->dst_cpu);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
@@ -3190,11 +3247,11 @@ int can_migrate_task(struct task_struct
  *
  * Called with both runqueues locked.
  */
-static int move_one_task(struct lb_env *env)
+static int __move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
-	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3213,6 +3270,21 @@ static int move_one_task(struct lb_env *
 	return 0;
 }
 
+static int move_one_task(struct lb_env *env)
+{
+	if (sched_feat(NUMA_PULL)) {
+		env->tasks = &env->src_rq->offnode_tasks;
+		if (__move_one_task(env))
+			return 1;
+	}
+
+	env->tasks = &env->src_rq->cfs_tasks;
+	if (__move_one_task(env))
+		return 1;
+
+	return 0;
+}
+
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -3224,7 +3296,6 @@ static unsigned long task_h_load(struct
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3232,8 +3303,9 @@ static int move_tasks(struct lb_env *env
 	if (env->load_move <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+	while (!list_empty(env->tasks)) {
+		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3244,7 +3316,7 @@ static int move_tasks(struct lb_env *env
 		if (env->loop > env->loop_break) {
 			env->loop_break += sysctl_sched_nr_migrate;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3272,7 +3344,7 @@ static int move_tasks(struct lb_env *env
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3280,13 +3352,20 @@ static int move_tasks(struct lb_env *env
 		 * weighted load.
 		 */
 		if (env->load_move <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, env->tasks);
 	}
 
+	if (env->tasks == &env->src_rq->offnode_tasks) {
+		env->tasks = &env->src_rq->cfs_tasks;
+		env->loop = 0;
+		goto again;
+	}
+
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3441,6 +3520,15 @@ struct sd_lb_stats {
 	unsigned long leader_nr_running; /* Nr running of group_leader */
 	unsigned long min_nr_running; /* Nr running of group_min */
 #endif
+#ifdef CONFIG_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+#endif
+
+	struct rq *(*find_busiest_queue)(struct sched_domain *sd,
+			struct sched_group *group, enum cpu_idle_type idle,
+			unsigned long imbalance, const struct cpumask *cpus);
 };
 
 /*
@@ -3456,6 +3544,10 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA
+	unsigned long numa_weight;
+	unsigned long numa_running;
+#endif
 };
 
 /**
@@ -3625,6 +3717,117 @@ static inline int check_power_save_busie
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
+#ifdef CONFIG_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->numa_weight += rq->offnode_weight;
+	sgs->numa_running += rq->offnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ */
+static inline bool pick_numa_rand(void)
+{
+	return get_random_int() & 1;
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group)
+		return;
+
+	if (!sgs->numa_running)
+		return;
+
+	if (!sds->numa_group_running || pick_numa_rand()) {
+		sds->numa_group = group;
+		sds->numa_group_weight = sgs->numa_weight;
+		sds->numa_group_running = sgs->numa_running;
+	}
+}
+
+static struct rq *
+find_busiest_numa_queue(struct sched_domain *sd, struct sched_group *group,
+		   enum cpu_idle_type idle, unsigned long imbalance,
+		   const struct cpumask *cpus)
+{
+	struct rq *busiest = NULL, *rq;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(group), cpus) {
+		rq = cpu_rq(cpu);
+		if (!rq->offnode_running)
+			continue;
+		if (!busiest || pick_numa_rand())
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+static inline int check_numa_busiest_group(struct sd_lb_stats *sds,
+		int this_cpu, unsigned long *imbalance)
+{
+	if (!sched_feat(NUMA_PULL_BIAS))
+		return 0;
+
+	if (!sds->numa_group)
+		return 0;
+
+	*imbalance = sds->numa_group_weight / sds->numa_group_running;
+	sds->busiest = sds->numa_group;
+	sds->find_busiest_queue = find_busiest_numa_queue;
+	return 1;
+}
+
+static inline
+bool need_active_numa_balance(struct sched_domain *sd, struct rq *busiest)
+{
+	/*
+	 * Not completely fail-safe, but it's a fair bet that if we're at a
+	 * rq that only has one task, and it's offnode, we're here through
+	 * find_busiest_numa_queue(). In any case, we want to kick such tasks.
+	 */
+	if ((sd->flags & SD_NUMA) && busiest->offnode_running == 1 &&
+			busiest->nr_running == 1)
+		return true;
+
+	return false;
+}
+
+#else /* CONFIG_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline int check_numa_busiest_group(struct sd_lb_stats *sds,
+		int this_cpu, unsigned long *imbalance)
+{
+	return 0;
+}
+
+static inline
+bool need_active_numa_balance(struct sched_domain *sd, struct rq *busiest)
+{
+	return false;
+}
+#endif /* CONFIG_NUMA */
 
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
@@ -3816,6 +4019,8 @@ static inline void update_sg_lb_stats(st
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		update_sg_numa_stats(sgs, rq);
 	}
 
 	/*
@@ -3977,6 +4182,8 @@ static inline void update_sd_lb_stats(st
 		}
 
 		update_sd_power_savings_stats(sg, sds, local_group, &sgs);
+		update_sd_numa_stats(sd, sg, sds, local_group, &sgs);
+
 		sg = sg->next;
 	} while (sg != sd->groups);
 }
@@ -4192,19 +4399,16 @@ static inline void calculate_imbalance(s
  *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *
-find_busiest_group(struct sched_domain *sd, int this_cpu,
-		   unsigned long *imbalance, enum cpu_idle_type idle,
-		   const struct cpumask *cpus, int *balance)
+find_busiest_group(struct sched_domain *sd, struct sd_lb_stats *sds,
+		   int this_cpu, unsigned long *imbalance,
+		   enum cpu_idle_type idle, const struct cpumask *cpus,
+		   int *balance)
 {
-	struct sd_lb_stats sds;
-
-	memset(&sds, 0, sizeof(sds));
-
 	/*
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
 	 */
-	update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, &sds);
+	update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, sds);
 
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
@@ -4214,40 +4418,40 @@ find_busiest_group(struct sched_domain *
 		goto ret;
 
 	if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
-	    check_asym_packing(sd, &sds, this_cpu, imbalance))
-		return sds.busiest;
+	    check_asym_packing(sd, sds, this_cpu, imbalance))
+		return sds->busiest;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds->busiest || sds->busiest_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+	sds->avg_load = (SCHED_POWER_SCALE * sds->total_load) / sds->total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
 	 * work because they assumes all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
-	if (sds.group_imb)
+	if (sds->group_imb)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
-			!sds.busiest_has_capacity)
+	if (idle == CPU_NEWLY_IDLE && sds->this_has_capacity &&
+			!sds->busiest_has_capacity)
 		goto force_balance;
 
 	/*
 	 * If the local group is more busy than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (sds.this_load >= sds.max_load)
+	if (sds->this_load >= sds->max_load)
 		goto out_balanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	if (sds.this_load >= sds.avg_load)
+	if (sds->this_load >= sds->avg_load)
 		goto out_balanced;
 
 	if (idle == CPU_IDLE) {
@@ -4257,30 +4461,33 @@ find_busiest_group(struct sched_domain *
 		 * there is no imbalance between this and busiest group
 		 * wrt to idle cpu's, it is balanced.
 		 */
-		if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
-		    sds.busiest_nr_running <= sds.busiest_group_weight)
+		if ((sds->this_idle_cpus <= sds->busiest_idle_cpus + 1) &&
+		    sds->busiest_nr_running <= sds->busiest_group_weight)
 			goto out_balanced;
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 		 * imbalance_pct to be conservative.
 		 */
-		if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
+		if (100 * sds->max_load <= sd->imbalance_pct * sds->this_load)
 			goto out_balanced;
 	}
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-	calculate_imbalance(&sds, this_cpu, imbalance);
-	return sds.busiest;
+	calculate_imbalance(sds, this_cpu, imbalance);
+	return sds->busiest;
 
 out_balanced:
+	if (check_numa_busiest_group(sds, this_cpu, imbalance))
+		return sds->busiest;
+
 	/*
 	 * There is no obvious imbalance. But check if we can do some balancing
 	 * to save power.
 	 */
-	if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
-		return sds.busiest;
+	if (check_power_save_busiest_group(sds, this_cpu, imbalance))
+		return sds->busiest;
 ret:
 	*imbalance = 0;
 	return NULL;
@@ -4347,9 +4554,11 @@ find_busiest_queue(struct sched_domain *
 DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
 
 static int need_active_balance(struct sched_domain *sd, int idle,
-			       int busiest_cpu, int this_cpu)
+			       struct rq *busiest, struct rq *this)
 {
 	if (idle == CPU_NEWLY_IDLE) {
+		int busiest_cpu = cpu_of(busiest);
+		int this_cpu = cpu_of(this);
 
 		/*
 		 * ASYM_PACKING needs to force migrate tasks from busy but
@@ -4382,6 +4591,9 @@ static int need_active_balance(struct sc
 			return 0;
 	}
 
+	if (need_active_numa_balance(sd, busiest))
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -4401,6 +4613,7 @@ static int load_balance(int this_cpu, st
 	struct rq *busiest;
 	unsigned long flags;
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
+	struct sd_lb_stats sds;
 
 	struct lb_env env = {
 		.sd		= sd,
@@ -4412,10 +4625,12 @@ static int load_balance(int this_cpu, st
 
 	cpumask_copy(cpus, cpu_active_mask);
 
+	memset(&sds, 0, sizeof(sds));
+	sds.find_busiest_queue = find_busiest_queue;
 	schedstat_inc(sd, lb_count[idle]);
 
 redo:
-	group = find_busiest_group(sd, this_cpu, &imbalance, idle,
+	group = find_busiest_group(sd, &sds, this_cpu, &imbalance, idle,
 				   cpus, balance);
 
 	if (*balance == 0)
@@ -4426,7 +4641,7 @@ static int load_balance(int this_cpu, st
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(sd, group, idle, imbalance, cpus);
+	busiest = sds.find_busiest_queue(sd, group, idle, imbalance, cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -4449,6 +4664,10 @@ static int load_balance(int this_cpu, st
 		env.src_cpu = busiest->cpu;
 		env.src_rq = busiest;
 		env.loop_max = busiest->nr_running;
+		if (sched_feat(NUMA_PULL))
+			env.tasks = &busiest->offnode_tasks;
+		else
+			env.tasks = &busiest->cfs_tasks;
 
 more_balance:
 		local_irq_save(flags);
@@ -4490,7 +4709,7 @@ static int load_balance(int this_cpu, st
 		if (idle != CPU_NEWLY_IDLE)
 			sd->nr_balance_failed++;
 
-		if (need_active_balance(sd, idle, cpu_of(busiest), this_cpu)) {
+		if (need_active_balance(sd, idle, busiest, this_rq)) {
 			raw_spin_lock_irqsave(&busiest->lock, flags);
 
 			/* don't kick the active_load_balance_cpu_stop,
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,3 +68,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
+
+#ifdef CONFIG_NUMA
+SCHED_FEAT(NUMA_HOT,       true)
+SCHED_FEAT(NUMA_BIAS,      true)
+SCHED_FEAT(NUMA_PULL,      true)
+SCHED_FEAT(NUMA_PULL_BIAS, true)
+#endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,12 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA
+	unsigned long    offnode_running;
+	unsigned long	 offnode_weight;
+	struct list_head offnode_tasks;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -525,6 +531,7 @@ static inline struct sched_domain *highe
 
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
 
 #endif /* CONFIG_SMP */
 
@@ -1158,3 +1165,5 @@ enum rq_nohz_flag_bits {
 #endif
 
 static inline void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags) { }
+static inline bool account_numa_enqueue(struct task_struct *p) { return false; }
+static inline void account_numa_dequeue(struct task_struct *p) { }



^ permalink raw reply	[flat|nested] 153+ messages in thread

* [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (12 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 13/26] sched: Implement home-node awareness Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-07-07 18:26   ` Rik van Riel
                     ` (2 more replies)
  2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
                   ` (14 subsequent siblings)
  28 siblings, 3 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-6.patch --]
[-- Type: text/plain, Size: 22982 bytes --]

Implement a NUMA process balancer that migrates processes across nodes
(it changes their home-node). This implies full memory migration.

Adds node-wide cpu load tracking in two measures: tasks that should
have run on this node and tasks that run away from their node (the
former includes the latter). We use the latter measure as an indication
that the node is overloaded and use the former to compute the cpu
imbalance.

Adds node-wide memory load tracking in two measures: the rate of page
allocations that miss the preferred node (NUMA_FOREIGN) and an
absolute measure of pages used on the node (NR_ANON_PAGES +
NR_ACTIVE_FILE). We use the first as an indication that the node is
overloaded with memory and use the second to compute the imbalance.

For the process memory load measure we use RSS (MM_ANONPAGES); this
is comparable to our absolute memory load (both are in pages). For the
process cpu load we use the sum of load over the process thread group.

Using all this information we build two main functions:

 - select_task_node(); this is run on fork and exec and finds a
   suitable node for the 'new' process. This is a typical least-loaded
   node scan, controlled through the NUMA_SELECT feature flag.

 - numa_balance(); an active pull-based node load-balancer that tries
   to balance node cpu usage against node mem usage (the imbalance
   sizing is sketched below), controlled through the NUMA_BALANCE
   feature flag. XXX needs TLC
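
Both imbalances are sized as half the difference between the busiest
node and ourselves, so two nodes converge instead of ping-ponging
processes; e.g. a busiest cpu_load of 3072 against our 1024 yields
(3072 - 1024) / 2 = 1024, roughly one nice-0 task's worth of weighted
load. A condensed sketch (invented helper name, mirroring the calc_imb
path of find_busiest_node() below):

	/*
	 * Sketch only: size the cpu and memory imbalance as half the
	 * difference between the busiest node and this node.
	 */
	static void size_imbalance_sketch(struct numa_imbalance *imb,
					  struct node_queue *busiest,
					  struct node_queue *self)
	{
		imb->cpu = (long)(busiest->cpu_load - self->cpu_load) / 2;
		imb->mem_load = node_pages_load(self->node);
		imb->mem = (long)(node_pages_load(busiest->node) -
				  imb->mem_load) / 2;
	}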

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    8 
 include/linux/sched.h    |   13 
 init/Kconfig             |    2 
 kernel/fork.c            |    2 
 kernel/sched/Makefile    |    2 
 kernel/sched/core.c      |    1 
 kernel/sched/fair.c      |    6 
 kernel/sched/features.h  |    4 
 kernel/sched/numa.c      |  735 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |   16 +
 mm/init-mm.c             |   10 
 11 files changed, 793 insertions(+), 6 deletions(-)
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -285,6 +285,13 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+struct numa_entity {
+#ifdef CONFIG_NUMA
+	int		 node;		/* home node */
+	struct list_head numa_entry;	/* balance list */
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -388,6 +395,7 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+	struct numa_entity numa;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1270,6 +1270,11 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
+#ifdef CONFIG_NUMA
+	unsigned long	 numa_contrib;
+	int		 numa_remote;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
 	struct hlist_head preempt_notifiers;
@@ -2818,6 +2823,14 @@ static inline unsigned long rlimit_max(u
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_NUMA
+void mm_init_numa(struct mm_struct *mm);
+void exit_numa(struct mm_struct *mm);
+#else
+static inline void mm_init_numa(struct mm_struct *mm) { }
+static inline void exit_numa(struct mm_struct *mm) { }
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -866,7 +866,7 @@ config SCHED_AUTOGROUP
 	  upon task session.
 
 config MM_OWNER
-	bool
+	def_bool NUMA
 
 config SYSFS_DEPRECATED
 	bool "Enable deprecated sysfs features to support old userspace tools"
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -501,6 +501,7 @@ static struct mm_struct *mm_init(struct
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	mm_init_numa(mm);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
@@ -554,6 +555,7 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
+		exit_numa(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,5 +16,5 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
-
+obj-$(CONFIG_NUMA) += numa.o
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7078,6 +7078,7 @@ void __init sched_init(void)
 		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 #endif
 	init_sched_fair_class();
+	init_sched_numa();
 
 	scheduler_running = 1;
 }
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3285,8 +3285,6 @@ static int move_one_task(struct lb_env *
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 /*
  * move_tasks tries to move up to load_move weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3458,7 +3456,7 @@ static void update_h_load(long cpu)
 	rcu_read_unlock();
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+unsigned long task_h_load(struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
 	unsigned long load;
@@ -3477,7 +3475,7 @@ static inline void update_h_load(long cp
 {
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.load.weight;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,8 @@ SCHED_FEAT(NUMA_HOT,       true)
 SCHED_FEAT(NUMA_BIAS,      true)
 SCHED_FEAT(NUMA_PULL,      true)
 SCHED_FEAT(NUMA_PULL_BIAS, true)
+SCHED_FEAT(NUMA_BALANCE,   true)
+SCHED_FEAT(NUMA_BALANCE_FILTER, false)
+SCHED_FEAT(NUMA_SELECT,    true)
+SCHED_FEAT(NUMA_SLOW,      false)
 #endif
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,735 @@
+/*
+ * NUMA scheduler
+ *
+ *  Copyright (C) 2011-2012 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * With input and fixes from:
+ *
+ *  Ingo Molnar <mingo@elte.hu>
+ *  Bharata B Rao <bharata@linux.vnet.ibm.com>
+ *  Dan Smith <danms@us.ibm.com>
+ *
+ * For licensing details see kernel-base/COPYING
+ */
+
+#include <linux/mempolicy.h>
+#include <linux/kthread.h>
+
+#include "sched.h"
+
+
+static const int numa_balance_interval = 2 * HZ; /* 2 seconds */
+
+struct numa_cpu_load {
+	unsigned long	remote; /* load of tasks running away from their home node */
+	unsigned long	all;	/* load of tasks that should be running on this node */
+};
+
+static struct numa_cpu_load **numa_load_array;
+
+static struct {
+	spinlock_t		lock;
+	unsigned long		load;
+} max_mem_load = {
+	.lock = __SPIN_LOCK_UNLOCKED(max_mem_load.lock),
+	.load = 0,
+};
+
+/*
+ * Assumes symmetric NUMA -- that is, each node is of equal size.
+ */
+static void set_max_mem_load(unsigned long load)
+{
+	unsigned long old_load;
+
+	spin_lock(&max_mem_load.lock);
+	old_load = max_mem_load.load;
+	if (!old_load)
+		old_load = load;
+	max_mem_load.load = (old_load + load) >> 1;
+	spin_unlock(&max_mem_load.lock);
+}
+
+static unsigned long get_max_mem_load(void)
+{
+	return max_mem_load.load;
+}
+
+struct node_queue {
+	struct task_struct	*numad;
+
+	unsigned long		remote_cpu_load;
+	unsigned long		cpu_load;
+
+	unsigned long		prev_numa_foreign;
+	unsigned long		remote_mem_load;
+
+	spinlock_t		lock;
+	struct list_head	entity_list;
+	int			nr_processes;
+
+	unsigned long		next_schedule;
+	int			node;
+};
+
+static struct node_queue **nqs;
+
+static inline struct node_queue *nq_of(int node)
+{
+	return nqs[node];
+}
+
+static inline struct node_queue *this_nq(void)
+{
+	return nq_of(numa_node_id());
+}
+
+bool account_numa_enqueue(struct task_struct *p)
+{
+	int home_node = tsk_home_node(p);
+	int cpu = task_cpu(p);
+	int node = cpu_to_node(cpu);
+	struct rq *rq = cpu_rq(cpu);
+	struct numa_cpu_load *nl;
+	unsigned long load;
+
+	/*
+	 * not actually an auto-numa task, ignore
+	 */
+	if (home_node == -1)
+		return false;
+
+	load = task_h_load(p);
+	nl = this_cpu_ptr(numa_load_array[home_node]);
+	p->numa_remote = (node != home_node);
+	p->numa_contrib = load;
+	nl->all += load;
+	if (p->numa_remote)
+		nl->remote += load;
+
+	/*
+	 * the task is on its home-node, we're done, the rest is offnode
+	 * accounting.
+	 */
+	if (!p->numa_remote)
+		return false;
+
+	list_add_tail(&p->se.group_node, &rq->offnode_tasks);
+	rq->offnode_running++;
+	rq->offnode_weight += load;
+
+	return true;
+}
+
+void account_numa_dequeue(struct task_struct *p)
+{
+	int home_node = tsk_home_node(p);
+	struct numa_cpu_load *nl;
+	struct rq *rq;
+
+	/*
+	 * not actually an auto-numa task, ignore
+	 */
+	if (home_node == -1)
+		return;
+
+	nl = this_cpu_ptr(numa_load_array[home_node]);
+	nl->all -= p->numa_contrib;
+	if (p->numa_remote)
+		nl->remote -= p->numa_contrib;
+
+	/*
+	 * the task is on its home-node, we're done, the rest is offnode
+	 * accounting.
+	 */
+	if (!p->numa_remote)
+		return;
+
+	rq = task_rq(p);
+	rq->offnode_running--;
+	rq->offnode_weight -= p->numa_contrib;
+}
+
+static inline struct mm_struct *ne_mm(struct numa_entity *ne)
+{
+	return container_of(ne, struct mm_struct, numa);
+}
+
+static inline struct task_struct *ne_owner(struct numa_entity *ne)
+{
+	return rcu_dereference(ne_mm(ne)->owner);
+}
+
+static void process_cpu_migrate(struct numa_entity *ne, int node)
+{
+	struct task_struct *p, *t;
+
+	rcu_read_lock();
+	t = p = ne_owner(ne);
+	if (p) do {
+		sched_setnode(t, node);
+	} while ((t = next_thread(t)) != p);
+	rcu_read_unlock();
+}
+
+static void process_mem_migrate(struct numa_entity *ne, int node)
+{
+	lazy_migrate_process(ne_mm(ne), node);
+}
+
+static int process_tryget(struct numa_entity *ne)
+{
+	/*
+	 * This is possible when we hold &nq_of(ne->node)->lock since then
+	 * numa_exit() will block on that lock, we can't however write an
+	 * assertion to check this, since if we don't hold the lock that
+	 * expression isn't safe to evaluate.
+	 */
+	return atomic_inc_not_zero(&ne_mm(ne)->mm_users);
+}
+
+static void process_put(struct numa_entity *ne)
+{
+	mmput(ne_mm(ne));
+}
+
+static struct node_queue *lock_ne_nq(struct numa_entity *ne)
+{
+	struct node_queue *nq;
+	int node;
+
+	for (;;) {
+		node = ACCESS_ONCE(ne->node);
+		BUG_ON(node == -1);
+		nq = nq_of(node);
+
+		spin_lock(&nq->lock);
+		if (likely(ne->node == node))
+			break;
+		spin_unlock(&nq->lock);
+	}
+
+	return nq;
+}
+
+static void double_lock_nq(struct node_queue *nq1, struct node_queue *nq2)
+{
+	if (nq1 > nq2)
+		swap(nq1, nq2);
+
+	spin_lock(&nq1->lock);
+	if (nq2 != nq1)
+		spin_lock_nested(&nq2->lock, SINGLE_DEPTH_NESTING);
+}
+
+static void double_unlock_nq(struct node_queue *nq1, struct node_queue *nq2)
+{
+	if (nq1 > nq2)
+		swap(nq1, nq2);
+
+	if (nq2 != nq1)
+		spin_unlock(&nq2->lock);
+	spin_unlock(&nq1->lock);
+}
+
+static void __enqueue_ne(struct node_queue *nq, struct numa_entity *ne)
+{
+	ne->node = nq->node;
+	list_add_tail(&ne->numa_entry, &nq->entity_list);
+	nq->nr_processes++;
+}
+
+static void __dequeue_ne(struct node_queue *nq, struct numa_entity *ne)
+{
+	list_del(&ne->numa_entry);
+	nq->nr_processes--;
+	BUG_ON(nq->nr_processes < 0);
+}
+
+static void enqueue_ne(struct numa_entity *ne, int node)
+{
+	struct node_queue *nq = nq_of(node);
+
+	BUG_ON(ne->node != -1);
+
+	process_cpu_migrate(ne, node);
+	process_mem_migrate(ne, node);
+
+	spin_lock(&nq->lock);
+	__enqueue_ne(nq, ne);
+	spin_unlock(&nq->lock);
+}
+
+static void dequeue_ne(struct numa_entity *ne)
+{
+	struct node_queue *nq;
+
+	if (ne->node == -1) // XXX serialization
+		return;
+
+	nq = lock_ne_nq(ne);
+	ne->node = -1;
+	__dequeue_ne(nq, ne);
+	spin_unlock(&nq->lock);
+}
+
+static void init_ne(struct numa_entity *ne)
+{
+	ne->node = -1;
+}
+
+void mm_init_numa(struct mm_struct *mm)
+{
+	init_ne(&mm->numa);
+}
+
+void exit_numa(struct mm_struct *mm)
+{
+	dequeue_ne(&mm->numa);
+}
+
+static inline unsigned long node_pages_load(int node)
+{
+	unsigned long pages = 0;
+
+	pages += node_page_state(node, NR_ANON_PAGES);
+	pages += node_page_state(node, NR_ACTIVE_FILE);
+
+	return pages;
+}
+
+static int find_idlest_node(int this_node)
+{
+	unsigned long mem_load, cpu_load;
+	unsigned long min_cpu_load;
+	unsigned long this_cpu_load;
+	int min_node;
+	int node, cpu;
+
+	min_node = -1;
+	this_cpu_load = min_cpu_load = ULONG_MAX;
+
+	// XXX should be sched_domain aware
+	for_each_online_node(node) {
+		struct node_queue *nq = nq_of(node);
+		/*
+		 * Pick the node that has least cpu load provided there's no
+		 * foreign memory load.
+		 *
+		 * XXX if all nodes were to have foreign allocations we'd OOM,
+		 *     however check the low-pass filter in update_node_load().
+		 */
+		mem_load = nq->remote_mem_load;
+		if (mem_load)
+			continue;
+
+		cpu_load = 0;
+		for_each_cpu_mask(cpu, *cpumask_of_node(node))
+			cpu_load += cpu_rq(cpu)->load.weight;
+		cpu_load += nq->remote_cpu_load;
+
+		if (this_node == node)
+			this_cpu_load = cpu_load;
+
+		if (cpu_load < min_cpu_load) {
+			min_cpu_load = cpu_load;
+			min_node = node;
+		}
+	}
+
+	/*
+	 * If there's no choice, stick to where we are.
+	 */
+	if (min_node == -1)
+		return this_node;
+
+	/*
+	 * Add a little hysteresis so we don't hard-interleave over nodes
+	 * scattering workloads.
+	 */
+	if (this_cpu_load != ULONG_MAX && this_node != min_node) {
+		if (this_cpu_load * 100 < min_cpu_load * 110)
+			return this_node;
+	}
+
+	return min_node;
+}
+
+void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags)
+{
+	if (!sched_feat(NUMA_SELECT)) {
+		p->node = -1;
+		return;
+	}
+
+	if (!mm)
+		return;
+
+	/*
+	 * If there's an explicit task policy set, bail.
+	 */
+	if (p->flags & PF_MEMPOLICY) {
+		p->node = -1;
+		return;
+	}
+
+	if (sd_flags & SD_BALANCE_FORK) {
+		/* For new threads, set the home-node. */
+		if (mm == current->mm) {
+			p->node = mm->numa.node;
+			return;
+		}
+	}
+
+	enqueue_ne(&mm->numa, find_idlest_node(p->node));
+}
+
+__init void init_sched_numa(void)
+{
+	int node;
+
+	numa_load_array = kzalloc(sizeof(struct numa_cpu_load *) * nr_node_ids, GFP_KERNEL);
+	BUG_ON(!numa_load_array);
+
+	for_each_node(node) {
+		numa_load_array[node] = alloc_percpu(struct numa_cpu_load);
+		BUG_ON(!numa_load_array[node]);
+	}
+}
+
+static void add_load(unsigned long *load, unsigned long new_load)
+{
+	if (sched_feat(NUMA_SLOW)) {
+		*load = (*load + new_load) >> 1;
+		return;
+	}
+
+	*load = new_load;
+}
+
+/*
+ * Called every @numa_balance_interval to update current node state.
+ */
+static void update_node_load(struct node_queue *nq)
+{
+	unsigned long pages, delta;
+	struct numa_cpu_load l;
+	int cpu;
+
+	memset(&l, 0, sizeof(l));
+
+	/*
+	 * Aggregate per-cpu cpu-load values for this node as per
+	 * account_numa_{en,de}queue().
+	 *
+	 * XXX limit to max balance sched_domain
+	 */
+	for_each_online_cpu(cpu) {
+		struct numa_cpu_load *nl = per_cpu_ptr(numa_load_array[nq->node], cpu);
+
+		l.remote += nl->remote;
+		l.all += nl->all;
+	}
+
+	add_load(&nq->remote_cpu_load, l.remote);
+	add_load(&nq->cpu_load, l.all);
+
+	/*
+	 * Fold regular samples of NUMA_FOREIGN into a memory load measure.
+	 */
+	pages = node_page_state(nq->node, NUMA_FOREIGN);
+	delta = pages - nq->prev_numa_foreign;
+	nq->prev_numa_foreign = pages;
+	add_load(&nq->remote_mem_load, delta);
+
+	/*
+	 * If there was NUMA_FOREIGN load, that means this node was at its
+	 * maximum memory capacity, record that.
+	 */
+	set_max_mem_load(node_pages_load(nq->node));
+}
+
+enum numa_balance_type {
+	NUMA_BALANCE_NONE = 0,
+	NUMA_BALANCE_CPU  = 1,
+	NUMA_BALANCE_MEM  = 2,
+	NUMA_BALANCE_ALL  = 3,
+};
+
+struct numa_imbalance {
+	long cpu, mem;
+	long mem_load;
+	enum numa_balance_type type;
+};
+
+static unsigned long process_cpu_load(struct numa_entity *ne)
+{
+	unsigned long load = 0;
+	struct task_struct *t, *p;
+
+	rcu_read_lock();
+	t = p = ne_owner(ne);
+	if (p) do {
+		load += t->numa_contrib;
+	} while ((t = next_thread(t)) != p);
+	rcu_read_unlock();
+
+	return load;
+}
+
+static unsigned long process_mem_load(struct numa_entity *ne)
+{
+	return get_mm_counter(ne_mm(ne), MM_ANONPAGES);
+}
+
+static int find_busiest_node(int this_node, struct numa_imbalance *imb)
+{
+	unsigned long cpu_load, mem_load;
+	unsigned long max_cpu_load, max_mem_load;
+	unsigned long sum_cpu_load, sum_mem_load;
+	unsigned long mem_cpu_load, cpu_mem_load;
+	int cpu_node, mem_node;
+	struct node_queue *nq;
+	int node;
+
+	sum_cpu_load = sum_mem_load = 0;
+	max_cpu_load = max_mem_load = 0;
+	mem_cpu_load = cpu_mem_load = 0;
+	cpu_node = mem_node = -1;
+
+	/* XXX scalability -- sched_domain */
+	for_each_online_node(node) {
+		nq = nq_of(node);
+
+		cpu_load = nq->remote_cpu_load;
+		mem_load = nq->remote_mem_load;
+
+		/*
+		 * If this node is overloaded on memory, we don't want more
+		 * tasks, bail!
+		 */
+		if (node == this_node) {
+			if (mem_load)
+				return -1;
+		}
+
+		sum_cpu_load += cpu_load;
+		if (cpu_load > max_cpu_load) {
+			max_cpu_load = cpu_load;
+			cpu_mem_load = mem_load;
+			cpu_node = node;
+		}
+
+		sum_mem_load += mem_load;
+		if (mem_load > max_mem_load) {
+			max_mem_load = mem_load;
+			mem_cpu_load = cpu_load;
+			mem_node = node;
+		}
+	}
+
+	/*
+	 * Nobody had overload of any kind, cool we're done!
+	 */
+	if (cpu_node == -1 && mem_node == -1)
+		return -1;
+
+	if (mem_node == -1) {
+set_cpu_node:
+		node = cpu_node;
+		cpu_load = max_cpu_load;
+		mem_load = cpu_mem_load;
+		goto calc_imb;
+	}
+
+	if (cpu_node == -1) {
+set_mem_node:
+		node = mem_node;
+		cpu_load = mem_cpu_load;
+		mem_load = max_mem_load;
+		goto calc_imb;
+	}
+
+	/*
+	 * We have both cpu and mem overload, oh my! pick whichever is most
+	 * overloaded wrt the average.
+	 */
+	if ((u64)max_mem_load * sum_cpu_load > (u64)max_cpu_load * sum_mem_load)
+		goto set_mem_node;
+
+	goto set_cpu_node;
+
+calc_imb:
+	memset(imb, 0, sizeof(*imb));
+
+	if (cpu_node != -1) {
+		imb->type |= NUMA_BALANCE_CPU;
+		imb->cpu = (long)(nq_of(node)->cpu_load -
+				  nq_of(this_node)->cpu_load) / 2;
+	}
+
+	if (mem_node != -1) {
+		imb->type |= NUMA_BALANCE_MEM;
+		imb->mem_load = node_pages_load(this_node);
+		imb->mem = (long)(node_pages_load(node) - imb->mem_load) / 2;
+	}
+
+	return node;
+}
+
+static bool can_move_ne(struct numa_entity *ne)
+{
+	/*
+	 * XXX: consider mems_allowed, stinking cpusets has mems_allowed
+	 * per task and it can actually differ over a whole process, la-la-la.
+	 */
+	return true;
+}
+
+static void move_processes(struct node_queue *busiest_nq,
+			   struct node_queue *this_nq,
+			   struct numa_imbalance *imb)
+{
+	unsigned long max_mem_load = get_max_mem_load();
+	long cpu_moved = 0, mem_moved = 0;
+	struct numa_entity *ne;
+	long ne_mem, ne_cpu;
+	int loops;
+
+	double_lock_nq(this_nq, busiest_nq);
+	loops = busiest_nq->nr_processes;
+	while (!list_empty(&busiest_nq->entity_list) && loops--) {
+		ne = list_first_entry(&busiest_nq->entity_list,
+				     struct numa_entity,
+				     numa_entry);
+
+		ne_cpu = process_cpu_load(ne);
+		ne_mem = process_mem_load(ne);
+
+		if (sched_feat(NUMA_BALANCE_FILTER)) {
+			/*
+			 * Avoid moving ne's when we create a larger imbalance
+			 * on the other end.
+			 */
+			if ((imb->type & NUMA_BALANCE_CPU) &&
+			    imb->cpu - cpu_moved < ne_cpu / 2)
+				goto next;
+
+			/*
+			 * Avoid migrating ne's when we know we'll push our
+			 * node over the memory limit.
+			 */
+			if (max_mem_load &&
+			    imb->mem_load + mem_moved + ne_mem > max_mem_load)
+				goto next;
+		}
+
+		if (!can_move_ne(ne))
+			goto next;
+
+		__dequeue_ne(busiest_nq, ne);
+		__enqueue_ne(this_nq, ne);
+		if (process_tryget(ne)) {
+			double_unlock_nq(this_nq, busiest_nq);
+
+			process_cpu_migrate(ne, this_nq->node);
+			process_mem_migrate(ne, this_nq->node);
+
+			process_put(ne);
+			double_lock_nq(this_nq, busiest_nq);
+		}
+
+		cpu_moved += ne_cpu;
+		mem_moved += ne_mem;
+
+		if (imb->cpu - cpu_moved <= 0 &&
+		    imb->mem - mem_moved <= 0)
+			break;
+
+		continue;
+
+next:
+		list_move_tail(&ne->numa_entry, &busiest_nq->entity_list);
+	}
+	double_unlock_nq(this_nq, busiest_nq);
+}
+
+static void numa_balance(struct node_queue *this_nq)
+{
+	struct numa_imbalance imb;
+	int busiest;
+
+	busiest = find_busiest_node(this_nq->node, &imb);
+	if (busiest == -1)
+		return;
+
+	if (imb.cpu <= 0 && imb.mem <= 0)
+		return;
+
+	move_processes(nq_of(busiest), this_nq, &imb);
+}
+
+static int wait_for_next_balance(struct node_queue *nq)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	while (!kthread_should_stop()) {
+		long timeout = nq->next_schedule - jiffies;
+		if (timeout <= 0) {
+			__set_current_state(TASK_RUNNING);
+			return 1;
+		}
+		schedule_timeout(timeout);
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
+static int numad_thread(void *data)
+{
+	struct node_queue *nq = data;
+	struct task_struct *p = nq->numad;
+
+	set_cpus_allowed_ptr(p, cpumask_of_node(nq->node));
+
+	while (wait_for_next_balance(nq)) {
+
+		update_node_load(nq);
+
+		if (sched_feat(NUMA_BALANCE))
+			numa_balance(nq);
+
+		nq->next_schedule += numa_balance_interval;
+	}
+
+	return 0;
+}
+
+static __init int numa_init(void)
+{
+	int node;
+
+	nqs = kzalloc(sizeof(struct node_queue*) * nr_node_ids, GFP_KERNEL);
+	BUG_ON(!nqs);
+
+	for_each_node(node) { // XXX hotplug
+		struct node_queue *nq = kmalloc_node(sizeof(*nq),
+				GFP_KERNEL | __GFP_ZERO, node);
+		BUG_ON(!nq);
+
+		nq->numad = kthread_create_on_node(numad_thread,
+				nq, node, "numad/%d", node);
+		BUG_ON(IS_ERR(nq->numad));
+
+		spin_lock_init(&nq->lock);
+		INIT_LIST_HEAD(&nq->entity_list);
+
+		nq->next_schedule = jiffies + HZ;
+		nq->node = node;
+		nqs[node] = nq;
+
+		wake_up_process(nq->numad);
+	}
+
+	return 0;
+}
+early_initcall(numa_init);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1164,6 +1164,22 @@ enum rq_nohz_flag_bits {
 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
 #endif
 
+unsigned long task_h_load(struct task_struct *p);
+
+#ifdef CONFIG_NUMA
+
+void sched_setnode(struct task_struct *p, int node);
+void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags);
+bool account_numa_enqueue(struct task_struct *p);
+void account_numa_dequeue(struct task_struct *p);
+void init_sched_numa(void);
+
+#else /* CONFIG_NUMA */
+
 static inline void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags) { }
 static inline bool account_numa_enqueue(struct task_struct *p) { return false; }
 static inline void account_numa_dequeue(struct task_struct *p) { }
+static inline void init_sched_numa(void) { }
+
+#endif /* CONFIG_NUMA */
+
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -13,6 +13,15 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+#ifdef CONFIG_NUMA
+# define INIT_MM_NUMA(mm)						\
+	.numa = {							\
+		.node = -1,						\
+	},
+#else
+# define INIT_MM_NUMA(mm)
+#endif
+
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -22,4 +31,5 @@ struct mm_struct init_mm = {
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	INIT_MM_CONTEXT(init_mm)
+	INIT_MM_NUMA(init_mm)
 };




* [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (13 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-19 12:16   ` Srivatsa S. Bhat
  2012-03-16 14:40 ` [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity Peter Zijlstra
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-6a.patch --]
[-- Type: text/plain, Size: 2200 bytes --]

Start/stop the per-node NUMA balance threads on demand using CPU hotplug.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/numa.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 55 insertions(+), 7 deletions(-)
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -596,31 +596,79 @@ static int numad_thread(void *data)
 	return 0;
 }
 
+static int __cpuinit
+numa_hotplug(struct notifier_block *nb, unsigned long action, void *hcpu)
+{
+	int cpu = (long)hcpu;
+	int node = cpu_to_node(cpu);
+	struct node_queue *nq = nq_of(node);
+	struct task_struct *numad;
+	int err = 0;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		if (nq->numad)
+			break;
+
+		numad = kthread_create_on_node(numad_thread,
+				nq, node, "numad/%d", node);
+		if (IS_ERR(numad)) {
+			err = PTR_ERR(numad);
+			break;
+		}
+
+		nq->numad = numad;
+		nq->next_schedule = jiffies + HZ; // XXX sync-up?
+		break;
+
+	case CPU_ONLINE:
+		wake_up_process(nq->numad);
+		break;
+
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		if (!nq->numad)
+			break;
+
+		if (cpumask_any_and(cpu_online_mask,
+				    cpumask_of_node(node)) >= nr_cpu_ids) {
+			kthread_stop(nq->numad);
+			nq->numad = NULL;
+		}
+		break;
+	}
+
+	return notifier_from_errno(err);
+}
+
 static __init int numa_init(void)
 {
-	int node;
+	int node, cpu, err;
 
 	nqs = kzalloc(sizeof(struct node_queue*) * nr_node_ids, GFP_KERNEL);
 	BUG_ON(!nqs);
 
-	for_each_node(node) { // XXX hotplug
+	for_each_node(node) {
 		struct node_queue *nq = kmalloc_node(sizeof(*nq),
 				GFP_KERNEL | __GFP_ZERO, node);
 		BUG_ON(!nq);
 
-		nq->numad = kthread_create_on_node(numad_thread,
-				nq, node, "numad/%d", node);
-		BUG_ON(IS_ERR(nq->numad));
-
 		spin_lock_init(&nq->lock);
 		INIT_LIST_HEAD(&nq->entity_list);
 
 		nq->next_schedule = jiffies + HZ;
 		nq->node = node;
 		nqs[node] = nq;
+	}
 
-		wake_up_process(nq->numad);
+	get_online_cpus();
+	cpu_notifier(numa_hotplug, 0);
+	for_each_online_cpu(cpu) {
+		err = numa_hotplug(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
+		BUG_ON(notifier_to_errno(err));
+		numa_hotplug(NULL, CPU_ONLINE, (void *)(long)cpu);
 	}
+	put_online_cpus();
 
 	return 0;
 }




* [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (14 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 17/26] srcu: revert1 Peter Zijlstra
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-7.patch --]
[-- Type: text/plain, Size: 4780 bytes --]

In order to prepare the NUMA balancer for non-process entities, add a
further layer of abstraction: a numa_ops structure of load, migrate and
refcount callbacks supplied by each entity type.
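
To illustrate the intent, a minimal sketch of what a second user of
these callbacks could look like; the numa_group type and all ng_*()
helpers and fields below are hypothetical, not part of this patch:

  static unsigned long ng_mem_load(struct numa_entity *ne)
  {
          return ng_of(ne)->nr_anon_pages;        /* assumed field */
  }

  static unsigned long ng_cpu_load(struct numa_entity *ne)
  {
          return ng_of(ne)->cpu_contrib;          /* assumed field */
  }

  static void ng_mem_migrate(struct numa_entity *ne, int node) { /* ... */ }
  static void ng_cpu_migrate(struct numa_entity *ne, int node) { /* ... */ }

  static bool ng_tryget(struct numa_entity *ne)
  {
          return atomic_inc_not_zero(&ng_of(ne)->refcount);
  }

  static void ng_put(struct numa_entity *ne)
  {
          if (atomic_dec_and_test(&ng_of(ne)->refcount))
                  kfree(ng_of(ne));
  }

  static const struct numa_ops ng_numa_ops = {
          .mem_load       = ng_mem_load,
          .cpu_load       = ng_cpu_load,
          .mem_migrate    = ng_mem_migrate,
          .cpu_migrate    = ng_cpu_migrate,
          .tryget         = ng_tryget,
          .put            = ng_put,
  };

  /* ...and at entity creation: init_ne(&ng->numa_entity, &ng_numa_ops); */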

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    5 +-
 kernel/sched/numa.c      |   85 +++++++++++++++++++++++++++++------------------
 2 files changed, 57 insertions(+), 33 deletions(-)
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,8 +287,9 @@ struct mm_rss_stat {
 
 struct numa_entity {
 #ifdef CONFIG_NUMA
-	int		 node;		/* home node */
-	struct list_head numa_entry;	/* balance list */
+	int			node;		/* home node */
+	struct list_head	numa_entry;	/* balance list */
+	const struct numa_ops	*nops;
 #endif
 };
 
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -7,6 +7,17 @@
 
 static const int numa_balance_interval = 2 * HZ; /* 2 seconds */
 
+struct numa_ops {
+	unsigned long	(*mem_load)(struct numa_entity *ne);
+	unsigned long	(*cpu_load)(struct numa_entity *ne);
+
+	void		(*mem_migrate)(struct numa_entity *ne, int node);
+	void		(*cpu_migrate)(struct numa_entity *ne, int node);
+
+	bool		(*tryget)(struct numa_entity *ne);
+	void		(*put)(struct numa_entity *ne);
+};
+
 struct numa_cpu_load {
 	unsigned long	remote; /* load of tasks running away from their home node */
 	unsigned long	all;	/* load of tasks that should be running on this node */
@@ -147,6 +158,26 @@ static inline struct task_struct *ne_own
 	return rcu_dereference(ne_mm(ne)->owner);
 }
 
+static unsigned long process_cpu_load(struct numa_entity *ne)
+{
+	unsigned long load = 0;
+	struct task_struct *t, *p;
+
+	rcu_read_lock();
+	t = p = ne_owner(ne);
+	if (p) do {
+		load += t->numa_contrib;
+	} while ((t = next_thread(t)) != p);
+	rcu_read_unlock();
+
+	return load;
+}
+
+static unsigned long process_mem_load(struct numa_entity *ne)
+{
+	return get_mm_counter(ne_mm(ne), MM_ANONPAGES);
+}
+
 static void process_cpu_migrate(struct numa_entity *ne, int node)
 {
 	struct task_struct *p, *t;
@@ -164,7 +195,7 @@ static void process_mem_migrate(struct n
 	lazy_migrate_process(ne_mm(ne), node);
 }
 
-static int process_tryget(struct numa_entity *ne)
+static bool process_tryget(struct numa_entity *ne)
 {
 	/*
 	 * This is possible when we hold &nq_of(ne->node)->lock since then
@@ -180,6 +211,17 @@ static void process_put(struct numa_enti
 	mmput(ne_mm(ne));
 }
 
+static const struct numa_ops process_numa_ops = {
+	.mem_load	= process_mem_load,
+	.cpu_load	= process_cpu_load,
+
+	.mem_migrate	= process_mem_migrate,
+	.cpu_migrate	= process_cpu_migrate,
+
+	.tryget		= process_tryget,
+	.put		= process_put,
+};
+
 static struct node_queue *lock_ne_nq(struct numa_entity *ne)
 {
 	struct node_queue *nq;
@@ -239,8 +281,8 @@ static void enqueue_ne(struct numa_entit
 
 	BUG_ON(ne->node != -1);
 
-	process_cpu_migrate(ne, node);
-	process_mem_migrate(ne, node);
+	ne->nops->cpu_migrate(ne, node);
+	ne->nops->mem_migrate(ne, node);
 
 	spin_lock(&nq->lock);
 	__enqueue_ne(nq, ne);
@@ -260,14 +302,15 @@ static void dequeue_ne(struct numa_entit
 	spin_unlock(&nq->lock);
 }
 
-static void init_ne(struct numa_entity *ne)
+static void init_ne(struct numa_entity *ne, const struct numa_ops *nops)
 {
 	ne->node = -1;
+	ne->nops = nops;
 }
 
 void mm_init_numa(struct mm_struct *mm)
 {
-	init_ne(&mm->numa);
+	init_ne(&mm->numa, &process_numa_ops);
 }
 
 void exit_numa(struct mm_struct *mm)
@@ -449,26 +492,6 @@ struct numa_imbalance {
 	enum numa_balance_type type;
 };
 
-static unsigned long process_cpu_load(struct numa_entity *ne)
-{
-	unsigned long load = 0;
-	struct task_struct *t, *p;
-
-	rcu_read_lock();
-	t = p = ne_owner(ne);
-	if (p) do {
-		load += t->numa_contrib;
-	} while ((t = next_thread(t)) != p);
-	rcu_read_unlock();
-
-	return load;
-}
-
-static unsigned long process_mem_load(struct numa_entity *ne)
-{
-	return get_mm_counter(ne_mm(ne), MM_ANONPAGES);
-}
-
 static int find_busiest_node(int this_node, struct numa_imbalance *imb)
 {
 	unsigned long cpu_load, mem_load;
@@ -590,8 +613,8 @@ static void move_processes(struct node_q
 				     struct numa_entity,
 				     numa_entry);
 
-		ne_cpu = process_cpu_load(ne);
-		ne_mem = process_mem_load(ne);
+		ne_cpu = ne->nops->cpu_load(ne);
+		ne_mem = ne->nops->mem_load(ne);
 
 		if (sched_feat(NUMA_BALANCE_FILTER)) {
 			/*
@@ -616,13 +639,13 @@ static void move_processes(struct node_q
 
 		__dequeue_ne(busiest_nq, ne);
 		__enqueue_ne(this_nq, ne);
-		if (process_tryget(ne)) {
+		if (ne->nops->tryget(ne)) {
 			double_unlock_nq(this_nq, busiest_nq);
 
-			process_cpu_migrate(ne, this_nq->node);
-			process_mem_migrate(ne, this_nq->node);
+			ne->nops->cpu_migrate(ne, this_nq->node);
+			ne->nops->mem_migrate(ne, this_nq->node);
+			ne->nops->put(ne);
 
-			process_put(ne);
 			double_lock_nq(this_nq, busiest_nq);
 		}
 




* [RFC][PATCH 17/26] srcu: revert1
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (15 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 18/26] srcu: revert2 Peter Zijlstra
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: srcu-revert-1.patch --]
[-- Type: text/plain, Size: 7888 bytes --]

    rcu: Call out dangers of expedited RCU primitives

    The expedited RCU primitives can be quite useful, but they have some
    high costs as well.  This commit updates and creates docbook comments
    calling out the costs, and updates the RCU documentation as well.
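
For reference, the guidance removed by this revert boils down to
batching: instead of forcing an expedited grace period per update, batch
the updates and wait once.  A sketch, where update_item() is a stand-in
for whatever the caller does per update:

  /* Anti-pattern the removed text warns against: */
  for (i = 0; i < n; i++) {
          update_item(i);
          synchronize_rcu_expedited();    /* disturbs all CPUs, n times */
  }

  /* Preferred: batch the updates, then wait for a single grace period. */
  for (i = 0; i < n; i++)
          update_item(i);
  synchronize_rcu();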

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/RCU/checklist.txt |   14 --------------
 include/linux/rcutree.h         |   16 ----------------
 kernel/rcutree.c                |   22 ++++++++--------------
 kernel/rcutree_plugin.h         |   20 ++++----------------
 kernel/srcu.c                   |   27 ++++++++++-----------------
 5 files changed, 22 insertions(+), 77 deletions(-)

--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -180,20 +180,6 @@ over a rather long period of time, but i
 	operations that would not normally be undertaken while a real-time
 	workload is running.
 
-	In particular, if you find yourself invoking one of the expedited
-	primitives repeatedly in a loop, please do everyone a favor:
-	Restructure your code so that it batches the updates, allowing
-	a single non-expedited primitive to cover the entire batch.
-	This will very likely be faster than the loop containing the
-	expedited primitive, and will be much much easier on the rest
-	of the system, especially to real-time workloads running on
-	the rest of the system.
-
-	In addition, it is illegal to call the expedited forms from
-	a CPU-hotplug notifier, or while holding a lock that is acquired
-	by a CPU-hotplug notifier.  Failing to observe this restriction
-	will result in deadlock.
-
 7.	If the updater uses call_rcu() or synchronize_rcu(), then the
 	corresponding readers must use rcu_read_lock() and
 	rcu_read_unlock().  If the updater uses call_rcu_bh() or
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -63,22 +63,6 @@ extern void synchronize_rcu_expedited(vo
 
 void kfree_call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
 
-/**
- * synchronize_rcu_bh_expedited - Brute-force RCU-bh grace period
- *
- * Wait for an RCU-bh grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.  In fact,
- * if you are using synchronize_rcu_bh_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_rcu_bh() instead.
- *
- * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.
- */
 static inline void synchronize_rcu_bh_expedited(void)
 {
 	synchronize_sched_expedited();
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1970,21 +1970,15 @@ static int synchronize_sched_expedited_c
 	return 0;
 }
 
-/**
- * synchronize_sched_expedited - Brute-force RCU-sched grace period
- *
- * Wait for an RCU-sched grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.  In fact,
- * if you are using synchronize_sched_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_sched() instead.
+/*
+ * Wait for an rcu-sched grace period to elapse, but use "big hammer"
+ * approach to force grace period to end quickly.  This consumes
+ * significant time on all CPUs, and is thus not recommended for
+ * any sort of common-case code.
  *
- * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.
+ * Note that it is illegal to call this function while holding any
+ * lock that is acquired by a CPU-hotplug notifier.  Failing to
+ * observe this restriction will result in deadlock.
  *
  * This implementation can be thought of as an application of ticket
  * locking to RCU, with sync_sched_expedited_started and
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -835,22 +835,10 @@ sync_rcu_preempt_exp_init(struct rcu_sta
 		rcu_report_exp_rnp(rsp, rnp, false); /* Don't wake self. */
 }
 
-/**
- * synchronize_rcu_expedited - Brute-force RCU grace period
- *
- * Wait for an RCU-preempt grace period, but expedite it.  The basic
- * idea is to invoke synchronize_sched_expedited() to push all the tasks to
- * the ->blkd_tasks lists and wait for this list to drain.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.
- * In fact, if you are using synchronize_rcu_expedited() in a loop,
- * please restructure your code to batch your updates, and then Use a
- * single synchronize_rcu() instead.
- *
- * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.
+/*
+ * Wait for an rcu-preempt grace period, but expedite it.  The basic idea
+ * is to invoke synchronize_sched_expedited() to push all the tasks to
+ * the ->blkd_tasks lists and wait for this list to drain.
  */
 void synchronize_rcu_expedited(void)
 {
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -286,26 +286,19 @@ void synchronize_srcu(struct srcu_struct
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 
 /**
- * synchronize_srcu_expedited - Brute-force SRCU grace period
+ * synchronize_srcu_expedited - like synchronize_srcu, but less patient
  * @sp: srcu_struct with which to synchronize.
  *
- * Wait for an SRCU grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.  In fact,
- * if you are using synchronize_srcu_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_srcu() instead.
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates.  Can block; must be called from
+ * process context.
  *
- * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.  It is also illegal to call
- * synchronize_srcu_expedited() from the corresponding SRCU read-side
- * critical section; doing so will result in deadlock.  However, it is
- * perfectly legal to call synchronize_srcu_expedited() on one srcu_struct
- * from some other srcu_struct's read-side critical section, as long as
- * the resulting graph of srcu_structs is acyclic.
+ * Note that it is illegal to call synchronize_srcu_expedited()
+ * from the corresponding SRCU read-side critical section; doing so
+ * will result in deadlock.  However, it is perfectly legal to call
+ * synchronize_srcu_expedited() on one srcu_struct from some other
+ * srcu_struct's read-side critical section.
  */
 void synchronize_srcu_expedited(struct srcu_struct *sp)
 {




* [RFC][PATCH 18/26] srcu: revert2
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (16 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 17/26] srcu: revert1 Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 19/26] srcu: Implement call_srcu() Peter Zijlstra
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: srcu-revert-2.patch --]
[-- Type: text/plain, Size: 3250 bytes --]

    rcu: Add lockdep-RCU checks for simple self-deadlock

    It is illegal to have a grace period within a same-flavor RCU read-side
    critical section, so this commit adds lockdep-RCU checks to splat when
    such abuse is encountered.  This commit does not detect more elaborate
    RCU deadlock situations.  These situations might be a job for lockdep
    enhancements.
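
For reference, the "simple self-deadlock" these (now reverted)
assertions splat on is same-flavor nesting such as the sketch below,
which can deadlock outright with preemptible RCU and is reported by the
lockdep-RCU check either way:

  rcu_read_lock();
  /*
   * synchronize_rcu() cannot return until this read-side critical
   * section ends, and the critical section cannot end until
   * synchronize_rcu() returns.
   */
  synchronize_rcu();
  rcu_read_unlock();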

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/rcutiny.c        |    4 ----
 kernel/rcutiny_plugin.h |    5 -----
 kernel/rcutree.c        |    8 --------
 kernel/rcutree_plugin.h |    4 ----
 kernel/srcu.c           |    6 ------
 5 files changed, 27 deletions(-)

--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -329,10 +329,6 @@ static void rcu_process_callbacks(struct
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU read-side critical section");
 	cond_resched();
 }
 EXPORT_SYMBOL_GPL(synchronize_sched);
--- a/kernel/rcutiny_plugin.h
+++ b/kernel/rcutiny_plugin.h
@@ -735,11 +735,6 @@ EXPORT_SYMBOL_GPL(call_rcu);
  */
 void synchronize_rcu(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu() in RCU read-side critical section");
-
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	if (!rcu_scheduler_active)
 		return;
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1919,10 +1919,6 @@ EXPORT_SYMBOL_GPL(call_rcu_bh);
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU-sched read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	wait_rcu_gp(call_rcu_sched);
@@ -1940,10 +1936,6 @@ EXPORT_SYMBOL_GPL(synchronize_sched);
  */
 void synchronize_rcu_bh(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	wait_rcu_gp(call_rcu_bh);
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -731,10 +731,6 @@ EXPORT_SYMBOL_GPL(kfree_call_rcu);
  */
 void synchronize_rcu(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu() in RCU read-side critical section");
 	if (!rcu_scheduler_active)
 		return;
 	wait_rcu_gp(call_rcu);
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -172,12 +172,6 @@ static void __synchronize_srcu(struct sr
 {
 	int idx;
 
-	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
-			   !lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
-
 	idx = sp->completed;
 	mutex_lock(&sp->mutex);
 




* [RFC][PATCH 19/26] srcu: Implement call_srcu()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (17 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 18/26] srcu: revert2 Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy() Peter Zijlstra
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: call_srcu.patch --]
[-- Type: text/plain, Size: 13883 bytes --]

Implement call_srcu() by using a state machine driven by
call_rcu_sched() and timer callbacks.

The state machine is a direct derivation of the existing
synchronize_srcu() code and replaces synchronize_sched() calls with a
call_rcu_sched() callback and the schedule_timeout() calls with simple
timer callbacks.

It then re-implements synchronize_srcu() using a completion that is
signalled from a call_srcu() callback.

It completely wrecks synchronize_srcu_expedited(), which is only used
by KVM.
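
A sketch of how a caller would use the new primitive; struct foo,
foo_free_rcu(), my_srcu and the object pointer f are assumed names, not
part of this patch:

  struct foo {
          /* ... payload ... */
          struct rcu_head rcu;
  };

  static void foo_free_rcu(struct rcu_head *head)
  {
          kfree(container_of(head, struct foo, rcu));
  }

  /*
   * Instead of blocking in synchronize_srcu(&my_srcu) before kfree(),
   * queue the free to run once all current my_srcu readers are done:
   */
  call_srcu(&my_srcu, &f->rcu, foo_free_rcu);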

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/srcu.h |   23 +++
 kernel/srcu.c        |  304 +++++++++++++++++++++++++++++----------------------
 2 files changed, 196 insertions(+), 131 deletions(-)

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,17 +27,35 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
-#include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/timer.h>
 
 struct srcu_struct_array {
 	int c[2];
 };
 
+enum srcu_state {
+	srcu_idle,
+	srcu_sync_1,
+	srcu_sync_2,
+	srcu_sync_2b,
+	srcu_wait,
+	srcu_wait_b,
+	srcu_sync_3,
+	srcu_sync_3b,
+};
+
 struct srcu_struct {
 	int completed;
 	struct srcu_struct_array __percpu *per_cpu_ref;
-	struct mutex mutex;
+	raw_spinlock_t lock;
+	enum srcu_state state;
+	union {
+		struct rcu_head head;
+		struct timer_list timer;
+	};
+	struct rcu_head *pending[2];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
@@ -73,6 +91,7 @@ void __srcu_read_unlock(struct srcu_stru
 void synchronize_srcu(struct srcu_struct *sp);
 void synchronize_srcu_expedited(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
+void call_srcu(struct srcu_struct *sp, struct rcu_head *head, void (*func)(struct rcu_head *));
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -16,6 +16,7 @@
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  *
  * Copyright (C) IBM Corporation, 2006
+ * Copyright (C) 2012 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *
  * Author: Paul McKenney <paulmck@us.ibm.com>
  *
@@ -33,11 +34,14 @@
 #include <linux/smp.h>
 #include <linux/delay.h>
 #include <linux/srcu.h>
+#include <linux/completion.h>
 
 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
 	sp->completed = 0;
-	mutex_init(&sp->mutex);
+	raw_spin_lock_init(&sp->lock);
+	sp->state = srcu_idle;
+	sp->pending[0] = sp->pending[1] = NULL;
 	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
 	return sp->per_cpu_ref ? 0 : -ENOMEM;
 }
@@ -155,119 +159,190 @@ void __srcu_read_unlock(struct srcu_stru
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
 
-/*
- * We use an adaptive strategy for synchronize_srcu() and especially for
- * synchronize_srcu_expedited().  We spin for a fixed time period
- * (defined below) to allow SRCU readers to exit their read-side critical
- * sections.  If there are still some readers after 10 microseconds,
- * we repeatedly block for 1-millisecond time periods.  This approach
- * has done well in testing, so there is no need for a config parameter.
+
+/**
+ * synchronize_srcu_expedited - like synchronize_srcu, but less patient
+ * @sp: srcu_struct with which to synchronize.
+ *
+ * Note that it is illegal to call synchronize_srcu_expedited()
+ * from the corresponding SRCU read-side critical section; doing so
+ * will result in deadlock.  However, it is perfectly legal to call
+ * synchronize_srcu_expedited() on one srcu_struct from some other
+ * srcu_struct's read-side critical section.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY 10
+void synchronize_srcu_expedited(struct srcu_struct *sp)
+{
+	/* XXX kill me */
+	synchronize_srcu(sp);
+}
+EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
 
-/*
- * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
+/**
+ * srcu_batches_completed - return batches completed.
+ * @sp: srcu_struct on which to report batch completion.
+ *
+ * Report the number of batches, correlated with, but not necessarily
+ * precisely the same as, the number of grace periods that have elapsed.
  */
-static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
+long srcu_batches_completed(struct srcu_struct *sp)
 {
-	int idx;
+	return sp->completed;
+}
+EXPORT_SYMBOL_GPL(srcu_batches_completed);
+
+static void do_srcu_state(struct srcu_struct *sp);
 
-	idx = sp->completed;
-	mutex_lock(&sp->mutex);
+static void do_srcu_state_timer(unsigned long __data)
+{
+	struct srcu_struct *sp = (void *)__data;
+	do_srcu_state(sp);
+}
 
-	/*
-	 * Check to see if someone else did the work for us while we were
-	 * waiting to acquire the lock.  We need -two- advances of
-	 * the counter, not just one.  If there was but one, we might have
-	 * shown up -after- our helper's first synchronize_sched(), thus
-	 * having failed to prevent CPU-reordering races with concurrent
-	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
-	 * either (1) wait for two or (2) supply the second ourselves.
-	 */
+static void do_srcu_state_rcu(struct rcu_head *head)
+{
+	struct srcu_struct *sp = container_of(head, struct srcu_struct, head);
+	do_srcu_state(sp);
+}
 
-	if ((sp->completed - idx) >= 2) {
-		mutex_unlock(&sp->mutex);
-		return;
+static void do_srcu_state(struct srcu_struct *sp)
+{
+	struct rcu_head *head, *next;
+	unsigned long flags;
+	int idx;
+
+	raw_spin_lock_irqsave(&sp->lock, flags);
+	switch (sp->state) {
+	case srcu_idle:
+		BUG();
+
+	case srcu_sync_1:
+		/*
+		 * The preceding synchronize_sched() ensures that any CPU that
+		 * sees the new value of sp->completed will also see any
+		 * preceding changes to data structures made by this CPU.  This
+		 * prevents some other CPU from reordering the accesses in its
+		 * SRCU read-side critical section to precede the corresponding
+		 * srcu_read_lock() -- ensuring that such references will in
+		 * fact be protected.
+		 *
+		 * So it is now safe to do the flip.
+		 */
+		idx = sp->completed & 0x1;
+		sp->completed++;
+
+		sp->state = srcu_sync_2 + idx;
+		call_rcu_sched(&sp->head, do_srcu_state_rcu);
+		break;
+
+	case srcu_sync_2:
+	case srcu_sync_2b:
+		idx = sp->state - srcu_sync_2;
+
+		init_timer(&sp->timer);
+		sp->timer.data = (unsigned long)sp;
+		sp->timer.function = do_srcu_state_timer;
+		sp->state = srcu_wait + idx;
+
+		/*
+		 * At this point, because of the preceding synchronize_sched(),
+		 * all srcu_read_lock() calls using the old counters have
+		 * completed. Their corresponding critical sections might well
+		 * be still executing, but the srcu_read_lock() primitives
+		 * themselves will have finished executing.
+		 */
+test_pending:
+		if (!srcu_readers_active_idx(sp, idx)) {
+			sp->state = srcu_sync_3 + idx;
+			call_rcu_sched(&sp->head, do_srcu_state_rcu);
+			break;
+		}
+
+		mod_timer(&sp->timer, jiffies + 1);
+		break;
+
+	case srcu_wait:
+	case srcu_wait_b:
+		idx = sp->state - srcu_wait;
+		goto test_pending;
+
+	case srcu_sync_3:
+	case srcu_sync_3b:
+		idx = sp->state - srcu_sync_3;
+		/*
+		 * The preceding synchronize_sched() forces all
+		 * srcu_read_unlock() primitives that were executing
+		 * concurrently with the preceding for_each_possible_cpu() loop
+		 * to have completed by this point. More importantly, it also
+		 * forces the corresponding SRCU read-side critical sections to
+		 * have also completed, and the corresponding references to
+		 * SRCU-protected data items to be dropped.
+		 */
+		head = sp->pending[idx];
+		sp->pending[idx] = NULL;
+		raw_spin_unlock(&sp->lock);
+		while (head) {
+			next = head->next;
+			head->func(head);
+			head = next;
+		}
+		raw_spin_lock(&sp->lock);
+
+		/*
+		 * If there's a new batch waiting...
+		 */
+		if (sp->pending[idx ^ 1]) {
+			sp->state = srcu_sync_1;
+			call_rcu_sched(&sp->head, do_srcu_state_rcu);
+			break;
+		}
+
+		/*
+		 * We done!!
+		 */
+		sp->state = srcu_idle;
+		break;
 	}
+	raw_spin_unlock_irqrestore(&sp->lock, flags);
+}
 
-	sync_func();  /* Force memory barrier on all CPUs. */
+void call_srcu(struct srcu_struct *sp,
+	       struct rcu_head *head, void (*func)(struct rcu_head *))
+{
+	unsigned long flags;
+	int idx;
 
-	/*
-	 * The preceding synchronize_sched() ensures that any CPU that
-	 * sees the new value of sp->completed will also see any preceding
-	 * changes to data structures made by this CPU.  This prevents
-	 * some other CPU from reordering the accesses in its SRCU
-	 * read-side critical section to precede the corresponding
-	 * srcu_read_lock() -- ensuring that such references will in
-	 * fact be protected.
-	 *
-	 * So it is now safe to do the flip.
-	 */
+	head->func = func;
 
-	idx = sp->completed & 0x1;
-	sp->completed++;
+	raw_spin_lock_irqsave(&sp->lock, flags);
+	idx = sp->completed & 1;
+	barrier(); /* look at sp->completed once */
+	head->next = sp->pending[idx];
+	sp->pending[idx] = head;
+
+	if (sp->state == srcu_idle) {
+		sp->state = srcu_sync_1;
+		call_rcu_sched(&sp->head, do_srcu_state_rcu);
+	}
+	raw_spin_unlock_irqrestore(&sp->lock, flags);
+}
+EXPORT_SYMBOL_GPL(call_srcu);
 
-	sync_func();  /* Force memory barrier on all CPUs. */
+struct srcu_waiter {
+	struct completion wait;
+	struct rcu_head head;
+};
 
-	/*
-	 * At this point, because of the preceding synchronize_sched(),
-	 * all srcu_read_lock() calls using the old counters have completed.
-	 * Their corresponding critical sections might well be still
-	 * executing, but the srcu_read_lock() primitives themselves
-	 * will have finished executing.  We initially give readers
-	 * an arbitrarily chosen 10 microseconds to get out of their
-	 * SRCU read-side critical sections, then loop waiting 1/HZ
-	 * seconds per iteration.  The 10-microsecond value has done
-	 * very well in testing.
-	 */
-
-	if (srcu_readers_active_idx(sp, idx))
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-	while (srcu_readers_active_idx(sp, idx))
-		schedule_timeout_interruptible(1);
-
-	sync_func();  /* Force memory barrier on all CPUs. */
-
-	/*
-	 * The preceding synchronize_sched() forces all srcu_read_unlock()
-	 * primitives that were executing concurrently with the preceding
-	 * for_each_possible_cpu() loop to have completed by this point.
-	 * More importantly, it also forces the corresponding SRCU read-side
-	 * critical sections to have also completed, and the corresponding
-	 * references to SRCU-protected data items to be dropped.
-	 *
-	 * Note:
-	 *
-	 *	Despite what you might think at first glance, the
-	 *	preceding synchronize_sched() -must- be within the
-	 *	critical section ended by the following mutex_unlock().
-	 *	Otherwise, a task taking the early exit can race
-	 *	with a srcu_read_unlock(), which might have executed
-	 *	just before the preceding srcu_readers_active() check,
-	 *	and whose CPU might have reordered the srcu_read_unlock()
-	 *	with the preceding critical section.  In this case, there
-	 *	is nothing preventing the synchronize_sched() task that is
-	 *	taking the early exit from freeing a data structure that
-	 *	is still being referenced (out of order) by the task
-	 *	doing the srcu_read_unlock().
-	 *
-	 *	Alternatively, the comparison with "2" on the early exit
-	 *	could be changed to "3", but this increases synchronize_srcu()
-	 *	latency for bulk loads.  So the current code is preferred.
-	 */
+static void synchronize_srcu_complete(struct rcu_head *head)
+{
+	struct srcu_waiter *waiter = container_of(head, struct srcu_waiter, head);
 
-	mutex_unlock(&sp->mutex);
+	complete(&waiter->wait);
 }
 
 /**
  * synchronize_srcu - wait for prior SRCU read-side critical-section completion
  * @sp: srcu_struct with which to synchronize.
  *
- * Flip the completed counter, and wait for the old count to drain to zero.
- * As with classic RCU, the updater must use some separate means of
- * synchronizing concurrent updates.  Can block; must be called from
- * process context.
- *
  * Note that it is illegal to call synchronize_srcu() from the corresponding
  * SRCU read-side critical section; doing so will result in deadlock.
  * However, it is perfectly legal to call synchronize_srcu() on one
@@ -275,41 +350,12 @@ static void __synchronize_srcu(struct sr
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, synchronize_sched);
-}
-EXPORT_SYMBOL_GPL(synchronize_srcu);
+	struct srcu_waiter waiter = {
+		.wait = COMPLETION_INITIALIZER_ONSTACK(waiter.wait),
+	};
 
-/**
- * synchronize_srcu_expedited - like synchronize_srcu, but less patient
- * @sp: srcu_struct with which to synchronize.
- *
- * Flip the completed counter, and wait for the old count to drain to zero.
- * As with classic RCU, the updater must use some separate means of
- * synchronizing concurrent updates.  Can block; must be called from
- * process context.
- *
- * Note that it is illegal to call synchronize_srcu_expedited()
- * from the corresponding SRCU read-side critical section; doing so
- * will result in deadlock.  However, it is perfectly legal to call
- * synchronize_srcu_expedited() on one srcu_struct from some other
- * srcu_struct's read-side critical section.
- */
-void synchronize_srcu_expedited(struct srcu_struct *sp)
-{
-	__synchronize_srcu(sp, synchronize_sched_expedited);
-}
-EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
-
-/**
- * srcu_batches_completed - return batches completed.
- * @sp: srcu_struct on which to report batch completion.
- *
- * Report the number of batches, correlated with, but not necessarily
- * precisely the same as, the number of grace periods that have elapsed.
- */
+	call_srcu(sp, &waiter.head, synchronize_srcu_complete);
 
-long srcu_batches_completed(struct srcu_struct *sp)
-{
-	return sp->completed;
+	wait_for_completion(&waiter.wait);
 }
-EXPORT_SYMBOL_GPL(srcu_batches_completed);
+EXPORT_SYMBOL_GPL(synchronize_srcu);




* [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (18 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 19/26] srcu: Implement call_srcu() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy() Peter Zijlstra
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-vma_dup_policy.patch --]
[-- Type: text/plain, Size: 4390 bytes --]

In preparation for other changes, pull some duplicated code into a
common function so that we can later extend its behaviour without
having to touch all these sites.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    3 +++
 kernel/fork.c             |    9 +++------
 mm/mempolicy.c            |   11 +++++++++++
 mm/mmap.c                 |   17 +++++------------
 4 files changed, 22 insertions(+), 18 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -168,6 +168,8 @@ static inline struct mempolicy *mpol_dup
 #define vma_policy(vma) ((vma)->vm_policy)
 #define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
 
+extern int vma_dup_policy(struct vm_area_struct *new, struct vm_area_struct *old);
+
 static inline void mpol_get(struct mempolicy *pol)
 {
 	if (pol)
@@ -311,6 +313,7 @@ mpol_shared_policy_lookup(struct shared_
 
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
+#define vma_dup_policy(new, old) (0)
 
 static inline void numa_policy_init(void)
 {
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,7 +315,6 @@ static int dup_mmap(struct mm_struct *mm
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
-	struct mempolicy *pol;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
@@ -365,11 +364,9 @@ static int dup_mmap(struct mm_struct *mm
 			goto fail_nomem;
 		*tmp = *mpnt;
 		INIT_LIST_HEAD(&tmp->anon_vma_chain);
-		pol = mpol_dup(vma_policy(mpnt));
-		retval = PTR_ERR(pol);
-		if (IS_ERR(pol))
+		retval = vma_dup_policy(tmp, mpnt);
+		if (retval)
 			goto fail_nomem_policy;
-		vma_set_policy(tmp, pol);
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
@@ -431,7 +428,7 @@ static int dup_mmap(struct mm_struct *mm
 	up_write(&oldmm->mmap_sem);
 	return retval;
 fail_nomem_anon_vma_fork:
-	mpol_put(pol);
+	mpol_put(vma_policy(tmp));
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1971,6 +1971,17 @@ struct mempolicy *__mpol_dup(struct memp
 	return new;
 }
 
+int vma_dup_policy(struct vm_area_struct *new, struct vm_area_struct *old)
+{
+	struct mempolicy *mpol;
+
+	mpol = mpol_dup(vma_policy(old));
+	if (IS_ERR(mpol))
+		return PTR_ERR(mpol);
+	vma_set_policy(new, mpol);
+	return 0;
+}
+
 /*
  * If *frompol needs [has] an extra ref, copy *frompol to *tompol ,
  * eliminate the * MPOL_F_* flags that require conditional ref and
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1935,7 +1935,6 @@ detach_vmas_to_be_unmapped(struct mm_str
 static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
-	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	int err = -ENOMEM;
 
@@ -1959,12 +1958,9 @@ static int __split_vma(struct mm_struct
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
-	pol = mpol_dup(vma_policy(vma));
-	if (IS_ERR(pol)) {
-		err = PTR_ERR(pol);
+	err = vma_dup_policy(new, vma);
+	if (err)
 		goto out_free_vma;
-	}
-	vma_set_policy(new, pol);
 
 	if (anon_vma_clone(new, vma))
 		goto out_free_mpol;
@@ -1998,7 +1994,7 @@ static int __split_vma(struct mm_struct
 	}
 	unlink_anon_vmas(new);
  out_free_mpol:
-	mpol_put(pol);
+	mpol_put(new->vm_policy);
  out_free_vma:
 	kmem_cache_free(vm_area_cachep, new);
  out_err:
@@ -2331,7 +2327,6 @@ struct vm_area_struct *copy_vma(struct v
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
-	struct mempolicy *pol;
 	bool faulted_in_anon_vma = true;
 
 	/*
@@ -2372,13 +2367,11 @@ struct vm_area_struct *copy_vma(struct v
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
 			*new_vma = *vma;
-			pol = mpol_dup(vma_policy(vma));
-			if (IS_ERR(pol))
+			if (vma_dup_policy(new_vma, vma))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
 			if (anon_vma_clone(new_vma, vma))
 				goto out_free_mempol;
-			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
 			new_vma->vm_end = addr + len;
 			new_vma->vm_pgoff = pgoff;
@@ -2399,7 +2392,7 @@ struct vm_area_struct *copy_vma(struct v
 	return new_vma;
 
  out_free_mempol:
-	mpol_put(pol);
+	mpol_put(new_vma->vm_policy);
  out_free_vma:
 	kmem_cache_free(vm_area_cachep, new_vma);
 	return NULL;




* [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (19 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 22/26] mm, mpol: Split and expose some mempolicy functions Peter Zijlstra
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: mpol-vma_put_policy.patch --]
[-- Type: text/plain, Size: 2420 bytes --]

In preparation for other changes, create a new interface so that we can
later extend its behaviour without having to touch all these sites.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    5 +++++
 mm/mempolicy.c            |    5 +++++
 mm/mmap.c                 |    8 ++++----
 3 files changed, 14 insertions(+), 4 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -169,6 +169,7 @@ static inline struct mempolicy *mpol_dup
 #define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
 
 extern int vma_dup_policy(struct vm_area_struct *new, struct vm_area_struct *old);
+extern void vma_put_policy(struct vm_area_struct *vma);
 
 static inline void mpol_get(struct mempolicy *pol)
 {
@@ -315,6 +316,10 @@ mpol_shared_policy_lookup(struct shared_
 #define vma_set_policy(vma, pol) do {} while(0)
 #define vma_dup_policy(new, old) (0)
 
+static inline void vma_put_policy(struct vm_area_struct *vma)
+{
+}
+
 static inline void numa_policy_init(void)
 {
 }
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1982,6 +1982,11 @@ int vma_dup_policy(struct vm_area_struct
 	return 0;
 }
 
+void vma_put_policy(struct vm_area_struct *vma)
+{
+	mpol_put(vma_policy(vma));
+}
+
 /*
  * If *frompol needs [has] an extra ref, copy *frompol to *tompol ,
  * eliminate the * MPOL_F_* flags that require conditional ref and
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -236,7 +236,7 @@ static struct vm_area_struct *remove_vma
 		if (vma->vm_flags & VM_EXECUTABLE)
 			removed_exe_file_vma(vma->vm_mm);
 	}
-	mpol_put(vma_policy(vma));
+	vma_put_policy(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -633,7 +633,7 @@ again:			remove_next = 1 + (end > next->
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
-		mpol_put(vma_policy(next));
+		vma_put_policy(next);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1994,7 +1994,7 @@ static int __split_vma(struct mm_struct
 	}
 	unlink_anon_vmas(new);
  out_free_mpol:
-	mpol_put(new->vm_policy);
+	vma_put_policy(new);
  out_free_vma:
 	kmem_cache_free(vm_area_cachep, new);
  out_err:
@@ -2392,7 +2392,7 @@ struct vm_area_struct *copy_vma(struct v
 	return new_vma;
 
  out_free_mempol:
-	mpol_put(new_vma->vm_policy);
+	vma_put_policy(new_vma);
  out_free_vma:
 	kmem_cache_free(vm_area_cachep, new_vma);
 	return NULL;




* [RFC][PATCH 22/26] mm, mpol: Split and expose some mempolicy functions
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (20 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind() Peter Zijlstra
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: mpol-mbind-split.patch --]
[-- Type: text/plain, Size: 6169 bytes --]

In order to allow creating 'custom' mempolicies, expose some of the
internals: in particular, the means to allocate a fresh mempolicy and to
bind it to a memory range, skipping the intermediate step where
sys_mbind() constructs the policy from its user-supplied arguments.
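
A rough sketch of what an in-kernel user of the newly exported pair
could look like, mirroring what do_mbind() does below; mm, start, len
and nid are assumed to come from the caller:

  nodemask_t nodes = nodemask_of_node(nid);
  struct mempolicy *new;
  long err;

  new = mpol_new(MPOL_BIND, 0, &nodes);
  if (IS_ERR(new))
          return PTR_ERR(new);

  down_write(&mm->mmap_sem);
  err = mpol_do_mbind(start, len, new, MPOL_BIND, &nodes, MPOL_MF_MOVE);
  up_write(&mm->mmap_sem);
  mpol_put(new);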

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    8 +++
 mm/mempolicy.c            |  111 ++++++++++++++++++++++++++--------------------
 2 files changed, 71 insertions(+), 48 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -203,6 +203,12 @@ struct shared_policy {
 	spinlock_t lock;
 };
 
+extern struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
+				  nodemask_t *nodes);
+extern long mpol_do_mbind(unsigned long start, unsigned long len,
+				struct mempolicy *policy, unsigned long mode,
+				nodemask_t *nmask, unsigned long flags);
+
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -216,6 +222,8 @@ struct mempolicy *get_vma_policy(struct 
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
+extern void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *new,
+				enum mpol_rebind_step step);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
 				enum mpol_rebind_step step);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -259,7 +259,7 @@ static int mpol_set_nodemask(struct memp
  * This function just creates a new policy, does some check and simple
  * initialization. You must invoke mpol_set_nodemask() to set nodes.
  */
-static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
+struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 				  nodemask_t *nodes)
 {
 	struct mempolicy *policy;
@@ -401,7 +401,7 @@ static void mpol_rebind_preferred(struct
  * 	MPOL_REBIND_STEP1 - set all the newly nodes
  * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
  */
-static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
+void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
 				enum mpol_rebind_step step)
 {
 	if (!pol)
@@ -1067,55 +1067,28 @@ static struct page *new_vma_page(struct 
 }
 #endif
 
-static long do_mbind(unsigned long start, unsigned long len,
-		     unsigned short mode, unsigned short mode_flags,
-		     nodemask_t *nmask, unsigned long flags)
+long mpol_do_mbind(unsigned long start, unsigned long len,
+		struct mempolicy *new, unsigned long mode,
+		nodemask_t *nmask, unsigned long flags)
 {
-	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
-	struct mempolicy *new = NULL;
-	unsigned long end;
+	struct vm_area_struct *vma;
 	int err, nr_failed = 0;
+	unsigned long end;
 	LIST_HEAD(pagelist);
 
-  	if (flags & ~(unsigned long)MPOL_MF_VALID)
-		return -EINVAL;
-	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
-		return -EPERM;
-
-	if (start & ~PAGE_MASK)
-		return -EINVAL;
-
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
-		flags &= ~MPOL_MF_STRICT;
-
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
 	end = start + len;
 
-	if (end < start)
-		return -EINVAL;
-	if (end == start)
-		return 0;
-
-	if (mode != MPOL_NOOP) {
-		new = mpol_new(mode, mode_flags, nmask);
-		if (IS_ERR(new))
-			return PTR_ERR(new);
-
-		if (flags & MPOL_MF_LAZY)
-			new->flags |= MPOL_F_MOF;
-
+	if (end < start) {
+		err = -EINVAL;
+		goto mpol_out;
 	}
-	/*
-	 * If we are using the default policy then operation
-	 * on discontinuous address spaces is okay after all
-	 */
-	if (!new)
-		flags |= MPOL_MF_DISCONTIG_OK;
 
-	pr_debug("mbind %lx-%lx mode:%d flags:%d nodes:%lx\n",
-		 start, start + len, mode, mode_flags,
-		 nmask ? nodes_addr(*nmask)[0] : -1);
+	if (end == start) {
+		err = 0;
+		goto mpol_out;
+	}
 
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
 		err = migrate_prep();
@@ -1123,8 +1096,6 @@ static long do_mbind(unsigned long start
 			goto mpol_out;
 	}
 
-	down_write(&mm->mmap_sem);
-
 	if (mode != MPOL_NOOP) {
 		NODEMASK_SCRATCH(scratch);
 		err = -ENOMEM;
@@ -1135,7 +1106,7 @@ static long do_mbind(unsigned long start
 		}
 		NODEMASK_SCRATCH_FREE(scratch);
 		if (err)
-			goto mpol_out_unlock;
+			goto mpol_out;
 	}
 
 	vma = check_range(mm, start, end, nmask,
@@ -1143,12 +1114,12 @@ static long do_mbind(unsigned long start
 
 	err = PTR_ERR(vma);	/* maybe ... */
 	if (IS_ERR(vma))
-		goto mpol_out_unlock;
+		goto mpol_out_putback;
 
 	if (mode != MPOL_NOOP) {
 		err = mbind_range(mm, start, end, new);
 		if (err)
-			goto mpol_out_unlock;
+			goto mpol_out_putback;
 	}
 
 	if (!list_empty(&pagelist)) {
@@ -1164,12 +1135,56 @@ static long do_mbind(unsigned long start
 	if (nr_failed && (flags & MPOL_MF_STRICT))
 		err = -EIO;
 
+mpol_out_putback:
 	putback_lru_pages(&pagelist);
 
-mpol_out_unlock:
-	up_write(&mm->mmap_sem);
 mpol_out:
+	return err;
+}
+
+static long do_mbind(unsigned long start, unsigned long len,
+		     unsigned short mode, unsigned short mode_flags,
+		     nodemask_t *nmask, unsigned long flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct mempolicy *new = NULL;
+	int err;
+
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
+		return -EINVAL;
+	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
+		flags &= ~MPOL_MF_STRICT;
+
+	if (mode != MPOL_NOOP) {
+		new = mpol_new(mode, mode_flags, nmask);
+		if (IS_ERR(new))
+			return PTR_ERR(new);
+
+		if (flags & MPOL_MF_LAZY)
+			new->flags |= MPOL_F_MOF;
+	}
+	/*
+	 * If we are using the default policy then operation
+	 * on discontinuous address spaces is okay after all
+	 */
+	if (!new)
+		flags |= MPOL_MF_DISCONTIG_OK;
+
+	pr_debug("mbind %lx-%lx mode:%d flags:%d nodes:%lx\n",
+		 start, start + len, mode, mode_flags,
+		 nmask ? nodes_addr(*nmask)[0] : -1);
+
+	down_write(&mm->mmap_sem);
+	err = mpol_do_mbind(start, len, new, mode, nmask, flags);
+	up_write(&mm->mmap_sem);
 	mpol_put(new);
+
 	return err;
 }
 




* [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind()
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (21 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 22/26] mm, mpol: Split and expose some mempolicy functions Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting Peter Zijlstra
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-foo-syscall.patch --]
[-- Type: text/plain, Size: 19723 bytes --]

Now that we have a NUMA process scheduler, provide a syscall interface
for finer-grained NUMA balancing. In particular, this allows setting up
NUMA groups of threads and vmas within a process.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/syscalls/syscall_32.tbl |    2 
 arch/x86/syscalls/syscall_64.tbl |    2 
 include/asm-generic/unistd.h     |    6 
 include/linux/mempolicy.h        |   35 ++
 include/linux/sched.h            |    2 
 include/linux/syscalls.h         |    3 
 kernel/exit.c                    |    1 
 kernel/sched/numa.c              |  582 ++++++++++++++++++++++++++++++++++++++-
 kernel/sys_ni.c                  |    4 
 mm/mempolicy.c                   |    8 
 10 files changed, 639 insertions(+), 6 deletions(-)
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,5 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	numa_mbind		sys_numa_mbind			compat_sys_numa_mbind
+350	i386	numa_tbind		sys_numa_tbind			compat_sys_numa_tbind
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -318,6 +318,8 @@
 309	common	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	numa_mbind		sys_numa_mbind
+313	64	numa_tbind		sys_numa_tbind
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -691,9 +691,13 @@ __SC_COMP(__NR_process_vm_readv, sys_pro
 #define __NR_process_vm_writev 271
 __SC_COMP(__NR_process_vm_writev, sys_process_vm_writev, \
           compat_sys_process_vm_writev)
+#define __NR_numa_mbind 272
+__SC_COMP(__NR_numa_mbind, sys_numa_mbind, compat_sys_numa_mbind)
+#define __NR_numa_tbind 273
+__SC_COMP(__NR_numa_tbind, sys_numa_tbind, compat_sys_numa_tbind)
 
 #undef __NR_syscalls
-#define __NR_syscalls 272
+#define __NR_syscalls 274
 
 /*
  * All syscalls below here should go away really,
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -78,6 +78,8 @@ enum mpol_rebind_step {
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
+#include <linux/list.h>
+#include <linux/sched.h>
 
 struct mm_struct;
 
@@ -109,6 +111,10 @@ struct mempolicy {
 	atomic_t refcnt;
 	unsigned short mode; 	/* See MPOL_* above */
 	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
+	struct numa_group *numa_group;
+	struct list_head ng_entry;
+	struct vm_area_struct *vma;
+	struct rcu_head rcu;
 	union {
 		short 		 preferred_node; /* preferred */
 		nodemask_t	 nodes;		/* interleave/bind */
@@ -396,6 +402,35 @@ static inline int mpol_to_str(char *buff
 }
 
 #endif /* CONFIG_NUMA */
+
+#ifdef CONFIG_NUMA
+
+extern void __numa_task_exit(struct task_struct *);
+extern void numa_vma_link(struct vm_area_struct *, struct vm_area_struct *);
+extern void numa_vma_unlink(struct vm_area_struct *);
+extern void __numa_add_vma_counter(struct vm_area_struct *, int, long);
+
+static inline
+void numa_add_vma_counter(struct vm_area_struct *vma, int member, long value)
+{
+	if (vma->vm_policy && vma->vm_policy->numa_group)
+		__numa_add_vma_counter(vma, member, value);
+}
+
+static inline void numa_task_exit(struct task_struct *p)
+{
+	if (p->numa_group)
+		__numa_task_exit(p);
+}
+
+#else /* CONFIG_NUMA */
+
+static inline void numa_task_exit(struct task_struct *p) { }
+static inline void numa_vma_link(struct vm_area_struct *new, struct vm_area_struct *old) { }
+static inline void numa_vma_unlink(struct vm_area_struct *vma) { }
+static inline void numa_add_vma_counter(struct vm_area_struct *vma, int member, long value) { }
+
+#endif /* CONFIG_NUMA */
+
 #endif /* __KERNEL__ */
 
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1548,6 +1548,8 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 	int node;
+	struct numa_group *numa_group;
+	struct list_head ng_entry;
 #endif
 	struct rcu_head rcu;
 
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -856,5 +856,8 @@ asmlinkage long sys_process_vm_writev(pi
 				      const struct iovec __user *rvec,
 				      unsigned long riovcnt,
 				      unsigned long flags);
+asmlinkage long sys_numa_mbind(unsigned long addr, unsigned long len,
+			       int ng_id, unsigned long flags);
+asmlinkage long sys_numa_tbind(int tid, int ng_id, unsigned long flags);
 
 #endif
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1010,6 +1010,7 @@ void do_exit(long code)
 	mpol_put(tsk->mempolicy);
 	tsk->mempolicy = NULL;
 	task_unlock(tsk);
+	numa_task_exit(tsk);
 #endif
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -14,6 +14,7 @@
 
 #include <linux/mempolicy.h>
 #include <linux/kthread.h>
+#include <linux/compat.h>
 
 #include "sched.h"
 
@@ -302,17 +303,20 @@ static void enqueue_ne(struct numa_entit
 	spin_unlock(&nq->lock);
 }
 
-static void dequeue_ne(struct numa_entity *ne)
+static int dequeue_ne(struct numa_entity *ne)
 {
 	struct node_queue *nq;
+	int node = ne->node; // XXX serialization
 
-	if (ne->node == -1) // XXX serialization
-		return;
+	if (node == -1) // XXX serialization
+		return node;
 
 	nq = lock_ne_nq(ne);
 	ne->node = -1;
 	__dequeue_ne(nq, ne);
 	spin_unlock(&nq->lock);
+
+	return node;
 }
 
 static void init_ne(struct numa_entity *ne, const struct numa_ops *nops)
@@ -400,6 +404,8 @@ static int find_idlest_node(int this_nod
 
 void select_task_node(struct task_struct *p, struct mm_struct *mm, int sd_flags)
 {
+	int node;
+
 	if (!sched_feat(NUMA_SELECT)) {
 		p->node = -1;
 		return;
@@ -424,7 +430,11 @@ void select_task_node(struct task_struct
 		}
 	}
 
-	enqueue_ne(&mm->numa, find_idlest_node(p->node));
+	node = find_idlest_node(p->node);
+	if (node == -1)
+		node = numa_node_id();
+
+	enqueue_ne(&mm->numa, node);
 }
 
 __init void init_sched_numa(void)
@@ -804,3 +814,567 @@ static __init int numa_init(void)
 	return 0;
 }
 early_initcall(numa_init);
+
+
+/*
+ *  numa_group bits
+ */
+
+#include <linux/idr.h>
+#include <linux/srcu.h>
+#include <linux/syscalls.h>
+
+struct numa_group {
+	spinlock_t		lock;
+	int			id;
+
+	struct mm_rss_stat	rss;
+
+	struct list_head	tasks;
+	struct list_head	vmas;
+
+	const struct cred	*cred;
+	atomic_t		ref;
+
+	struct numa_entity	numa_entity;
+
+	struct rcu_head		rcu;
+};
+
+static struct srcu_struct ng_srcu;
+
+static DEFINE_MUTEX(numa_group_idr_lock);
+static DEFINE_IDR(numa_group_idr);
+
+static inline struct numa_group *ne_ng(struct numa_entity *ne)
+{
+	return container_of(ne, struct numa_group, numa_entity);
+}
+
+static inline bool ng_tryget(struct numa_group *ng)
+{
+	return atomic_inc_not_zero(&ng->ref);
+}
+
+static inline void ng_get(struct numa_group *ng)
+{
+	atomic_inc(&ng->ref);
+}
+
+static void __ng_put_rcu(struct rcu_head *rcu)
+{
+	struct numa_group *ng = container_of(rcu, struct numa_group, rcu);
+
+	put_cred(ng->cred);
+	kfree(ng);
+}
+
+static void __ng_put(struct numa_group *ng)
+{
+	mutex_lock(&numa_group_idr_lock);
+	idr_remove(&numa_group_idr, ng->id);
+	mutex_unlock(&numa_group_idr_lock);
+
+	WARN_ON(!list_empty(&ng->tasks));
+	WARN_ON(!list_empty(&ng->vmas));
+
+	dequeue_ne(&ng->numa_entity);
+
+	call_rcu(&ng->rcu, __ng_put_rcu);
+}
+
+static inline void ng_put(struct numa_group *ng)
+{
+	if (atomic_dec_and_test(&ng->ref))
+		__ng_put(ng);
+}
+
+/*
+ * numa_ops
+ */
+
+static unsigned long numa_group_mem_load(struct numa_entity *ne)
+{
+	struct numa_group *ng = ne_ng(ne);
+
+	return atomic_long_read(&ng->rss.count[MM_ANONPAGES]);
+}
+
+static unsigned long numa_group_cpu_load(struct numa_entity *ne)
+{
+	struct numa_group *ng = ne_ng(ne);
+	unsigned long load = 0;
+	struct task_struct *p;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &ng->tasks, ng_entry)
+		load += p->numa_contrib;
+	rcu_read_unlock();
+
+	return load;
+}
+
+static void numa_group_mem_migrate(struct numa_entity *ne, int node)
+{
+	struct numa_group *ng = ne_ng(ne);
+	struct vm_area_struct *vma;
+	struct mempolicy *mpol;
+	struct mm_struct *mm;
+	int idx;
+
+	/*
+	 * Horrid code this..
+	 *
+	 * The main problem is that ng->lock nests inside mmap_sem [
+	 * numa_vma_{,un}link() gets called under mmap_sem ]. But here we need
+	 * to iterate that list and acquire mmap_sem for each entry.
+	 *
+	 * We get here without serialization. We abuse numa_vma_unlink() to add
+	 * an SRCU delayed reference count to the mpols. This allows us to do
+	 * lockless iteration of the list.
+	 *
+	 * Once we have an mpol we need to acquire mmap_sem, this too isn't
+	 * straight fwd, take ng->lock to pin mpol->vma due to its
+	 * serialization against numa_vma_unlink(). While that vma pointer is
+	 * stable the vma->vm_mm pointer must be good too, so acquire an extra
+	 * reference to the mm.
+	 *
+	 * This reference keeps mm stable so we can drop ng->lock and acquire
+	 * mmap_sem. After which mpol->vma is stable again since the memory map
+	 * is stable. So verify ->vma is still good (numa_vma_unlink clears it)
+	 * and the mm is still the same (paranoia, can't see how that could
+	 * happen).
+	 */
+
+	idx = srcu_read_lock(&ng_srcu);
+	list_for_each_entry_rcu(mpol, &ng->vmas, ng_entry) {
+		nodemask_t mask = nodemask_of_node(node);
+
+		spin_lock(&ng->lock); /* pin mpol->vma */
+		vma = mpol->vma;
+		if (!vma) {
+			spin_unlock(&ng->lock);
+			continue;
+		}
+		mm = vma->vm_mm;
+		atomic_inc(&mm->mm_users); /* pin mm */
+		spin_unlock(&ng->lock);
+
+		down_read(&mm->mmap_sem);
+		vma = mpol->vma;
+		if (!vma)
+			goto unlock_next;
+
+		mpol_rebind_policy(mpol, &mask, MPOL_REBIND_ONCE);
+		lazy_migrate_vma(vma, node);
+unlock_next:
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+	srcu_read_unlock(&ng_srcu, idx);
+}
+
+static void numa_group_cpu_migrate(struct numa_entity *ne, int node)
+{
+	struct numa_group *ng = ne_ng(ne);
+	struct task_struct *p;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &ng->tasks, ng_entry)
+		sched_setnode(p, node);
+	rcu_read_unlock();
+}
+
+static bool numa_group_tryget(struct numa_entity *ne)
+{
+	/*
+	 * See process_tryget(), similar but against ng_put().
+	 */
+	return ng_tryget(ne_ng(ne));
+}
+
+static void numa_group_put(struct numa_entity *ne)
+{
+	ng_put(ne_ng(ne));
+}
+
+static const struct numa_ops numa_group_ops = {
+	.mem_load	= numa_group_mem_load,
+	.cpu_load	= numa_group_cpu_load,
+
+	.mem_migrate	= numa_group_mem_migrate,
+	.cpu_migrate	= numa_group_cpu_migrate,
+
+	.tryget		= numa_group_tryget,
+	.put		= numa_group_put,
+};
+
+void __numa_task_exit(struct task_struct *p)
+{
+	struct numa_group *ng = p->numa_group;
+
+	spin_lock(&ng->lock);
+	list_del_rcu(&p->ng_entry);
+	spin_unlock(&ng->lock);
+
+	p->numa_group = NULL; // XXX serialization ?!
+
+	ng_put(ng);
+}
+
+/*
+ * memory (vma) accounting/tracking
+ *
+ * We assume a 1:1 relation between vmas and mpols and keep a list of mpols in
+ * the numa_group, and a vma backlink in the mpol.
+ */
+
+void numa_vma_link(struct vm_area_struct *new, struct vm_area_struct *old)
+{
+	struct numa_group *ng = NULL;
+
+	if (old && old->vm_policy)
+		ng = old->vm_policy->numa_group;
+
+	if (!ng && new->vm_policy)
+		ng = new->vm_policy->numa_group;
+
+	if (!ng)
+		return;
+
+	ng_get(ng);
+	new->vm_policy->numa_group = ng;
+	new->vm_policy->vma = new;
+
+	spin_lock(&ng->lock);
+	list_add_rcu(&new->vm_policy->ng_entry, &ng->vmas);
+	spin_unlock(&ng->lock);
+}
+
+void __numa_add_vma_counter(struct vm_area_struct *vma, int member, long value)
+{
+	/*
+	 * Since the caller passes the vma argument, the caller is responsible
+	 * for making sure the vma is stable, hence the ->vm_policy->numa_group
+	 * dereference is safe. (caller usually has vma->vm_mm->mmap_sem for
+	 * reading).
+	 */
+	atomic_long_add(value, &vma->vm_policy->numa_group->rss.count[member]);
+}
+
+static void __mpol_put_rcu(struct rcu_head *rcu)
+{
+	struct mempolicy *mpol = container_of(rcu, struct mempolicy, rcu);
+	mpol_put(mpol);
+}
+
+void numa_vma_unlink(struct vm_area_struct *vma)
+{
+	struct mempolicy *mpol;
+	struct numa_group *ng;
+
+	if (!vma)
+		return;
+
+	mpol = vma->vm_policy;
+	if (!mpol)
+		return;
+
+	ng = mpol->numa_group;
+	if (!ng)
+		return;
+
+	spin_lock(&ng->lock);
+	list_del_rcu(&mpol->ng_entry);
+	/*
+	 * Ridiculous, see numa_group_mem_migrate.
+	 */
+	mpol->vma = NULL;
+	mpol_get(mpol);
+	call_srcu(&ng_srcu, &mpol->rcu, __mpol_put_rcu);
+	spin_unlock(&ng->lock);
+
+	ng_put(ng);
+}
+
+/*
+ * syscall bits
+ */
+
+#define MS_ID_GET	-2
+#define MS_ID_NEW	-1
+
+static struct numa_group *ng_create(struct task_struct *p)
+{
+	struct numa_group *ng;
+	int node, err;
+
+	ng = kzalloc(sizeof(*ng), GFP_KERNEL);
+	if (!ng)
+		goto fail;
+
+	err = idr_pre_get(&numa_group_idr, GFP_KERNEL);
+	if (!err)
+		goto fail_alloc;
+
+	mutex_lock(&numa_group_idr_lock);
+	err = idr_get_new(&numa_group_idr, ng, &ng->id);
+	mutex_unlock(&numa_group_idr_lock);
+
+	if (err)
+		goto fail_alloc;
+
+	spin_lock_init(&ng->lock);
+	atomic_set(&ng->ref, 1);
+	ng->cred = get_task_cred(p);
+	INIT_LIST_HEAD(&ng->tasks);
+	INIT_LIST_HEAD(&ng->vmas);
+	init_ne(&ng->numa_entity, &numa_group_ops);
+
+	dequeue_ne(&p->mm->numa); // XXX
+
+	node = find_idlest_node(tsk_home_node(p));
+	enqueue_ne(&ng->numa_entity, node);
+
+	return ng;
+
+fail_alloc:
+	kfree(ng);
+fail:
+	return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * More or less equal to ptrace_may_access(); XXX
+ */
+static int ng_allowed(struct numa_group *ng, struct task_struct *p)
+{
+	const struct cred *cred = ng->cred, *tcred;
+
+	rcu_read_lock();
+	tcred = __task_cred(p);
+	if (cred->user->user_ns == tcred->user->user_ns &&
+	    (cred->uid == tcred->euid &&
+	     cred->uid == tcred->suid &&
+	     cred->uid == tcred->uid  &&
+	     cred->gid == tcred->egid &&
+	     cred->gid == tcred->sgid &&
+	     cred->gid == tcred->gid))
+		goto ok;
+	if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+		goto ok;
+	rcu_read_unlock();
+	return -EPERM;
+
+ok:
+	rcu_read_unlock();
+	return 0;
+}
+
+static struct numa_group *ng_lookup(int ng_id, struct task_struct *p)
+{
+	struct numa_group *ng;
+
+	rcu_read_lock();
+again:
+	ng = idr_find(&numa_group_idr, ng_id);
+	if (!ng) {
+		rcu_read_unlock();
+		return ERR_PTR(-EINVAL);
+	}
+	if (ng_allowed(ng, p)) {
+		rcu_read_unlock();
+		return ERR_PTR(-EPERM);
+	}
+	if (!ng_tryget(ng))
+		goto again;
+	rcu_read_unlock();
+
+	return ng;
+}
+
+static int ng_task_assign(struct task_struct *p, int ng_id)
+{
+	struct numa_group *old_ng, *ng;
+
+	ng = ng_lookup(ng_id, p);
+	if (IS_ERR(ng))
+		return PTR_ERR(ng);
+
+	old_ng = p->numa_group; // XXX racy
+	if (old_ng) {
+		spin_lock(&old_ng->lock);
+		list_del_rcu(&p->ng_entry);
+		spin_unlock(&old_ng->lock);
+
+		/*
+		 * We have to wait for the old ng_entry users to go away before
+		 * we can re-use the link entry for the new list.
+		 */
+		synchronize_rcu();
+	}
+
+	spin_lock(&ng->lock);
+	p->numa_group = ng;
+	list_add_rcu(&p->ng_entry, &ng->tasks);
+	spin_unlock(&ng->lock);
+
+	sched_setnode(p, ng->numa_entity.node);
+
+	if (old_ng)
+		ng_put(old_ng);
+
+	return ng_id;
+}
+
+static struct task_struct *find_get_task(pid_t tid)
+{
+	struct task_struct *p;
+
+	rcu_read_lock();
+	if (!tid)
+		p = current;
+	else
+		p = find_task_by_vpid(tid);
+	if (p)
+		get_task_struct(p);
+	rcu_read_unlock();
+
+	if (!p)
+		return ERR_PTR(-ESRCH);
+
+	return p;
+}
+
+/*
+ * Bind a thread to a numa group or query its binding or create a new group.
+ *
+ * sys_numa_tbind(tid, -1, 0);	  // create new group, return new ng_id
+ * sys_numa_tbind(tid, -2, 0);	  // returns existing ng_id
+ * sys_numa_tbind(tid, ng_id, 0); // set ng_id
+ *
+ * Returns:
+ *  -ESRCH	tid->task resolution failed
+ *  -EINVAL	task didn't have a ng_id, flags was wrong
+ *  -EPERM	tid isn't in our process
+ *
+ */
+SYSCALL_DEFINE3(numa_tbind, int, tid, int, ng_id, unsigned long, flags)
+{
+	struct task_struct *p = find_get_task(tid);
+	struct numa_group *ng = NULL;
+	int orig_ng_id = ng_id;
+
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	if (flags) {
+		ng_id = -EINVAL;
+		goto out;
+	}
+
+	switch (ng_id) {
+	case MS_ID_GET:
+		ng_id = -EINVAL;
+		rcu_read_lock();
+		ng = rcu_dereference(p->numa_group);
+		if (ng)
+			ng_id = ng->id;
+		rcu_read_unlock();
+		break;
+
+	case MS_ID_NEW:
+		ng = ng_create(p);
+		if (IS_ERR(ng)) {
+			ng_id = PTR_ERR(ng);
+			break;
+		}
+		ng_id = ng->id;
+		/* fall through */
+
+	default:
+		ng_id = ng_task_assign(p, ng_id);
+		if (ng && orig_ng_id < 0)
+			ng_put(ng);
+		break;
+	}
+
+out:
+	put_task_struct(p);
+	return ng_id;
+}
+
+/*
+ * Bind a memory region to a numa group.
+ *
+ * sys_numa_mbind(addr, len, ng_id, 0);
+ *
+ * create a non-mergable vma over [addr,addr+len) and assign a mpol binding it
+ * to the numa group identified by ng_id.
+ *
+ */
+SYSCALL_DEFINE4(numa_mbind, unsigned long, addr, unsigned long, len,
+			    int, ng_id, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct mempolicy *mpol;
+	struct numa_group *ng;
+	nodemask_t mask;
+	int node, err = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	if (addr & ~PAGE_MASK)
+		return -EINVAL;
+
+	ng = ng_lookup(ng_id, current);
+	if (IS_ERR(ng))
+		return PTR_ERR(ng);
+
+	mask = nodemask_of_node(ng->numa_entity.node);
+	mpol = mpol_new(MPOL_BIND, 0, &mask);
+	if (IS_ERR(mpol)) {	/* mpol_new() returns ERR_PTR(), not NULL */
+		ng_put(ng);
+		return PTR_ERR(mpol);
+	}
+	mpol->flags |= MPOL_MF_LAZY;
+	mpol->numa_group = ng;
+
+	node = dequeue_ne(&mm->numa); // XXX
+
+	down_write(&mm->mmap_sem);
+	err = mpol_do_mbind(addr, len, mpol, MPOL_BIND,
+			&mask, MPOL_MF_MOVE|MPOL_MF_LAZY);
+	up_write(&mm->mmap_sem);
+	mpol_put(mpol);
+	ng_put(ng);
+
+	if (err && node != -1)
+		enqueue_ne(&mm->numa, node); // XXX
+
+	return err;
+}
+
+#ifdef CONFIG_COMPAT
+
+asmlinkage long compat_sys_numa_mbind(compat_ulong_t addr, compat_ulong_t len,
+				      compat_int_t ng_id, compat_ulong_t flags)
+{
+	return sys_numa_mbind(addr, len, ng_id, flags);
+}
+
+asmlinkage long compat_sys_numa_tbind(compat_int_t tid, compat_int_t ng_id,
+				      compat_ulong_t flags)
+{
+	return sys_numa_tbind(tid, ng_id, flags);
+}
+
+#endif /* CONFIG_COMPAT */
+
+static __init int numa_group_init(void)
+{
+	init_srcu_struct(&ng_srcu);
+	return 0;
+}
+early_initcall(numa_group_init);
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -103,6 +103,10 @@ cond_syscall(sys_set_mempolicy);
 cond_syscall(compat_sys_mbind);
 cond_syscall(compat_sys_get_mempolicy);
 cond_syscall(compat_sys_set_mempolicy);
+cond_syscall(sys_numa_mbind);
+cond_syscall(compat_sys_numa_mbind);
+cond_syscall(sys_numa_tbind);
+cond_syscall(compat_sys_numa_tbind);
 cond_syscall(sys_add_key);
 cond_syscall(sys_request_key);
 cond_syscall(sys_keyctl);
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -287,12 +287,13 @@ struct mempolicy *mpol_new(unsigned shor
 		}
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
-	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL | __GFP_ZERO);
 	if (!policy)
 		return ERR_PTR(-ENOMEM);
 	atomic_set(&policy->refcnt, 1);
 	policy->mode = mode;
 	policy->flags = flags;
+	INIT_LIST_HEAD(&policy->ng_entry);
 
 	return policy;
 }
@@ -607,6 +608,9 @@ static int policy_vma(struct vm_area_str
 	if (!err) {
 		mpol_get(new);
 		vma->vm_policy = new;
+		numa_vma_link(vma, NULL);
+		if (old)
+			numa_vma_unlink(old->vma);
 		mpol_put(old);
 	}
 	return err;
@@ -1994,11 +1998,13 @@ int vma_dup_policy(struct vm_area_struct
 	if (IS_ERR(mpol))
 		return PTR_ERR(mpol);
 	vma_set_policy(new, mpol);
+	numa_vma_link(new, old);
 	return 0;
 }
 
 void vma_put_policy(struct vm_area_struct *vma)
 {
+	numa_vma_unlink(vma);
 	mpol_put(vma_policy(vma));
 }
 




* [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (22 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind() Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-rss.patch --]
[-- Type: text/plain, Size: 7660 bytes --]

Somewhat invasive: add another call next to every
{add,dec}_mm_counter() that takes a vma argument instead.

Should we fold these and do a single call taking both mm and vma?
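
For illustration, one possible shape of such a folded helper (just a
sketch of the idea, not something this patch introduces; the
add_vma_counter() name is made up):

/*
 * Hypothetical folded helper: update the per-mm counter and the
 * numa_group counter hanging off the vma's policy in one call.
 * This patch keeps the two calls separate at every site.
 */
static inline void add_vma_counter(struct vm_area_struct *vma, int member,
				   long value)
{
	add_mm_counter(vma->vm_mm, member, value);
	numa_add_vma_counter(vma, member, value);
}

The *_mm_counter_fast() sites would presumably want a matching _fast
variant.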

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/filemap_xip.c |    1 +
 mm/fremap.c      |    2 ++
 mm/huge_memory.c |    4 ++++
 mm/memory.c      |   26 ++++++++++++++++++++------
 mm/rmap.c        |   14 +++++++++++---
 mm/swapfile.c    |    2 ++
 6 files changed, 40 insertions(+), 9 deletions(-)
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -196,6 +196,7 @@ __xip_unmap (struct address_space * mapp
 			pteval = ptep_clear_flush_notify(vma, address, pte);
 			page_remove_rmap(page);
 			dec_mm_counter(mm, MM_FILEPAGES);
+			numa_add_vma_counter(vma, MM_FILEPAGES, -1);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/syscalls.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mempolicy.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -40,6 +41,7 @@ static void zap_pte(struct mm_struct *mm
 			page_cache_release(page);
 			update_hiwater_rss(mm);
 			dec_mm_counter(mm, MM_FILEPAGES);
+			numa_add_vma_counter(vma, MM_FILEPAGES, -1);
 		}
 	} else {
 		if (!pte_file(pte))
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -672,6 +672,7 @@ static int __do_huge_pmd_anonymous_page(
 		prepare_pmd_huge_pte(pgtable, mm);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm->nr_ptes++;
+		numa_add_vma_counter(vma, MM_ANONPAGES, HPAGE_PMD_NR);
 		spin_unlock(&mm->page_table_lock);
 	}
 
@@ -785,6 +786,7 @@ int copy_huge_pmd(struct mm_struct *dst_
 	get_page(src_page);
 	page_dup_rmap(src_page);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	numa_add_vma_counter(vma, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
@@ -1047,6 +1049,7 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 			page_remove_rmap(page);
 			VM_BUG_ON(page_mapcount(page) < 0);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			numa_add_vma_counter(vma, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
@@ -1805,6 +1808,7 @@ static void __collapse_huge_page_copy(pt
 		if (pte_none(pteval)) {
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+			numa_add_vma_counter(vma, MM_ANONPAGES, 1);
 		} else {
 			src_page = pte_page(pteval);
 			copy_user_highpage(page, src_page, address, vma);
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -657,15 +657,19 @@ static inline void init_rss_vec(int *rss
 	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
 }
 
-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+static inline
+void add_mm_rss_vec(struct mm_struct *mm, struct vm_area_struct *vma, int *rss)
 {
 	int i;
 
 	if (current->mm == mm)
 		sync_mm_rss(current, mm);
-	for (i = 0; i < NR_MM_COUNTERS; i++)
-		if (rss[i])
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (rss[i]) {
 			add_mm_counter(mm, i, rss[i]);
+			numa_add_vma_counter(vma, i, rss[i]);
+		}
+	}
 }
 
 /*
@@ -983,7 +987,7 @@ int copy_pte_range(struct mm_struct *dst
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
 	pte_unmap(orig_src_pte);
-	add_mm_rss_vec(dst_mm, rss);
+	add_mm_rss_vec(dst_mm, vma, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
@@ -1217,7 +1221,7 @@ static unsigned long zap_pte_range(struc
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
-	add_mm_rss_vec(mm, rss);
+	add_mm_rss_vec(mm, vma, rss);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(start_pte, ptl);
 
@@ -2024,6 +2028,7 @@ static int insert_page(struct vm_area_st
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
 	inc_mm_counter_fast(mm, MM_FILEPAGES);
+	numa_add_vma_counter(vma, MM_FILEPAGES, 1);
 	page_add_file_rmap(page);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
 
@@ -2680,9 +2685,13 @@ static int do_wp_page(struct mm_struct *
 			if (!PageAnon(old_page)) {
 				dec_mm_counter_fast(mm, MM_FILEPAGES);
 				inc_mm_counter_fast(mm, MM_ANONPAGES);
+				numa_add_vma_counter(vma, MM_FILEPAGES, -1);
+				numa_add_vma_counter(vma, MM_ANONPAGES, 1);
 			}
-		} else
+		} else {
 			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			numa_add_vma_counter(vma, MM_ANONPAGES, 1);
+		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -3006,6 +3015,8 @@ static int do_swap_page(struct mm_struct
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
+	numa_add_vma_counter(vma, MM_ANONPAGES, 1);
+	numa_add_vma_counter(vma, MM_SWAPENTS, -1);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -3146,6 +3157,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
+	numa_add_vma_counter(vma, MM_ANONPAGES, 1);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
@@ -3301,9 +3313,11 @@ static int __do_fault(struct mm_struct *
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		if (anon) {
 			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			numa_add_vma_counter(vma, MM_ANONPAGES, 1);
 			page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
+			numa_add_vma_counter(vma, MM_FILEPAGES, 1);
 			page_add_file_rmap(page);
 			if (flags & FAULT_FLAG_WRITE) {
 				dirty_page = page;
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1255,10 +1255,13 @@ int try_to_unmap_one(struct page *page,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (PageAnon(page))
+		if (PageAnon(page)) {
 			dec_mm_counter(mm, MM_ANONPAGES);
-		else
+			numa_add_vma_counter(vma, MM_ANONPAGES, -1);
+		} else {
 			dec_mm_counter(mm, MM_FILEPAGES);
+			numa_add_vma_counter(vma, MM_FILEPAGES, -1);
+		}
 		set_pte_at(mm, address, pte,
 				swp_entry_to_pte(make_hwpoison_entry(page)));
 	} else if (PageAnon(page)) {
@@ -1282,6 +1285,8 @@ int try_to_unmap_one(struct page *page,
 			}
 			dec_mm_counter(mm, MM_ANONPAGES);
 			inc_mm_counter(mm, MM_SWAPENTS);
+			numa_add_vma_counter(vma, MM_ANONPAGES, -1);
+			numa_add_vma_counter(vma, MM_SWAPENTS, 1);
 		} else if (PAGE_MIGRATION) {
 			/*
 			 * Store the pfn of the page in a special migration
@@ -1299,8 +1304,10 @@ int try_to_unmap_one(struct page *page,
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-	} else
+	} else {
 		dec_mm_counter(mm, MM_FILEPAGES);
+		numa_add_vma_counter(vma, MM_FILEPAGES, -1);
+	}
 
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -1440,6 +1447,7 @@ static int try_to_unmap_cluster(unsigned
 		page_remove_rmap(page);
 		page_cache_release(page);
 		dec_mm_counter(mm, MM_FILEPAGES);
+		numa_add_vma_counter(vma, MM_FILEPAGES, -1);
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -881,6 +881,8 @@ static int unuse_pte(struct vm_area_stru
 
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	numa_add_vma_counter(vma, MM_SWAPENTS, -1);
+	numa_add_vma_counter(vma, MM_ANONPAGES, 1);
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));




* [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (23 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-07-08 18:34   ` Rik van Riel
  2012-03-16 14:40 ` [RFC][PATCH 26/26] sched, numa: A few debug bits Peter Zijlstra
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-migrate-duration.patch --]
[-- Type: text/plain, Size: 2447 bytes --]

It doesn't make much sense to memory-migrate short-running entities.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/numa.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -15,6 +15,8 @@ struct numa_ops {
 	void		(*mem_migrate)(struct numa_entity *ne, int node);
 	void		(*cpu_migrate)(struct numa_entity *ne, int node);
 
+	u64		(*cpu_runtime)(struct numa_entity *ne);
+
 	bool		(*tryget)(struct numa_entity *ne);
 	void		(*put)(struct numa_entity *ne);
 };
@@ -196,6 +198,21 @@ static void process_mem_migrate(struct n
 	lazy_migrate_process(ne_mm(ne), node);
 }
 
+static u64 process_cpu_runtime(struct numa_entity *ne)
+{
+	struct task_struct *p, *t;
+	u64 runtime = 0;
+
+	rcu_read_lock();
+	t = p = ne_owner(ne);
+	if (p) do {
+		runtime += t->se.sum_exec_runtime; // @#$#@ 32bit
+	} while ((t = next_thread(t)) != p);
+	rcu_read_unlock();
+
+	return runtime;
+}
+
 static bool process_tryget(struct numa_entity *ne)
 {
 	/*
@@ -219,6 +236,8 @@ static const struct numa_ops process_num
 	.mem_migrate	= process_mem_migrate,
 	.cpu_migrate	= process_cpu_migrate,
 
+	.cpu_runtime	= process_cpu_runtime,
+
 	.tryget		= process_tryget,
 	.put		= process_put,
 };
@@ -616,6 +635,14 @@ static bool can_move_ne(struct numa_enti
 	 * XXX: consider mems_allowed, stinking cpusets has mems_allowed
 	 * per task and it can actually differ over a whole process, la-la-la.
 	 */
+
+	/*
+	 * Don't bother migrating memory if there's less than 1 second
+	 * of runtime on the tasks.
+	 */
+	if (ne->nops->cpu_runtime(ne) < NSEC_PER_SEC)
+		return false;
+
 	return true;
 }
 
@@ -1000,6 +1027,20 @@ static void numa_group_cpu_migrate(struc
 	rcu_read_unlock();
 }
 
+static u64 numa_group_cpu_runtime(struct numa_entity *ne)
+{
+	struct numa_group *ng = ne_ng(ne);
+	struct task_struct *p;
+	u64 runtime = 0;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &ng->tasks, ng_entry)
+		runtime += p->se.sum_exec_runtime; // @#$# 32bit
+	rcu_read_unlock();
+
+	return runtime;
+}
+
 static bool numa_group_tryget(struct numa_entity *ne)
 {
 	/*
@@ -1020,6 +1061,8 @@ static const struct numa_ops numa_group_
 	.mem_migrate	= numa_group_mem_migrate,
 	.cpu_migrate	= numa_group_cpu_migrate,
 
+	.cpu_runtime	= numa_group_cpu_runtime,
+
 	.tryget		= numa_group_tryget,
 	.put		= numa_group_put,
 };




* [RFC][PATCH 26/26] sched, numa: A few debug bits
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (24 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
@ 2012-03-16 14:40 ` Peter Zijlstra
  2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 14:40 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: numa-debug.patch --]
[-- Type: text/plain, Size: 3645 bytes --]

These shouldn't ever get in.. 

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/numa.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -219,7 +219,9 @@ static u64 process_cpu_runtime(struct nu
 	rcu_read_lock();
 	t = p = ne_owner(ne);
 	if (p) do {
-		runtime += t->se.sum_exec_runtime; // @#$#@ 32bit
+		u64 tmp = t->se.sum_exec_runtime;
+		trace_printk("pid: %d ran: %llu ns\n", t->pid, tmp);
+		runtime += tmp; // @#$#@ 32bit
 	} while ((t = next_thread(t)) != p);
 	rcu_read_unlock();
 
@@ -518,7 +520,8 @@ static void update_node_load(struct node
 	 * If there was NUMA_FOREIGN load, that means this node was at its
 	 * maximum memory capacity, record that.
 	 */
-	set_max_mem_load(node_pages_load(nq->node));
+	set_max_mem_load(node_pages_load(nq->node) +
+			node_page_state(nq->node, NR_FREE_PAGES));
 }
 
 enum numa_balance_type {
@@ -556,6 +559,10 @@ static int find_busiest_node(int this_no
 		cpu_load = nq->remote_cpu_load;
 		mem_load = nq->remote_mem_load;
 
+		trace_printk("node_load(%d/%d): cpu: %ld, mem: %ld abs_cpu: %ld abs_mem: %ld\n",
+				node, this_node, cpu_load, mem_load,
+				nq->cpu_load, node_pages_load(node));
+
 		/*
 		 * If this node is overloaded on memory, we don't want more
 		 * tasks, bail!
@@ -580,6 +587,12 @@ static int find_busiest_node(int this_no
 		}
 	}
 
+	trace_printk("cpu_node: %d, cpu_load: %ld, mem_load: %ld, sum_cpu_load: %ld\n",
+			cpu_node, max_cpu_load, cpu_mem_load, sum_cpu_load);
+
+	trace_printk("mem_node: %d, cpu_load: %ld, mem_load: %ld, sum_mem_load: %ld\n",
+			mem_node, mem_cpu_load, max_mem_load, sum_mem_load);
+
 	/*
 	 * Nobody had overload of any kind, cool we're done!
 	 */
@@ -626,6 +639,9 @@ static int find_busiest_node(int this_no
 		imb->mem = (long)(node_pages_load(node) - imb->mem_load) / 2;
 	}
 
+	trace_printk("busiest_node: %d, cpu_imb: %ld, mem_imb: %ld, type: %d\n",
+			node, imb->cpu, imb->mem, imb->type);
+
 	return node;
 }
 
@@ -663,6 +679,9 @@ static void move_processes(struct node_q
 				     struct numa_entity,
 				     numa_entry);
 
+		trace_printk("numa_migrate(%d <- %d): candidate: %p\n",
+				this_nq->node, busiest_nq->node, ne);
+
 		ne_cpu = ne->nops->cpu_load(ne);
 		ne_mem = ne->nops->mem_load(ne);
 
@@ -672,20 +691,27 @@ static void move_processes(struct node_q
 			 * on the other end.
 			 */
 			if ((imb->type & NUMA_BALANCE_CPU) &&
-			    imb->cpu - cpu_moved < ne_cpu / 2)
+			    imb->cpu - cpu_moved < ne_cpu / 2) {
+				trace_printk("fail cpu: %ld %ld %ld\n", imb->cpu, cpu_moved, ne_cpu);
 				goto next;
+			}
 
 			/*
 			 * Avoid migrating ne's when we'll know we'll push our
 			 * node over the memory limit.
 			 */
 			if (max_mem_load &&
-			    imb->mem_load + mem_moved + ne_mem > max_mem_load)
+			    imb->mem_load + mem_moved + ne_mem > max_mem_load) {
+				trace_printk("fail mem: %ld %ld %ld %ld\n",
+						imb->mem_load, mem_moved, ne_mem, max_mem_load);
 				goto next;
+			}
 		}
 
-		if (!can_move_ne(ne))
+		if (!can_move_ne(ne)) {
+			trace_printk("%p, can_move_ne() fail\n", ne);
 			goto next;
+		}
 
 		__dequeue_ne(busiest_nq, ne);
 		__enqueue_ne(this_nq, ne);
@@ -702,6 +728,11 @@ static void move_processes(struct node_q
 		cpu_moved += ne_cpu;
 		mem_moved += ne_mem;
 
+		trace_printk("numa_migrate(%d <- %d): cpu_load: %ld mem_load: %ld, "
+				"cpu_moved: %ld, mem_moved: %ld\n",
+				this_nq->node, busiest_nq->node,
+				ne_cpu, ne_mem, cpu_moved, mem_moved);
+
 		if (imb->cpu - cpu_moved <= 0 &&
 		    imb->mem - mem_moved <= 0)
 			break;




* [RFC] AutoNUMA alpha6
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (25 preceding siblings ...)
  2012-03-16 14:40 ` [RFC][PATCH 26/26] sched, numa: A few debug bits Peter Zijlstra
@ 2012-03-16 18:25 ` Andrea Arcangeli
  2012-03-19 18:47   ` Peter Zijlstra
  2012-03-20 23:41   ` Dan Smith
  2012-03-19  9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
  2012-03-21 22:53 ` Nish Aravamudan
  28 siblings, 2 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-16 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Fri, Mar 16, 2012 at 03:40:28PM +0100, Peter Zijlstra wrote:
> And a few numbers...

Could you try the two trivial benchmarks I sent on lkml too? That
should take less time than the effort it took you to add those
performance numbers to perf. I use those benchmarks as a regression
test for my code. They exercise a more complex scenario than "sleep 2",
so presumably the results will be more interesting.

You find both programs in this link:

http://lists.openwall.net/linux-kernel/2012/01/27/9

These are my results.

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120126.pdf

I happened to have released the autonuma source yesterday on my git
tree:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/autonuma

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=30ed50adf6cfe85f7feb12c4279359ec52f5f2cd;hp=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1

It's a big monolithic patch, but I'll split it.

Native THP migration isn't complete yet, so it degrades more than it
should when comparing THP autonuma vs hard-bound THP. But that can be
done later, and it'll benefit move_pages or any other userland hard
binding too, not just the autonuma kernel side. I guess you need this
feature too.

The scanning rate must be tuned; it's possibly too fast by default
because all my benchmarks tend to be short-lived. There's already
lots of tuning available in /sys/kernel/mm/autonuma .

There's lots of other tuning to facilitate testing the different
algorithms. By default the numa balancing decisions will keep the
process stuck in its own node unless there's an idle cpu, but there's
a mode that lets it escape the node for load-balancing/fairness reasons
(to be closer to the stock scheduler): set load_balance_strict to zero
(the default is 1).

There's also a knuma_scand/working_set tweak to scan only the working
set and not all memory (so we only care about what's hot: if the app
has a ton of memory on some node that it isn't using, that memory won't
be accounted anymore in the memory migration and CPU migration
decisions).

There's no syscall or hint userland can give.

The migration doesn't happen during page fault. There's a proper
knuma_migrated daemon per node. Each daemon has a per-node array of
page lists. knuma_migrated0 is then woken with some hysteresis and
picks the pages that want to go from node1 to node0, from node2 to
node0, etc., in round-robin fashion across all nodes. That stops when
node0 is out of memory and the cache would have to be shrunk, or, in
most cases, when there are no more pages left to migrate. One of the
missing features is to start balancing cache around too, but I'll add
that later and I've already reserved one slot in the pgdat for it. All
the other knuma_migratedN daemons also run, so we're guaranteed to make
progress when process A is going from node0 to node1 and process B
from node1 to node0.
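
Very roughly, the per-node daemon loop described above would look
something like the following sketch. This is only an illustration of
the description, not the actual autonuma code; node_has_free_memory(),
pop_queued_page(), migrate_page_to_node() and wait_for_more_work() are
made-up helpers.

/* Sketch only: one daemon per destination node, draining per-source lists. */
static int knuma_migrated(void *arg)
{
	int dst = (long)arg;		/* node this daemon migrates *to* */

	while (!kthread_should_stop()) {
		int src, progress = 0;

		/* Round-robin over all source nodes that have pages queued. */
		for_each_online_node(src) {
			struct page *page;

			if (src == dst)
				continue;
			if (!node_has_free_memory(dst))
				break;	/* stop when dst is out of memory */

			page = pop_queued_page(dst, src);
			if (!page)
				continue;

			migrate_page_to_node(page, dst);
			progress = 1;
		}

		if (!progress)
			wait_for_more_work(dst);  /* hysteresis: sleep until woken */
	}
	return 0;
}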

All memory that isn't shared is migrated; that includes mapped
pagecache.

The basic logic is scheduler following the memory and memory following
CPU, until things converge.

I'm skeptical in general that any NUMA hinting syscall will be used by
anything except qemu, and that's what motivated my design. Hopefully in
the future CPU vendors will provide us a better way to track memory
locality than what I'm doing right now in software. The cost is almost
unmeasurable (even if you disable the pmd mode). I'm afraid that with
virt the cost could be higher because of the vmexits, but virt is
long-lived and a slower scanning rate for the memory layout info should
be ok.

Here also huge amount of improvements are possible. Hopefully it's not
too intrusive either.

I also wrote a threaded userland tool that can visually render the
status of the memory at >20 frames per second and show the memory
migration (the ones I found were written in Python, and with >8G of RAM
they just can't deliver). I was going to try to make it per-process
instead of global before releasing it; that may give another speedup
(or slowdown, I don't know for sure). It'll help explain what the code
does and show it in action. But for us, echo 1
>/sys/kernel/mm/autonuma/debug may be enough. Still, the visual thing
is cool and if done generically it would be interesting. Ideally, once
it goes per-process it should also show which CPU the process is
running on, not just where the process memory is.

 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   51 ++-
 arch/x86/include/asm/pgtable_types.h |   22 +-
 arch/x86/kernel/cpu/amd.c            |    4 +-
 arch/x86/kernel/cpu/common.c         |    4 +-
 arch/x86/kernel/setup_percpu.c       |    1 +
 arch/x86/mm/gup.c                    |    2 +-
 arch/x86/mm/numa.c                   |    9 +-
 fs/exec.c                            |    3 +
 include/asm-generic/pgtable.h        |   13 +
 include/linux/autonuma.h             |   41 +
 include/linux/autonuma_flags.h       |   62 ++
 include/linux/autonuma_sched.h       |   61 ++
 include/linux/autonuma_types.h       |   54 ++
 include/linux/huge_mm.h              |    7 +-
 include/linux/kthread.h              |    1 +
 include/linux/mm_types.h             |   29 +
 include/linux/mmzone.h               |    6 +
 include/linux/sched.h                |    4 +
 kernel/exit.c                        |    1 +
 kernel/fork.c                        |   36 +-
 kernel/kthread.c                     |   23 +
 kernel/sched/Makefile                |    3 +-
 kernel/sched/core.c                  |   13 +-
 kernel/sched/fair.c                  |   55 ++-
 kernel/sched/numa.c                  |  322 ++++++++
 kernel/sched/sched.h                 |   12 +
 mm/Kconfig                           |   13 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1465 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                     |   32 +-
 mm/memcontrol.c                      |    2 +-
 mm/memory.c                          |   36 +-
 mm/mempolicy.c                       |   15 +-
 mm/mmu_context.c                     |    2 +
 mm/page_alloc.c                      |   19 +
 36 files changed, 2376 insertions(+), 50 deletions(-)


* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-03-16 18:34   ` Christoph Lameter
  2012-03-16 21:12     ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-16 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 16 Mar 2012, Peter Zijlstra wrote:

> Add another layer of fallback policy to make the home node concept
> useful from a memory allocation PoV.
>
> This changes the mpol order to:
>
>  - vma->vm_ops->get_policy	[if applicable]
>  - vma->vm_policy		[if applicable]
>  - task->mempolicy
>  - tsk_home_node() preferred	[NEW]
>  - default_policy
>
> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> facilitate efficient on-demand memory migration.

The numa hierarchy is already complex. Could we avoid adding another layer
by adding an MPOL_HOME_NODE and making that the default?



* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-16 18:34   ` Christoph Lameter
@ 2012-03-16 21:12     ` Peter Zijlstra
  2012-03-19 13:53       ` Christoph Lameter
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-16 21:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 2012-03-16 at 13:34 -0500, Christoph Lameter wrote:
> On Fri, 16 Mar 2012, Peter Zijlstra wrote:
> 
> > Add another layer of fallback policy to make the home node concept
> > useful from a memory allocation PoV.
> >
> > This changes the mpol order to:
> >
> >  - vma->vm_ops->get_policy	[if applicable]
> >  - vma->vm_policy		[if applicable]
> >  - task->mempolicy
> >  - tsk_home_node() preferred	[NEW]
> >  - default_policy
> >
> > Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> > facilitate efficient on-demand memory migration.
> 
> The numa hierachy is already complex. Could we avoid adding another layer
> by adding a MPOL_HOME_NODE and make that the default?

Not sure that's really a win; the behaviour would be the same, we'd just
have to implement another policy, which is likely more code.


* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (26 preceding siblings ...)
  2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
@ 2012-03-19  9:57 ` Avi Kivity
  2012-03-19 11:12   ` Peter Zijlstra
  2012-03-21 22:53 ` Nish Aravamudan
  28 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-19  9:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/16/2012 04:40 PM, Peter Zijlstra wrote:
> The home-node migration handles both cpu and memory (anonymous only for now) in
> an integrated fashion. The memory migration uses migrate-on-fault to avoid
> doing a lot of work from the actual numa balancer kernl thread and only
> migrates the active memory.
>

IMO, this needs to be augmented with eager migration, for the following
reasons:

- lazy migration adds a bit of latency to page faults
- doesn't work well with large pages
- doesn't work with dma engines

So I think that in addition to migrate on fault we need a background
thread to do eager migration.  We might prioritize pages based on the
active bit in the PDE (cheaper to clear and scan than the PTE, but gives
less accurate information).

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19  9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
@ 2012-03-19 11:12   ` Peter Zijlstra
  2012-03-19 11:30     ` Peter Zijlstra
                       ` (3 more replies)
  0 siblings, 4 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 11:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 11:57 +0200, Avi Kivity wrote:
> On 03/16/2012 04:40 PM, Peter Zijlstra wrote:
> > The home-node migration handles both cpu and memory (anonymous only for now) in
> > an integrated fashion. The memory migration uses migrate-on-fault to avoid
> > doing a lot of work from the actual numa balancer kernl thread and only
> > migrates the active memory.
> >
> 
> IMO, this needs to be augmented with eager migration, for the following
> reasons:
> 
> - lazy migration adds a bit of latency to page faults

That's intentional, it keeps the work accounted to the tasks that need
it.

> - doesn't work well with large pages

That's for someone who cares about large pages to sort, isn't it? Also,
I thought you virt people only used THP anyway, and those work just fine
(they get broken down, and presumably something will build them back up
on the other side).

[ note that I equally dislike the THP daemon, I would have much
preferred that to be fault driven as well. ]

> - doesn't work with dma engines

How does that work anyway? You'd have to reprogram your dma engine, so
either the ->migratepage() callback does that and we're good either way,
or it simply doesn't work at all.

> So I think that in addition to migrate on fault we need a background
> thread to do eager migration.  We might prioritize pages based on the
> active bit in the PDE (cheaper to clear and scan than the PTE, but gives
> less accurate information).

I absolutely loathe background threads and page table scanners and will
do pretty much everything to avoid them.

The problem I have with farming work out to other entities is that it's
thereafter terribly hard to account it back to whoever caused the
actual work. Suppose your kworker thread consumes a lot of cpu time --
this time is then obviously not available to your application -- but how
do you find out what/who is causing this and cure it?

As to page table scanners, I simply don't see the point. They tend to
require arch support (I see aa introduces yet another PTE bit -- this
instantly limits the usefulness of the approach as lots of archs don't
have spare bits).

Also, if you go scan memory, you need some storage -- see how aa grows
struct page, sure he wants to move that storage some place else, but the
memory overhead is still there -- this means less memory to actually do
useful stuff in (it also probably means more cache-misses since his
proposed shadow array in pgdat is someplace else).

Also, the only really 'hard' case for the whole auto-numa business is
single processes that are bigger than a single node -- and those I pose
are 'rare'.

Now if you want to be able to scan per-thread, you need per-thread
page-tables and I really don't want to ever see that. That will blow
memory overhead and context switch times.

I guess you can limit the impact by only running the scanners on
selected processes, but that requires you add interfaces and then either
rely on admins or userspace to second guess application developers.

So no, I don't like that at all.

I'm still reading aa's patch, I haven't actually found anything I like
or agree with in there, but who knows, there's still some way to go.


* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:12   ` Peter Zijlstra
@ 2012-03-19 11:30     ` Peter Zijlstra
  2012-03-19 11:39     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 11:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 12:12 +0100, Peter Zijlstra wrote:
> Also, if you go scan memory, you need some storage -- see how aa grows
> struct page, sure he wants to move that storage some place else, but the
> memory overhead is still there -- this means less memory to actually do
> useful stuff in (it also probably means more cache-misses since his
> proposed shadow array in pgdat is someplace else). 

Going by the sizes in aa's patch, that's 96M of my 16G box gone. That
puts HPC people in the rather awkward position of having to choose between
more memory and a slightly smarter kernel. I'm thinking they're going to
opt for going the way they are now (hard affinity/userspace balancers)
and use the extra memory.

This even though typical MPI implementations use the multi-process
scheme, so the simple home-node approach I used works just fine for
them.


* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:12   ` Peter Zijlstra
  2012-03-19 11:30     ` Peter Zijlstra
@ 2012-03-19 11:39     ` Peter Zijlstra
  2012-03-19 11:42     ` Avi Kivity
  2012-03-19 13:04     ` Andrea Arcangeli
  3 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 11:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 12:12 +0100, Peter Zijlstra wrote:
> > - doesn't work well with large pages
> 
> That's for someone who cares about large pages to sort, isn't it? Also,
> I thought you virt people only used THP anyway, and those work just fine
> (they get broken down, and presumably something will build them back up
> on the other side). 

Note that all it would take is to make THP swap work. That on its own
might make sense too since writing a 2M strip of data is probably as fast
as a single 4K page on many of the rotating rust things. No idea on SSD,
but those things are typically fast either way.


* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:12   ` Peter Zijlstra
  2012-03-19 11:30     ` Peter Zijlstra
  2012-03-19 11:39     ` Peter Zijlstra
@ 2012-03-19 11:42     ` Avi Kivity
  2012-03-19 11:59       ` Peter Zijlstra
                         ` (3 more replies)
  2012-03-19 13:04     ` Andrea Arcangeli
  3 siblings, 4 replies; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 01:12 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 11:57 +0200, Avi Kivity wrote:
> > On 03/16/2012 04:40 PM, Peter Zijlstra wrote:
> > > The home-node migration handles both cpu and memory (anonymous only for now) in
> > > an integrated fashion. The memory migration uses migrate-on-fault to avoid
> > > doing a lot of work from the actual numa balancer kernl thread and only
> > > migrates the active memory.
> > >
> > 
> > IMO, this needs to be augmented with eager migration, for the following
> > reasons:
> > 
> > - lazy migration adds a bit of latency to page faults
>
> That's intentional, it keeps the work accounted to the tasks that need
> it.

The accounting part is good, the extra latency is not.  If you have
spare resources (processors or dma engines) you can employ for eager
migration, why not make use of them?

> > - doesn't work well with large pages
>
> That's for someone who cares about large pages to sort, isn't it? Also,
> I thought you virt people only used THP anyway, and those work just fine
> (they get broken down, and presumably something will build them back up
> on the other side).

Extra work, and more slowness until they get rebuilt.  Why not migrate
entire large pages?

> [ note that I equally dislike the THP daemon, I would have much
> preferred that to be fault driven as well. ]

The scanning part has to be independent, no?

> > - doesn't work with dma engines
>
> How does that work anyway? You'd have to reprogram your dma engine, so
> either the ->migratepage() callback does that and we're good either way,
> or it simply doesn't work at all.

If it's called from the faulting task's context you have to sleep, and
the latency gets increased even more, plus you're dependant on the dma
engine's backlog.  If you do all that from a background thread you don't
have to block (you might have to cancel or discard a migration if the
page was changed while being copied).

> > So I think that in addition to migrate on fault we need a background
> > thread to do eager migration.  We might prioritize pages based on the
> > active bit in the PDE (cheaper to clear and scan than the PTE, but gives
> > less accurate information).
>
> I absolutely loathe background threads and page table scanners and will
> do pretty much everything to avoid them.
>
> The problem I have with farming work out to other entities is that its
> thereafter terribly hard to account it back to whoemever caused the
> actual work. Suppose your kworker thread consumes a lot of cpu time --
> this time is then obviously not available to your application -- but how
> do you find out what/who is causing this and cure it?

I agree with this, but it's really widespread throughout the kernel,
from interrupts to work items to background threads.  It needs to be
solved generically (IIRC vhost has some accounting fix for a similar issue).

Doing everything from task context solves the accounting problem but
introduces others.

> As to page table scanners, I simply don't see the point. They tend to
> require arch support (I see aa introduces yet another PTE bit -- this
> instantly limits the usefulness of the approach as lots of archs don't
> have spare bits).
>
> Also, if you go scan memory, you need some storage -- see how aa grows
> struct page, sure he wants to move that storage some place else, but the
> memory overhead is still there -- this means less memory to actually do
> useful stuff in (it also probably means more cache-misses since his
> proposed shadow array in pgdat is someplace else).

It's the standard space/time tradeoff.  One solution wants more
storage, the other wants more faults.

Note scanners can use A/D bits which are cheaper than faults.
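
To make that concrete, here is a rough sketch of what an A-bit scan pass
could look like (purely illustrative, not taken from any of the posted
patches; note_recent_access() is a made-up placeholder, and pte locking,
THP and TLB flushing are all ignored):

/*
 * Illustrative only: walk a pte range and record which pages were
 * referenced since the previous pass, using the hardware Accessed bit
 * instead of taking faults.
 */
static void scan_ptes_sketch(struct vm_area_struct *vma, pmd_t *pmd,
			     unsigned long addr, unsigned long end)
{
	pte_t *pte, *orig_pte;

	orig_pte = pte = pte_offset_map(pmd, addr);
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		if (!pte_present(*pte))
			continue;
		/* returns the old Accessed bit and clears it */
		if (ptep_test_and_clear_young(vma, addr, pte))
			note_recent_access(vma->vm_mm, addr); /* made up */
	}
	pte_unmap(orig_pte);
}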

> Also, the only really 'hard' case for the whole auto-numa business is
> single processes that are bigger than a single node -- and those I pose
> are 'rare'.

I agree, especially as sizeof(node) keeps growing, while nr_nodes == 2
or 4, usually.

> Now if you want to be able to scan per-thread, you need per-thread
> page-tables and I really don't want to ever see that. That will blow
> memory overhead and context switch times.

I thought of only duplicating down to the PDE level, that gets rid of
almost all of the overhead.

> I guess you can limit the impact by only running the scanners on
> selected processes, but that requires you add interfaces and then either
> rely on admins or userspace to second guess application developers.
>
> So no, I don't like that at all.
>
> I'm still reading aa's patch, I haven't actually found anything I like
> or agree with in there, but who knows, there's still some way to go.

IMO we need some combination.  I like the explicit vnode approach and
binding threads explicitly to memory areas, but I think fault-time
migration is too slow.  But maybe migration will be very rare and it
won't matter.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:42     ` Avi Kivity
@ 2012-03-19 11:59       ` Peter Zijlstra
  2012-03-19 12:07         ` Avi Kivity
  2012-03-19 12:09       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 11:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > Now if you want to be able to scan per-thread, you need per-thread
> > page-tables and I really don't want to ever see that. That will blow
> > memory overhead and context switch times.
> 
> I thought of only duplicating down to the PDE level, that gets rid of
> almost all of the overhead. 

You still get the significant CR3 cost for thread switches. 

[ /me grabs the SDM to find that PDE is what we in Linux call the pmd ]

That'll cut the memory overhead down but also severely impact the
accuracy.

Also, I still don't see how such a scheme would correctly identify
per-cpu memory in guest kernels. While less frequent, it's still very
common to do remote accesses to per-cpu data. So even if you did page
granularity you'd get a fair amount of pages that are accessed by all
threads (vcpus) in the scan interval, even though they're primarily
accessed by just one.

If you go to pmd level you get even less information.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:59       ` Peter Zijlstra
@ 2012-03-19 12:07         ` Avi Kivity
  0 siblings, 0 replies; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 12:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 01:59 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > > Now if you want to be able to scan per-thread, you need per-thread
> > > page-tables and I really don't want to ever see that. That will blow
> > > memory overhead and context switch times.
> > 
> > I thought of only duplicating down to the PDE level, that gets rid of
> > almost all of the overhead. 
>
> You still get the significant CR3 cost for thread switches. 

True.  Not so much for virt, which has one thread per cpu generally.

> [ /me grabs the SDM to find that PDE is what we in Linux call the pmd ]

Yes, sorry.

> That'll cut the memory overhead down but also the severely impact the
> accuracy.
>
> Also, I still don't see how such a scheme would correctly identify
> per-cpu memory in guest kernels. While less frequent its still very
> common to do remote access to per-cpu data. So even if you did page
> granularity you'd get a fair amount of pages that are accesses by all
> threads (vcpus) in the scan interval, even thought they're primarily
> accesses by just one.
>
> If you go to pmd level you get even less information.

That is true.  Which is why I like the explicit vnode thing.  The guest
kernel already knows how to affine vcpus to memory, we don't need to
scan to see if it's actually doing what we told it to do.  Scanning is
good for unmodified non-virt applications, or to prioritize the migration.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:42     ` Avi Kivity
  2012-03-19 11:59       ` Peter Zijlstra
@ 2012-03-19 12:09       ` Peter Zijlstra
  2012-03-19 12:16         ` Avi Kivity
  2012-03-19 12:20       ` Peter Zijlstra
  2012-03-19 13:40       ` Andrea Arcangeli
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 12:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > That's intentional, it keeps the work accounted to the tasks that need
> > it.
> 
> The accounting part is good, the extra latency is not.  If you have
> spare resources (processors or dma engines) you can employ for eager
> migration why not make use of them.

Afaik we do not use dma engines for memory migration. 

In any case, if you do cross-node migration frequently enough that the
overhead of copying pages is a significant part of your time then I'm
guessing there's something wrong.

If not, the latency should be amortised enough to not matter.

> > > - doesn't work with dma engines
> >
> > How does that work anyway? You'd have to reprogram your dma engine, so
> > either the ->migratepage() callback does that and we're good either way,
> > or it simply doesn't work at all.
> 
> If it's called from the faulting task's context you have to sleep, and
> the latency gets increased even more, plus you're dependant on the dma
> engine's backlog.  If you do all that from a background thread you don't
> have to block (you might have to cancel or discard a migration if the
> page was changed while being copied). 

The current MoF implementation simply bails and uses the old page. It
will never block.

It's all a best-effort approach; a 'few' stray pages are OK as long as the
bulk of the pages are local.

If you're concerned, we can add per mm/vma counters to track this.
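
For reference, the shape of that best-effort path is roughly the
following (a sketch only, not the code from this series;
try_migrate_now() is a made-up placeholder):

/*
 * Sketch of the "never block" rule: if the page can't be copied to the
 * home node right now, keep using the old page and let a later fault
 * retry.
 */
static struct page *mof_resolve(struct page *page, int home_nid)
{
	struct page *newpage;

	if (page_to_nid(page) == home_nid)
		return page;			/* already local */

	newpage = alloc_pages_node(home_nid,
				   GFP_HIGHUSER_MOVABLE | __GFP_NOWARN, 0);
	if (!newpage)
		return page;			/* bail, use the old page */

	if (try_migrate_now(page, newpage)) {	/* made-up helper */
		__free_pages(newpage, 0);
		return page;			/* bail, use the old page */
	}
	return newpage;
}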

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks
  2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
@ 2012-03-19 12:16   ` Srivatsa S. Bhat
  2012-03-19 12:19     ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-19 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/16/2012 08:10 PM, Peter Zijlstra wrote:

> start/stop numa balance threads on-demand using cpu-hotlpug.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched/numa.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 55 insertions(+), 7 deletions(-)
> --- a/kernel/sched/numa.c
> +++ b/kernel/sched/numa.c
> @@ -596,31 +596,79 @@ static int numad_thread(void *data)
>  	return 0;
>  }
> 
> +static int __cpuinit
> +numa_hotplug(struct notifier_block *nb, unsigned long action, void *hcpu)
> +{
> +	int cpu = (long)hcpu;
> +	int node = cpu_to_node(cpu);
> +	struct node_queue *nq = nq_of(node);
> +	struct task_struct *numad;
> +	int err = 0;
> +
> +	switch (action & ~CPU_TASKS_FROZEN) {
> +	case CPU_UP_PREPARE:
> +		if (nq->numad)
> +			break;
> +
> +		numad = kthread_create_on_node(numad_thread,
> +				nq, node, "numad/%d", node);
> +		if (IS_ERR(numad)) {
> +			err = PTR_ERR(numad);
> +			break;
> +		}
> +
> +		nq->numad = numad;
> +		nq->next_schedule = jiffies + HZ; // XXX sync-up?
> +		break;
> +
> +	case CPU_ONLINE:
> +		wake_up_process(nq->numad);
> +		break;
> +
> +	case CPU_DEAD:
> +	case CPU_UP_CANCELED:
> +		if (!nq->numad)
> +			break;
> +
> +		if (cpumask_any_and(cpu_online_mask,
> +				    cpumask_of_node(node)) >= nr_cpu_ids) {
> +			kthread_stop(nq->numad);
> +			nq->numad = NULL;
> +		}
> +		break;
> +	}
> +
> +	return notifier_from_errno(err);
> +}
> +
>  static __init int numa_init(void)
>  {
> -	int node;
> +	int node, cpu, err;
> 
>  	nqs = kzalloc(sizeof(struct node_queue*) * nr_node_ids, GFP_KERNEL);
>  	BUG_ON(!nqs);
> 
> -	for_each_node(node) { // XXX hotplug
> +	for_each_node(node) {
>  		struct node_queue *nq = kmalloc_node(sizeof(*nq),
>  				GFP_KERNEL | __GFP_ZERO, node);
>  		BUG_ON(!nq);
> 
> -		nq->numad = kthread_create_on_node(numad_thread,
> -				nq, node, "numad/%d", node);
> -		BUG_ON(IS_ERR(nq->numad));
> -
>  		spin_lock_init(&nq->lock);
>  		INIT_LIST_HEAD(&nq->entity_list);
> 
>  		nq->next_schedule = jiffies + HZ;
>  		nq->node = node;
>  		nqs[node] = nq;
> +	}
> 
> -		wake_up_process(nq->numad);
> +	get_online_cpus();
> +	cpu_notifier(numa_hotplug, 0);


ABBA deadlock!

CPU 0						CPU1
				echo 0/1 > /sys/devices/.../cpu*/online

					acquire cpu_add_remove_lock

get_online_cpus()
	acquire cpu_hotplug lock
					
					Blocked on cpu hotplug lock

cpu_notifier()
	acquire cpu_add_remove_lock

ABBA DEADLOCK!

[cpu_maps_update_begin/done() deal with cpu_add_remove_lock].

So, basically, at the moment there is no way to register a CPU Hotplug notifier
and do setup for all currently online cpus in a totally race-free manner.

One approach to fix this is to audit whether register_cpu_notifier() really needs
to take cpu_add_remove_lock and if not, then acquire the cpu hotplug lock instead.

The other approach is to keep the existing lock ordering as it is and yet provide
a race-free way to register, as I had posted some time ago (incomplete/untested):

http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826


> +	for_each_online_cpu(cpu) {
> +		err = numa_hotplug(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
> +		BUG_ON(notifier_to_errno(err));
> +		numa_hotplug(NULL, CPU_ONLINE, (void *)(long)cpu);
>  	}
> +	put_online_cpus();
> 
>  	return 0;
>  }
> 
> 

 
Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 12:09       ` Peter Zijlstra
@ 2012-03-19 12:16         ` Avi Kivity
  2012-03-19 20:03           ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 02:09 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > > That's intentional, it keeps the work accounted to the tasks that need
> > > it.
> > 
> > The accounting part is good, the extra latency is not.  If you have
> > spare resources (processors or dma engines) you can employ for eager
> > migration why not make use of them.
>
> Afaik we do not use dma engines for memory migration. 

We don't, but I think we should.

> In any case, if you do cross-node migration frequently enough that the
> overhead of copying pages is a significant part of your time then I'm
> guessing there's something wrong.
>
> If not, the latency should be armortised enough to not matter.

Amortization is okay for HPC-style applications but not for interactive
applications (including servers).  It all depends on the numbers of
course; maybe migrate-on-fault is okay, but we'll need to measure it somehow.

> > > > - doesn't work with dma engines
> > >
> > > How does that work anyway? You'd have to reprogram your dma engine, so
> > > either the ->migratepage() callback does that and we're good either way,
> > > or it simply doesn't work at all.
> > 
> > If it's called from the faulting task's context you have to sleep, and
> > the latency gets increased even more, plus you're dependant on the dma
> > engine's backlog.  If you do all that from a background thread you don't
> > have to block (you might have to cancel or discard a migration if the
> > page was changed while being copied). 
>
> The current MoF implementation simply bails and uses the old page. It
> will never block.

Then it cannot use a dma engine.

> Its all a best effort approach, a 'few' stray pages is OK as long as the
> bulk of the pages are local.
>
> If you're concerned, we can add per mm/vma counters to track this.

These are second- and third-order effects.  Overall I'm happy; kvm is one
of the workloads most severely impacted by the current numa support and
this looks like it addresses most of the issues.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks
  2012-03-19 12:16   ` Srivatsa S. Bhat
@ 2012-03-19 12:19     ` Peter Zijlstra
  2012-03-19 12:27       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 12:19 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 17:46 +0530, Srivatsa S. Bhat wrote:
> > +     get_online_cpus();
> > +     cpu_notifier(numa_hotplug, 0);
> 
> 
> ABBA deadlock!
> 
Yeah, I know.. luckily it can't actually happen since early_initcalls
are pre-smp. I could just leave out the get_online_cpus() thing.
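
Concretely that would just be the quoted hunk minus the
get_online_cpus()/put_online_cpus() pair, something like the below
(only to illustrate; nothing can race before the secondary CPUs are
brought up):

static __init int numa_init(void)
{
	int cpu, err;

	/* ... node_queue allocation exactly as in the patch ... */

	/*
	 * early_initcall: secondary CPUs are not up yet, so no hotplug
	 * operation can race with this loop and no lock is needed.
	 */
	cpu_notifier(numa_hotplug, 0);
	for_each_online_cpu(cpu) {
		err = numa_hotplug(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
		BUG_ON(notifier_to_errno(err));
		numa_hotplug(NULL, CPU_ONLINE, (void *)(long)cpu);
	}

	return 0;
}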

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:42     ` Avi Kivity
  2012-03-19 11:59       ` Peter Zijlstra
  2012-03-19 12:09       ` Peter Zijlstra
@ 2012-03-19 12:20       ` Peter Zijlstra
  2012-03-19 12:24         ` Avi Kivity
  2012-03-19 13:40       ` Andrea Arcangeli
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 12:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> It's the standard space/time tradeoff.  Once solution wants more
> storage, the other wants more faults.
> 
> Note scanners can use A/D bits which are cheaper than faults.

I'm not convinced.. the scanner will still consume time even if the
system is perfectly balanced -- it has to in order to determine this.

So sure, A/D/other page table magic can make scanners faster than faults;
however, you only need faults when you're actually going to migrate a
task, whereas you always need to scan, even in the stable state.

So while the per-instance times might be in favour of scanning, I'm
thinking the accumulated time is in favour of faults.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 12:20       ` Peter Zijlstra
@ 2012-03-19 12:24         ` Avi Kivity
  2012-03-19 15:44           ` Avi Kivity
  0 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 02:20 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > It's the standard space/time tradeoff.  Once solution wants more
> > storage, the other wants more faults.
> > 
> > Note scanners can use A/D bits which are cheaper than faults.
>
> I'm not convinced.. the scanner will still consume time even if the
> system is perfectly balanced -- it has to in order to determine this.
>
> So sure, A/D/other page table magic can make scanners faster than faults
> however you only need faults when you're actually going to migrate a
> task. Whereas you always need to scan, even in the stable state.
>
> So while the per-instance times might be in favour of scanning, I'm
> thinking the accumulated time is in favour of faults.

When you migrate a vnode, you don't need the faults at all.  You know
exactly which pages need to be migrated; you can just queue them
immediately when you make that decision.
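
Something along these lines, say (illustrative only; mmap_sem/pte
locking, error handling and THP are skipped, and handing the list to
migrate_pages() from the balancer is left out):

/*
 * Sketch: when the balancer decides to move a vnode, walk its vma and
 * push every resident page onto a migration list right away instead of
 * waiting for faults.
 */
static void queue_vma_pages(struct vm_area_struct *vma,
			    struct list_head *migratelist)
{
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		struct page *page = follow_page(vma, addr, FOLL_GET);

		if (!page || IS_ERR(page))
			continue;
		if (!isolate_lru_page(page))	/* 0 == isolated */
			list_add_tail(&page->lru, migratelist);
		put_page(page);		/* drop the FOLL_GET reference */
	}
}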

The scanning therefore only needs to pick up the stragglers and can be
set to a very low frequency.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks
  2012-03-19 12:19     ` Peter Zijlstra
@ 2012-03-19 12:27       ` Srivatsa S. Bhat
  0 siblings, 0 replies; 153+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-19 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 05:49 PM, Peter Zijlstra wrote:

> On Mon, 2012-03-19 at 17:46 +0530, Srivatsa S. Bhat wrote:
>>> +     get_online_cpus();
>>> +     cpu_notifier(numa_hotplug, 0);
>>
>>
>> ABBA deadlock!
>>
> Yeah, I know.. luckily it can't actually happen since early_initcalls
> are pre-smp. I could just leave out the get_online_cpus() thing.
> 


Oh, numa_init() is an early_initcall? Ok, I didn't notice.
In that case, it's fine either way, with or without the get_online_cpus()
stuff.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:12   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2012-03-19 11:42     ` Avi Kivity
@ 2012-03-19 13:04     ` Andrea Arcangeli
  2012-03-19 13:26       ` Peter Zijlstra
                         ` (3 more replies)
  3 siblings, 4 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 12:12:07PM +0100, Peter Zijlstra wrote:
> As to page table scanners, I simply don't see the point. They tend to
> require arch support (I see aa introduces yet another PTE bit -- this
> instantly limits the usefulness of the approach as lots of archs don't
> have spare bits).

Note that for most archs supporting NUMA the pmd/pte is a pure
software representation that the hardware won't ever be able to read;
x86 is pretty much the exception here.

But the numa pte/pmd bit works identically to PROT_NONE, so you're wrong:
it doesn't need a spare bit from the hardware. It's not the same
as PTE_SPECIAL or some other bit reserved for software use that the
hardware has to be aware of.

This is a bit that is reused from the swap_entry; it only becomes
meaningful when the PRESENT bit is _not_ set, so worst case it only
needs to steal a bit from the swap_entry representation. It's not a bit
required by the hardware, it's a bit I steal from Linux. On x86 I was
careful enough to steal it from an intermediate place that wasn't used
anyway, so I didn't have to alter the swap_entry representation. I rule
out that any other arch can have an issue supporting AutoNUMA; in fact
it'll be trivial to add and it shouldn't require almost any arch change
except for defining that bit (and worst case adjusting the swap_entry
representation).
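
Just to illustrate the point (this is not the actual AutoNUMA code;
_PAGE_SW_NUMA is a made-up name for whatever spare bit the swap_entry
layout can give up, and the helpers are x86-style):

/*
 * Illustrative only: the marker is only interpreted while _PAGE_PRESENT
 * is clear, exactly like PROT_NONE, so it doesn't need a hardware
 * defined bit at all.
 */
#define _PAGE_SW_NUMA	(1UL << 11)	/* assumed spare software bit */

static inline pte_t pte_mknuma_sketch(pte_t pte)
{
	pte = pte_clear_flags(pte, _PAGE_PRESENT);
	return pte_set_flags(pte, _PAGE_SW_NUMA);
}

static inline int pte_numa_sketch(pte_t pte)
{
	return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_SW_NUMA)) ==
							_PAGE_SW_NUMA;
}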

About the cost of the actual pagetable scanner, you're not being
rational about it. You should measure it for once: take khugepaged,
make it scan 1G of memory per millisecond and measure the cost.

It is practically zero. The only measurable cost is that of the
numa hinting page fault; that concerns me too in a virt environment
because of the vmexit cost, but on the host even those are quite
insignificant and unmeasurable.

You keep complaining about the unaccountability of the pagetable
scanners in terms of process load, and that's a red herring as far as
I can tell. The irq and ksoftirqd load in a busy server is likely
much higher than whatever happens at the pagetable scanner level (sure
thing for khugepaged, and by a huge order of magnitude so). I don't
think this is a relevant concern anyway because the pagetable scanners
go over all memory in an equal amount, so the cost would be evenly
distributed over all processes over time (the same cannot be said about
the irqs and ksoftirqd, which will benefit only a few processes doing
I/O).

That it isn't a real-time feature is obvious, but then for real time
you should use numactl hard bindings and never migrate memory in the
first place.

> Also, if you go scan memory, you need some storage -- see how aa grows
> struct page, sure he wants to move that storage some place else, but the

The struct page part I didn't clean up yet (if anyone is interested,
patches welcome btw); it'll only be allocated if the system boots on
NUMA hardware and it'll be allocated in the pgdat like in memcg. I
forgot to mention this.

I already cleaned up the mm_struct and task_struct at
least... initially they were also hardcoded inside and not only
allocated if booted on NUMA hardware.

If you boot with memcg compiled in, that's taking an equivalent amount
of memory per-page.

If you can bear the memory loss when memcg is compiled in even when
not enabled, you sure can bear it on NUMA systems that have lots of
memory, so it's perfectly ok to sacrifice a bit of it so that it
performs like not-NUMA but you still have more memory than not-NUMA.

Like for memcg, if you boot with noautonuma no memory will be
allocated at all and it'll shut itself off without giving a way to
enable it at runtime (for memcg the equivalent would be
cgroup_disable=memory, and I'm not sure how many are running with that
option to save the precious memory allocated at boot for every page on
systems not using memcg at runtime...).

> Also, the only really 'hard' case for the whole auto-numa business is
> single processes that are bigger than a single node -- and those I pose
> are 'rare'.

I don't get what "hard" case means here.

Anyway AutoNUMA optimally handles a ton of processes that are smaller
than one node too. But it also handles those that are bigger and span
over multiple nodes, without having to modify the application or use
soft binding wrappers.

If you think your home node can do better than AutoNUMA when the
process is smaller than one node, benchmark it and report... I'd be
shocked if the home node can do better than AutoNUMA on any workload
involving processes smaller than one node.

It's actually the numa01 testcase that tests this very case (with
-DTHREAD_ALLOC it uses local memory for each thread that fits in the
node, without it all threads are forced to share all memory). Feel free
to try.

numa02 tests the case of one process spanning the whole system
with as many threads as physical CPUs (each thread uses local memory, or
AutoNUMA should lay out the equivalent of MADV_INTERLEAVE there; it's
not capable of such a thing yet, maybe later).

> Now if you want to be able to scan per-thread, you need per-thread
> page-tables and I really don't want to ever see that. That will blow
> memory overhead and context switch times.

I collect per-thread stats and mm-wide stats, but I don't have
per-thread pagetables. That makes it harder to detect the sharing
among different threads, which is the only disadvantage compared to
per-thread pagetables, but like you said it has fewer cons than
per-thread pagetables.

> I guess you can limit the impact by only running the scanners on
> selected processes, but that requires you add interfaces and then either
> rely on admins or userspace to second guess application developers.

No need so far. Besides, processes that aren't running won't take
numa hinting page faults, and the scanner cost is not a concern (only
the numa hinting faults are).

You keep worrying about the pagetable scanners, and you don't mention
the numa hinting faults maybe because the numa hinting faults will be
accounted perfectly by the scheduler. But really they're the only
concern here (the pagetable scanner is not and has never been).

> So no, I don't like that at all.
> 
> I'm still reading aa's patch, I haven't actually found anything I like
> or agree with in there, but who knows, there's still some way to go.

If it's proven that AutoNUMA, with automatic collection of all stats,
was a bad idea, I can change my mind too, no problem. For now I'm
grateful I was given the opportunity to let my idea materialize
despite your harsh criticism, until it converged into something that I'm
fully satisfied with in terms of the core algorithms that compute the
data and react to it (even if some cleanup is still missing, struct
page etc.. :).

If AutoNUMA is such a bad thing that you can't find anything you like
or agree with, it should be easy to beat in the numbers, especially when
you use syscalls to hint the kernel. So please go ahead and post the
numbers where you beat AutoNUMA.

If I required the apps to be modified and stopped the numa hinting
page faults, then I could probably gain something too. But I believe
we'll be better off with the kernel not requiring the apps to be
modified, even if it costs some insignificant CPU.

I believe you're massively underestimating how hard it is for people
to modify the apps or to use wrappers or anything that isn't the
default. I don't care about the niche here. For the niche there's
numactl, cpusets, and all sorts of bindings already. No need for more
niche interfaces; that is pure kernel API pollution in my view, and the
niche already has all the hard tools it needs.

I care only and exclusively about all those people that happen to have
bought a recent two-socket node, which happens to be NUMA whether they
want it or not, who sometimes find their applications run 50% slower
than they should, and who just want to run faster by upgrading the
kernel without having to touch userland at all.

If AutoNUMA is performing better than your code, I hope we'll move the
focus not to the small cost of collecting the information, but to the
algorithms we use to compute on the collected information. Because if
those are good, the boost is so huge that the small cost of collecting
the stats is lost in the noise.

And if you want to insist on complaining about the cost I incur in
collecting the stats, I recommend switching the focus from the
pagetable scanner to the numa hinting page faults (even if the latter
are perfectly accounted by the scheduler).

I tried to extrapolate the smallest possible algorithms that should
handle everything that is thrown at them while always resulting in a
net gain. I invented 4/5 other algorithms before this one and it took
me weeks of benchmarking to come up with something I was fully
satisfied with. I haven't had to change the brainer part for several
weeks already. And for anything that isn't clear, if you ask me I'll
be willing to explain it. As usual documentation is a bit lacking and
it's one of the items on my list (before any other brainer change).

In implementation terms the scheduler is simplified and it won't work
as well as it should with massive CPU overcommit. But I had to take
shortcuts to keep the complexity down to O(N), where N is the number of
CPUs (not of running processes, as it would have been if I had it
evaluate the whole thing and handle overcommit well). That will
require some rbtree or other log(n) structure to compute the same
algorithm on all runqueues and not just rq->curr, so that it
handles CPU overcommit optimally. For now it wasn't a big concern. But
if you beat AutoNUMA only with massive CPU overcommit you know why. I
tested overcommit and it still worked almost as fast as the hard
bindings; even if it's only working at the rq->curr level it still
benefits greatly over time, I just don't know how much better it would
run if I extended the math to all the running processes.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:04     ` Andrea Arcangeli
@ 2012-03-19 13:26       ` Peter Zijlstra
  2012-03-19 13:57         ` Andrea Arcangeli
  2012-03-19 14:07         ` Andrea Arcangeli
  2012-03-19 13:26       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 13:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> If you boot with memcg compiled in, that's taking an equivalent amount
> of memory per-page.
> 
> If you can bear the memory loss when memcg is compiled in even when
> not enabled, you sure can bear it on NUMA systems that have lots of
> memory, so it's perfectly ok to sacrifice a bit of it so that it
> performs like not-NUMA but you still have more memory than not-NUMA.
> 
I think the overhead of memcg is quite insane as well. And no I cannot
bear that and have it disabled in all my kernels.

NUMA systems having lots of memory is a false argument, that doesn't
mean we can just waste tons of it, people pay good money for that
memory, they want to use it.

In fact, I know that HPC people want things like swap-over-nfs so they
can push infrequently running system crap out into swap so they can get
these few extra megabytes of memory. And you're proposing they give up
~100M just like that?


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:04     ` Andrea Arcangeli
  2012-03-19 13:26       ` Peter Zijlstra
@ 2012-03-19 13:26       ` Peter Zijlstra
  2012-03-19 14:16         ` Andrea Arcangeli
  2012-03-19 13:29       ` Peter Zijlstra
  2012-03-19 13:39       ` Peter Zijlstra
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 13:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:

> About the cost of the actual pagetable scanner, you're not being
> rational about it. You should measure it for once, take khugepaged
> make it scan 1G of memory per millisecond and measure the cost.

Death by a thousand cuts.. 

> You keep complaining about the unaccountability of the pagetable
> scanners in terms of process load, and that's a red herring as far as
> I can tell. The irqs and ksoftirqd load in a busy server, is likely
> much higher than whatever happens at the pagetable scanner level (sure
> thing for khugepaged and by an huge order of magnitude so). 

Who says I agree with ksoftirqd? I would love to get rid of all things
softirq. And I also think workqueues are over-/ab-used.

> I don't
> think this is a relevant concern anyway because the pagetable scanners
> go over all memory in a equal amount so the cost would be evenly
> distributed for all processes over time (the same cannot be said about
> the irqs and ksoftrqid that will benefit only a few processes doing
> I/O). 

So what about the case where all I do is compile kernels and we already
have near perfect locality because everything is short running? You're
still scanning that memory, and I get no benefit.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:04     ` Andrea Arcangeli
  2012-03-19 13:26       ` Peter Zijlstra
  2012-03-19 13:26       ` Peter Zijlstra
@ 2012-03-19 13:29       ` Peter Zijlstra
  2012-03-19 14:19         ` Andrea Arcangeli
  2012-03-19 13:39       ` Peter Zijlstra
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 13:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> For the niche there's the
> numactl, cpusets, and all sort of bindings already. No need of more
> niche, that is pure kernel API pollution in my view, the niche has all
> its hard tools it needs already. 

Not quite, I've heard that some HPC people would very much like to relax
some of that hard binding because it's just as big a pain for them as it
is for kvm.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:04     ` Andrea Arcangeli
                         ` (2 preceding siblings ...)
  2012-03-19 13:29       ` Peter Zijlstra
@ 2012-03-19 13:39       ` Peter Zijlstra
  2012-03-19 14:20         ` Andrea Arcangeli
  3 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 13:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> In implementation terms the scheduler is simplified and it won't work
> as well as it should with massive CPU overcommit. But I had to take
> shortcuts to keep the complexity down to O(N) where N is the number of
> CPUS 

Yeah I saw that, you essentially put a nr_cpus loop inside schedule(),
obviously that's not going to ever happen.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 11:42     ` Avi Kivity
                         ` (2 preceding siblings ...)
  2012-03-19 12:20       ` Peter Zijlstra
@ 2012-03-19 13:40       ` Andrea Arcangeli
  2012-03-19 20:06         ` Peter Zijlstra
  3 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 13:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 01:42:08PM +0200, Avi Kivity wrote:
> Extra work, and more slowness until they get rebuilt.  Why not migrate
> entire large pages?

The main problem is the double copy: the first copy for migration, the
second for khugepaged. This is why we want it native over time; that way
accesses to the pages are also stopped for a shorter period of time.

> I agree with this, but it's really widespread throughout the kernel,
> from interrupts to work items to background threads.  It needs to be
> solved generically (IIRC vhost has some accouting fix for a similar issue).

Exactly.

> It's the standard space/time tradeoff.  Once solution wants more
> storage, the other wants more faults.

I didn't grow it much more than memcg, and at least if you boot on
NUMA hardware you'll be sure to use AutoNUMA. The fact that it's in
struct page is an implementation detail; it'll only be allocated if
the kernel is booted on NUMA hardware later.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-16 21:12     ` Peter Zijlstra
@ 2012-03-19 13:53       ` Christoph Lameter
  2012-03-19 14:05         ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 16 Mar 2012, Peter Zijlstra wrote:

> > > Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> > > facilitate efficient on-demand memory migration.
> >
> > The numa hierachy is already complex. Could we avoid adding another layer
> > by adding a MPOL_HOME_NODE and make that the default?
>
> Not sure that's really a win, the behaviour would be the same we just
> have to implement another policy, which is likely more code.

A HOME_NODE policy would also help to ensure that existing applications
continue to work as expected. Given that people in the HPC industry and
elsewhere have been fine tuning around the scheduler for years this is a
desirable goal and ensures backward compatibility.



^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:26       ` Peter Zijlstra
@ 2012-03-19 13:57         ` Andrea Arcangeli
  2012-03-19 14:06           ` Avi Kivity
                             ` (2 more replies)
  2012-03-19 14:07         ` Andrea Arcangeli
  1 sibling, 3 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 02:26:31PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> > If you boot with memcg compiled in, that's taking an equivalent amount
> > of memory per-page.
> > 
> > If you can bear the memory loss when memcg is compiled in even when
> > not enabled, you sure can bear it on NUMA systems that have lots of
> > memory, so it's perfectly ok to sacrifice a bit of it so that it
> > performs like not-NUMA but you still have more memory than not-NUMA.
> > 
> I think the overhead of memcg is quite insane as well. And no I cannot
> bear that and have it disabled in all my kernels.
> 
> NUMA systems having lots of memory is a false argument, that doesn't
> mean we can just waste tons of it, people pay good money for that
> memory, they want to use it.
> 
> I fact, I know that HPC people want things like swap-over-nfs so they
> can push infrequently running system crap out into swap so they can get
> these few extra megabytes of memory. And you're proposing they give up
> ~100M just like that?

With your code they will get -ENOMEM from split_vma and a slowdown in
all regular page faults and vma mangling operations, before they run
out of memory...

The per-page memory loss is 24 bytes; in page terms AutoNUMA costs 0.5%
of RAM (and only if booted on NUMA hardware, unless noautonuma is
passed as a parameter), and I can't imagine that being a problem on a
system where the hardware vendor took shortcuts to install massive
amounts of RAM that is fast to access only locally. If you buy that kind
of hardware, losing the cost of 0.5% of its RAM is ridiculous compared
to the programmer cost of patching all apps.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 13:53       ` Christoph Lameter
@ 2012-03-19 14:05         ` Peter Zijlstra
  2012-03-19 15:16           ` Christoph Lameter
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 14:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 08:53 -0500, Christoph Lameter wrote:
> On Fri, 16 Mar 2012, Peter Zijlstra wrote:
> 
> > > > Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> > > > facilitate efficient on-demand memory migration.
> > >
> > > The numa hierachy is already complex. Could we avoid adding another layer
> > > by adding a MPOL_HOME_NODE and make that the default?
> >
> > Not sure that's really a win, the behaviour would be the same we just
> > have to implement another policy, which is likely more code.
> 
> A HOME_NODE policy would also help to ensure that existing applications
> continue to work as expected. Given that people in the HPC industry and
> elsewhere have been fine tuning around the scheduler for years this is a
> desirable goal and ensures backward compatibility.

I really have no idea what you're saying. Existing applications that use
mbind/set_mempolicy already continue to function exactly like before,
see how the new layer is below all that.



^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:57         ` Andrea Arcangeli
@ 2012-03-19 14:06           ` Avi Kivity
  2012-03-19 14:30             ` Andrea Arcangeli
  2012-03-19 14:07           ` Peter Zijlstra
  2012-03-19 19:13           ` Peter Zijlstra
  2 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 14:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 03:57 PM, Andrea Arcangeli wrote:
> On Mon, Mar 19, 2012 at 02:26:31PM +0100, Peter Zijlstra wrote:
> > On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> > > If you boot with memcg compiled in, that's taking an equivalent amount
> > > of memory per-page.
> > > 
> > > If you can bear the memory loss when memcg is compiled in even when
> > > not enabled, you sure can bear it on NUMA systems that have lots of
> > > memory, so it's perfectly ok to sacrifice a bit of it so that it
> > > performs like not-NUMA but you still have more memory than not-NUMA.
> > > 
> > I think the overhead of memcg is quite insane as well. And no I cannot
> > bear that and have it disabled in all my kernels.
> > 
> > NUMA systems having lots of memory is a false argument, that doesn't
> > mean we can just waste tons of it, people pay good money for that
> > memory, they want to use it.
> > 
> > I fact, I know that HPC people want things like swap-over-nfs so they
> > can push infrequently running system crap out into swap so they can get
> > these few extra megabytes of memory. And you're proposing they give up
> > ~100M just like that?
>
> With your code they will get -ENOMEM from split_vma and a slowdown in
> all regular page faults and vma mangling operations, before they run
> out of memory...
>
> The per-page memory loss is 24bytes, AutoNUMA in page terms costs 0.5%
> of ram (and only if booted on NUMA hardware, unless noautonuma is
> passed as parameter), and I can't imagine that to be a problem on a
> system where hardware vendor took shortcuts to install massive amounts
> of RAM that is fast to access only locally. If you buy that kind of
> hardware losing the cost of 0.5% of RAM of it, is ridiculous compared
> to the programmer cost of patching all apps.

I agree with Peter on this, but perhaps it's because my app will take
all of 15 minutes to patch.  Up front knowledge is better than the
kernel discovering it on its own.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:26       ` Peter Zijlstra
  2012-03-19 13:57         ` Andrea Arcangeli
@ 2012-03-19 14:07         ` Andrea Arcangeli
  2012-03-19 19:05           ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 02:26:31PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> > If you boot with memcg compiled in, that's taking an equivalent amount
> > of memory per-page.
> > 
> > If you can bear the memory loss when memcg is compiled in even when
> > not enabled, you sure can bear it on NUMA systems that have lots of
> > memory, so it's perfectly ok to sacrifice a bit of it so that it
> > performs like not-NUMA but you still have more memory than not-NUMA.
> > 
> I think the overhead of memcg is quite insane as well. And no I cannot
> bear that and have it disabled in all my kernels.
> 
> NUMA systems having lots of memory is a false argument, that doesn't
> mean we can just waste tons of it, people pay good money for that
> memory, they want to use it.
> 
> I fact, I know that HPC people want things like swap-over-nfs so they
> can push infrequently running system crap out into swap so they can get
> these few extra megabytes of memory. And you're proposing they give up
> ~100M just like that?

If they run 20% faster they will absolutely give up the 100M.

You may want to check how many gigabytes they swap... going through
the mess of swap-over-nfs to swap _only_ ~100M would be laughable. If
they push several gigabytes to swap, ok, but then 100M more or less
won't matter.

If you intend to prove the AutoNUMA design isn't ok, do not complain
about the memory use per page, do not complain about the pagetable
scanner; only complain about the cost of the numa hinting page fault in
the presence of virt and vmexits. That is frankly my only slight
concern; it largely depends on hardware and not enough benchmarking has
been done to give it a green light yet. I am optimistic though, because
worst case the numa hinting page fault frequency should be reduced for
tasks with an mmu notifier attached to them (and in turn secondary mmus
and higher page fault costs).

The pagetable scanner and memory use will be absolutely ok.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:57         ` Andrea Arcangeli
  2012-03-19 14:06           ` Avi Kivity
@ 2012-03-19 14:07           ` Peter Zijlstra
  2012-03-19 14:34             ` Andrea Arcangeli
  2012-03-19 19:13           ` Peter Zijlstra
  2 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 14:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:57 +0100, Andrea Arcangeli wrote:
> If you buy that kind of
> hardware losing the cost of 0.5% of RAM of it, is ridiculous compared
> to the programmer cost of patching all apps. 

What all apps? Only apps that are larger than a single node need
patching.

And no, I really don't think giving up 0.5% of RAM is acceptable.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:26       ` Peter Zijlstra
@ 2012-03-19 14:16         ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 02:26:34PM +0100, Peter Zijlstra wrote:
> So what about the case where all I do is compile kernels and we already
> have near perfect locality because everything is short running? You're
> still scanning that memory, and I get no benefit.

I could add an option to delay the scan and enable it only on long-lived
"mm"s. In practice I measured the scanning cost and it was in the
unmeasurable range on the host, which is why I didn't yet; plus I tried
to avoid all special cases and to keep things as generic as possible, so
everything is treated the same. Maybe it's a good idea, maybe not, as it
delays the time it takes to react to a wrong memory layout.

If you stop knuma_scand with sysfs (echo 0 >...) the whole thing
eventually stops. It's like 3 gears, where the first gear is knuma_scand,
the second gear is the numa hinting page fault, and the third gears are
knuma_migrated and the CPU scheduler that get driven.

So it's easy to benchmark the fixed cost.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:29       ` Peter Zijlstra
@ 2012-03-19 14:19         ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 02:29:36PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> > For the niche there's the
> > numactl, cpusets, and all sort of bindings already. No need of more
> > niche, that is pure kernel API pollution in my view, the niche has all
> > its hard tools it needs already. 
> 
> Not quite, I've heard that some HPC people would very much like to relax
> some of that hard binding because its just as big a pain for them as it
> is for kvm.

Then I guess if they call hard bindings a big pain, they won't be
excited by the pain you offer them through your new soft binding
syscalls.

It's totally ok for qemu, which will just run 2 syscalls per
vnode.

But with your solution some apps will suffer from the same massive
pain that they're currently suffering. This is why it's still niche to me.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:39       ` Peter Zijlstra
@ 2012-03-19 14:20         ` Andrea Arcangeli
  2012-03-19 20:17           ` Christoph Lameter
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 02:39:31PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 14:04 +0100, Andrea Arcangeli wrote:
> > In implementation terms the scheduler is simplified and it won't work
> > as well as it should with massive CPU overcommit. But I had to take
> > shortcuts to keep the complexity down to O(N) where N is the number of
> > CPUS 
> 
> Yeah I saw that, you essentially put a nr_cpus loop inside schedule(),
> obviously that's not going to ever happen.

lol, it would be fun if such a simplification still performed better
than your code :).

Yeah, I'll try to fix that, but it's massively complex and frankly,
benchmarking-wise, fixing it won't help much... so it's beyond the
end of my todo list.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:06           ` Avi Kivity
@ 2012-03-19 14:30             ` Andrea Arcangeli
  2012-03-19 18:42               ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 04:06:43PM +0200, Avi Kivity wrote:
> I agree with Peter on this, but perhaps it's because my app will take
> all of 15 minutes to patch.  Up front knowledge is better than the
> kernel discovering it on its own.

I agree for qemu those soft bindings are fine.

But how you compute the statistical data is the most difficult part;
how you collect it is not so important after all.

The scheduler code Peter said "that's not going to ever happen" about
took several weeks of benchmarking and several rewrites to actually
materialize into a core algorithm I'm super happy with. If it's
simplified and just loops over the CPUs, it's because when you do
research and invent you can't waste time on actual implementation
details that just slow you down on the next rewrite of the algorithm,
as you have to try again. So if my algorithm (albeit in a simplified
form compared to a real full implementation) works better, not having
the background scanning won't help Peter at all and you'll still be
better off with AutoNUMA.

When you focus only on the cost of collecting the information, and no
actual discussion has been spent yet on how to compute or react to it,
something's wrong... as that's the really interesting part of the code.

The fact the autonuma_balance() is simplified implementation is still
perfectly ok right now, as that's totally hidden kernel internal
thing, not even affecting kABI. Can be improved any time. Plus it's
not like it will backfire, it just won't be running as good as a full
more complex implementation that takes into account all the
runqueues. And implementating that won't be easy at all, there are
simpler things that are more urgent at this point.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:07           ` Peter Zijlstra
@ 2012-03-19 14:34             ` Andrea Arcangeli
  2012-03-19 18:41               ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, Mar 19, 2012 at 03:07:59PM +0100, Peter Zijlstra wrote:
> And no, I really don't think giving up 0.5% of RAM is acceptable.

Fine it's up to you :).

Also note that 16 bytes of those 24 bytes you will need to spend too
if you remotely hope to perform as well as AutoNUMA (I can already
tell you...); they have absolutely nothing to do with the background
scanning that AutoNUMA does to avoid modifying the apps.

The blame you can put on AutoNUMA is 8 bytes per page only, so 0.07%,
which I can probably reduce to 0.03% if I break the natural alignment
of the list pointers and MAX_NUMNODES is < 32768 at build time; not
sure if it's worth it.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 14:05         ` Peter Zijlstra
@ 2012-03-19 15:16           ` Christoph Lameter
  2012-03-19 15:23             ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 19 Mar 2012, Peter Zijlstra wrote:

> > A HOME_NODE policy would also help to ensure that existing applications
> > continue to work as expected. Given that people in the HPC industry and
> > elsewhere have been fine tuning around the scheduler for years this is a
> > desirable goal and ensures backward compatibility.
>
> I really have no idea what you're saying. Existing applications that use
> mbind/set_mempolicy already continue to function exactly like before,
> see how the new layer is below all that.

No, they won't work the same way as before. Applications may be relying
on MPOL_DEFAULT behavior now, expecting node-local allocations. The
home-node functionality would cause a difference in behavior because it
would perform remote-node allocations when a thread has been moved to a
different socket. The changes also cause migrations that may add
latencies as well as change the location of memory in surprising ways
for the applications.


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 15:16           ` Christoph Lameter
@ 2012-03-19 15:23             ` Peter Zijlstra
  2012-03-19 15:31               ` Christoph Lameter
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 15:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 10:16 -0500, Christoph Lameter wrote:
> On Mon, 19 Mar 2012, Peter Zijlstra wrote:
> 
> > > A HOME_NODE policy would also help to ensure that existing applications
> > > continue to work as expected. Given that people in the HPC industry and
> > > elsewhere have been fine tuning around the scheduler for years this is a
> > > desirable goal and ensures backward compatibility.
> >
> > I really have no idea what you're saying. Existing applications that use
> > mbind/set_mempolicy already continue to function exactly like before,
> > see how the new layer is below all that.
> 
> No they wont work the same way as before. Applications may be relying on
> MPOL_DEFAULT behavior now expecting node local allocations. The home-node
> functionality would cause a difference in behavior because it would
> perform remote node allocs when a thread has been moved to a different
> socket. The changes also cause migrations that may cause additional
> latencies as well as change the location of memory in surprising ways for
> the applications

Still not sure what you're suggesting though; you argue to keep the
default what it is, which is in direct conflict with making the default
do something saner most of the time.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 15:23             ` Peter Zijlstra
@ 2012-03-19 15:31               ` Christoph Lameter
  2012-03-19 17:09                 ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 19 Mar 2012, Peter Zijlstra wrote:

> > No they wont work the same way as before. Applications may be relying on
> > MPOL_DEFAULT behavior now expecting node local allocations. The home-node
> > functionality would cause a difference in behavior because it would
> > perform remote node allocs when a thread has been moved to a different
> > socket. The changes also cause migrations that may cause additional
> > latencies as well as change the location of memory in surprising ways for
> > the applications
>
> Still not sure what you're suggesting though, you argue to keep the
> default what it is, this is in direct conflict with making the default
> do something saner for most of the time.

MPOL_DEFAULT is a certain type of behavior right now that applications
rely on. If you change that, then these applications will no longer
work as expected.

MPOL_DEFAULT is currently set to be the default policy on bootup. You
can change that, of course, and allow setting MPOL_DEFAULT manually for
applications that rely on the old behavior. Instead, set the default
behavior on bootup to MPOL_HOME_NODE.

So the default system behavior would be MPOL_HOME_NODE, but it could be
overridden by numactl to allow old apps to run as they are used to.
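
For illustration (a userspace sketch only, not something from this
patch set): an old app could be kept on node-local allocation either by
wrapping it with "numactl --localalloc", or by having it ask for it
itself. Whether that explicit request ends up being spelled
MPOL_DEFAULT or a new MPOL_LOCAL is exactly the naming question that
comes up later in the thread.

/* sketch: explicitly request the traditional node-local task policy */
#include <numaif.h>	/* set_mempolicy(), MPOL_DEFAULT; link with -lnuma */
#include <stdio.h>

int main(void)
{
	/* MPOL_DEFAULT takes no nodemask: plain local allocation */
	if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	printf("task policy set to default (node-local) allocation\n");
	return 0;
}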


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 12:24         ` Avi Kivity
@ 2012-03-19 15:44           ` Avi Kivity
  0 siblings, 0 replies; 153+ messages in thread
From: Avi Kivity @ 2012-03-19 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 02:24 PM, Avi Kivity wrote:
> On 03/19/2012 02:20 PM, Peter Zijlstra wrote:
> > On Mon, 2012-03-19 at 13:42 +0200, Avi Kivity wrote:
> > > It's the standard space/time tradeoff.  Once solution wants more
> > > storage, the other wants more faults.
> > > 
> > > Note scanners can use A/D bits which are cheaper than faults.
> >
> > I'm not convinced.. the scanner will still consume time even if the
> > system is perfectly balanced -- it has to in order to determine this.
> >
> > So sure, A/D/other page table magic can make scanners faster than faults
> > however you only need faults when you're actually going to migrate a
> > task. Whereas you always need to scan, even in the stable state.
> >
> > So while the per-instance times might be in favour of scanning, I'm
> > thinking the accumulated time is in favour of faults.
>
> When you migrate a vnode, you don't need the faults at all.  You know
> exactly which pages need to be migrated, you can just queue them
> immediately when you make that decision.
>
> The scanning therefore only needs to pick up the stragglers and can be
> set to a very low frequency.

Running the numbers, 4GB = 1M pages; at 2us per page migration that's
2 seconds to migrate an entire process, perhaps 2x-3x that for kvm.  So
as long as numa balancing happens at a lower frequency than once every
few minutes, the gains should be higher than the loss.  If those
numbers are not too far wrong then migrate-on-fault should be a win.
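
Spelling that arithmetic out (a trivial sketch; the 2us/page figure is
the assumption above, not a measurement):

#include <stdio.h>

int main(void)
{
	double bytes = 4.0 * (1 << 30);		/* 4GB of guest/process memory */
	double pages = bytes / 4096;		/* ~1M base pages */
	double secs  = pages * 2e-6;		/* assumed 2us per migrated page */

	printf("%.0f pages, ~%.1f s to migrate them all\n", pages, secs);
	return 0;
}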

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 15:31               ` Christoph Lameter
@ 2012-03-19 17:09                 ` Peter Zijlstra
  2012-03-19 17:28                   ` Peter Zijlstra
                                     ` (2 more replies)
  0 siblings, 3 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 17:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 10:31 -0500, Christoph Lameter wrote:

> MPOL_DEFAULT is a certain type of behavior right now that applications
> rely on. If you change that then these applications will no longer work as
> expected.
> 
> MPOL_DEFAULT is currently set to be the default policy on bootup. You can
> change that of course and allow setting MPOL_DEFAULT manually for
> applications that rely on old behavor. Instead set the default behavior on
> bootup for MPOL_HOME_NODE.
> 
> So the default system behavior would be MPOL_HOME_NODE but it could be
> overriding by numactl to allow old apps to run as they are used to run.

Ah, OK. Although that's a mightily confusing usage of the word DEFAULT.
How about instead we make MPOL_LOCAL a real policy and allow explicitly
setting that?

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 17:09                 ` Peter Zijlstra
@ 2012-03-19 17:28                   ` Peter Zijlstra
  2012-03-19 19:06                   ` Christoph Lameter
  2012-03-19 20:28                   ` Lee Schermerhorn
  2 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 18:09 +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 10:31 -0500, Christoph Lameter wrote:
> 
> > MPOL_DEFAULT is a certain type of behavior right now that applications
> > rely on. If you change that then these applications will no longer work as
> > expected.
> > 
> > MPOL_DEFAULT is currently set to be the default policy on bootup. You can
> > change that of course and allow setting MPOL_DEFAULT manually for
> > applications that rely on old behavor. Instead set the default behavior on
> > bootup for MPOL_HOME_NODE.
> > 
> > So the default system behavior would be MPOL_HOME_NODE but it could be
> > overriding by numactl to allow old apps to run as they are used to run.
> 
> Ah, OK. Although that's a mightily confusing usage of the word DEFAULT.
> How about instead we make MPOL_LOCAL a real policy and allow explicitly
> setting that?

I suspect something like the below might suffice.. still need to test it
though.

---
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_NOOP,		/* retain existing policy for range */
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -285,6 +285,10 @@ struct mempolicy *mpol_new(unsigned shor
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL | __GFP_ZERO);
@@ -2446,7 +2450,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2499,12 +2502,12 @@ int mpol_parse_str(char *str, struct mem
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:34             ` Andrea Arcangeli
@ 2012-03-19 18:41               ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 18:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 15:34 +0100, Andrea Arcangeli wrote:
> On Mon, Mar 19, 2012 at 03:07:59PM +0100, Peter Zijlstra wrote:
> > And no, I really don't think giving up 0.5% of RAM is acceptable.
> 
> Fine it's up to you :).
> 
> Also note 16 bytes of those 24 bytes, you need to spend them too if
> you remotely hope to perform as good as AutoNUMA (I can already tell
> you...), they've absolutely nothing to do with the background scanning
> that AutoNUMA does to avoid modifying the apps.

Going by that size it can only be the list head and you use that for
enqueueing the page on target node lists for page-migration. The thing
is, since you work on page granular objects you have to have this
information per-page. I work on vma objects and can do with this
information per vma.

It would be ever so much more helpful if, instead of talking in clues
and riddles, you just said what you mean. Also, try to say it without
writing a book. I still haven't completely read your first email of
today (and probably never will -- it's just too big).

> The blame on autonuma you can give is 8 bytes per page only, so 0.07%,
> which I can probably reduce 0.03% if I screw the natural alignment of
> the list pointers and MAX_NUMNODES is < 32768 at build time, not sure
> if it's worth it.

Well, no, I can blame the entire size increase on auto-numa. I don't
need to enqueue individual pages to a target node, I simply unmap
everything that's on the wrong node and the migrate-on-fault stuff will
compute the target node based on the vma information.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:30             ` Andrea Arcangeli
@ 2012-03-19 18:42               ` Peter Zijlstra
  2012-03-20 22:18                 ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 18:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 15:30 +0100, Andrea Arcangeli wrote:

> I agree for qemu those soft bindings are fine.

So for what exact program(s) are you working? Your solution seems purely
focused on the hard case of a multi-threaded application that's larger
than a single node.

While I'm sure such applications exist, how realistic is it that they're
the majority?

> But how you compute the statistical data is most difficult part, how
> you collect them not so important after all.

> When you focus only on the cost of collecting the information and no
> actual discussion was spent yet on how to compute or react to it,
> something's wrong... as that's the really interesting part of the code.

Yeah, the thing that's wrong is you dumping a ~2300-line patch of dense
code over the wall without any high-level explanation.

I just about got to the policy parts, but it's not like it's easy reading.

Also, your giving-clues-but-not-really-saying-what-you-mean attitude,
combined with your tendency to write books instead of emails, isn't
really conducive to me wanting to ask for any explanation either.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
@ 2012-03-19 18:47   ` Peter Zijlstra
  2012-03-19 19:02     ` Andrea Arcangeli
  2012-03-20 23:41   ` Dan Smith
  1 sibling, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 18:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Fri, 2012-03-16 at 19:25 +0100, Andrea Arcangeli wrote:
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=30ed50adf6cfe85f7feb12c4279359ec52f5f2cd;hp=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1
> 
> It's a big monlithic patch, but I'll split it.

I applied this big patch to a fairly recent tree from Linus but it
failed to boot. It got stuck somewhere in SMP bringup.

I waited for several seconds but pressed the remote power switch when
nothing more came out..

The last bit out of my serial console looked like:

---

Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 98000
cpu 1 node 0
cpu 1 apicid 2 node 0
Pid: 0, comm: swapper/1 Not tainted 3.3.0-rc7-00048-g762ad8a-dirty #32
Call Trace:
 [<ffffffff81942a37>] numa_set_node+0x50/0x6a
 [<ffffffff8193f0b4>] init_intel+0x13c/0x232
 [<ffffffff8193e50a>] ? get_cpu_cap+0xa3/0xa7
 [<ffffffff8193e74e>] identify_cpu+0x240/0x347
 [<ffffffff8193e869>] identify_secondary_cpu+0x14/0x1b
 [<ffffffff8194131b>] smp_store_cpu_info+0x3c/0x3e
 [<ffffffff8194176a>] start_secondary+0x109/0x21e
numa cpu 1 node 0
NMI watchdog enabled, takes one hw-pmu counter.
 #2
smpboot cpu 2: start_ip = 98000
cpu 2 node 0
cpu 2 apicid 4 node 0
Pid: 0, comm: swapper/2 Not tainted 3.3.0-rc7-00048-g762ad8a-dirty #32
Call Trace:
 [<ffffffff81942a37>] numa_set_node+0x50/0x6a
 [<ffffffff8193f0b4>] init_intel+0x13c/0x232
 [<ffffffff8193e50a>] ? get_cpu_cap+0xa3/0xa7
 [<ffffffff8193e74e>] identify_cpu+0x240/0x347
 [<ffffffff8193e869>] identify_secondary_cpu+0x14/0x1b
 [<ffffffff8194131b>] smp_store_cpu_info+0x3c/0x3e
 [<ffffffff8194176a>] start_secondary+0x109/0x21e
numa cpu 2 node 0
NMI

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-19 18:47   ` Peter Zijlstra
@ 2012-03-19 19:02     ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-19 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Mon, Mar 19, 2012 at 07:47:22PM +0100, Peter Zijlstra wrote:
> On Fri, 2012-03-16 at 19:25 +0100, Andrea Arcangeli wrote:
> > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=30ed50adf6cfe85f7feb12c4279359ec52f5f2cd;hp=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1
> > 
> > It's a big monlithic patch, but I'll split it.
> 
> I applied this big patch to a fairly recent tree from Linus but it
> failed to boot. It got stuck somewhere in SMP bringup.
> 
> I waited for several seconds but pressed the remote power switch when
> nothing more came out..
> 
> The last bit out of my serial console looked like:

btw, the dump_stack() calls in your trace are superfluous and I should
remove them... They were meant to debug a problem in numa emulation
that I fixed some time ago and sent to Andrew just a few days ago.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1

You may want to try to check out a full git tree to be sure it's not a
collision with something else; at that point of the boot stage
autonuma shouldn't be running at all, so it's unlikely to be related.

Hillf just sent me a fix for large systems which I already committed,
maybe that's your problem?

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/autonuma

I also added checks for cpu_online that are probably needed, but those
aren't visible yet; you don't need them to boot anyway...

If you want to extract the patch and all the other patches to apply
them by hand, the simplest way is:

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
git diff origin/master origin/autonuma > x

Or "git format-patch origin/master..origin/autonuma"

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:07         ` Andrea Arcangeli
@ 2012-03-19 19:05           ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 19:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 15:07 +0100, Andrea Arcangeli wrote:
> You may want to check how many gigabytes they swap... going through
> the mess of swap-over-nfs to swap _only_ ~100M would be laughable. If
> they push to swap several gigabytes ok, but then 100M more or less
> won't matter. 

They explicitly meant the regular system services that get spawned at
boot and are convenient to have around, but are mostly just there
sucking up memory. Things like sshd, crond, etc.

ps -deo pid,rss,comm | awk '{t += $2} END { print t }'

On my (otherwise idle) box gives me ~62M.



^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 17:09                 ` Peter Zijlstra
  2012-03-19 17:28                   ` Peter Zijlstra
@ 2012-03-19 19:06                   ` Christoph Lameter
  2012-03-19 20:28                   ` Lee Schermerhorn
  2 siblings, 0 replies; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 19:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 19 Mar 2012, Peter Zijlstra wrote:

> > So the default system behavior would be MPOL_HOME_NODE but it could be
> > overriding by numactl to allow old apps to run as they are used to run.
>
> Ah, OK. Although that's a mightily confusing usage of the word DEFAULT.
> How about instead we make MPOL_LOCAL a real policy and allow explicitly
> setting that?

True. MPOL_LOCAL would be a good choice.


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:57         ` Andrea Arcangeli
  2012-03-19 14:06           ` Avi Kivity
  2012-03-19 14:07           ` Peter Zijlstra
@ 2012-03-19 19:13           ` Peter Zijlstra
  2 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 19:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:57 +0100, Andrea Arcangeli wrote:
> With your code they will get -ENOMEM from split_vma and a slowdown in
> all regular page faults and vma mangling operations, before they run
> out of memory... 

But why would you want to create that many vmas? If you're going to call
sys_numa_mbind() at object level you're doing it wrong. 

Typical usage would be to call it on the chunks your allocator asks from
the system. Depending on how your application decomposes this is per
thread or per thread-pool.

But again, who is writing such large threaded apps? The shared address
space thing is cute, but it is also the bottleneck. Sharing mmap_sem et
al. across the entire machine has been enough reason not to use threads
for plenty of people.



^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 12:16         ` Avi Kivity
@ 2012-03-19 20:03           ` Peter Zijlstra
  2012-03-20 10:18             ` Avi Kivity
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 20:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:16 +0200, Avi Kivity wrote:
> > Afaik we do not use dma engines for memory migration. 
> 
> We don't, but I think we should. 

ISTR we (the community) had this discussion once. I also seem to
remember the general consensus being that DMA engines would most
likely not be worth the effort, although I can't really recall the
specifics.

Esp. for 4k pages the setup of the offload will likely be more expensive
than actually doing the memcpy.
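
If someone wants to put a number on the memcpy side of that comparison,
a trivial userspace probe like the one below will do (hot-cache and
same-node, so it's only a lower bound, and it says nothing about the
DMA descriptor setup cost, which is the other half of the argument;
gcc -O2, add -lrt on older glibc):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ	4096
#define ITERS	(1 << 18)

int main(void)
{
	char *src = malloc(PAGE_SZ), *dst = malloc(PAGE_SZ);
	struct timespec t0, t1;
	double ns;
	int i;

	memset(src, 0xaa, PAGE_SZ);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src, PAGE_SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("avg memcpy(4k): %.0f ns (%d)\n", ns / ITERS, dst[0]);
	free(src);
	free(dst);
	return 0;
}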

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 13:40       ` Andrea Arcangeli
@ 2012-03-19 20:06         ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 20:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Mon, 2012-03-19 at 14:40 +0100, Andrea Arcangeli wrote:
> > I agree with this, but it's really widespread throughout the kernel,
> > from interrupts to work items to background threads.  It needs to be
> > solved generically (IIRC vhost has some accouting fix for a similar issue).
> 
> Exactly. 

The fact that we all agree it's a problem and that nobody has a sane
idea of how to solve it might argue against this.

Also, the fact that there's existing ugliness isn't an excuse to stop
worrying about it and add more.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 14:20         ` Andrea Arcangeli
@ 2012-03-19 20:17           ` Christoph Lameter
  2012-03-19 20:28             ` Ingo Molnar
  0 siblings, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 20:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Avi Kivity, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	linux-kernel, linux-mm

On Mon, 19 Mar 2012, Andrea Arcangeli wrote:

> Yeah I'll try to fix that but it's massively complex and frankly
> benchmarking wise it won't help much fixing that... so it's beyond the
> end of my todo list.

Well a word of caution here: SGI tried to implement automatic migration
schemes back in the 90's but they were never able to show a general
benefit of migration. The overhead added because of auto migration often
was not made up by true acceleration of the applications running on the
system. They were able to tune the automatic migration to work on
particular classes of applications but it never turned out to be generally
advantageous.

I wonder how we can verify that the automatic migration schemes are a real
benefit to the application? We have a history of developing a kernel that
decreases in performance as development proceeds. How can we make sure
that these schemes are actually beneficial overall for all loads and do
not cause regressions elsewhere?

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 17:09                 ` Peter Zijlstra
  2012-03-19 17:28                   ` Peter Zijlstra
  2012-03-19 19:06                   ` Christoph Lameter
@ 2012-03-19 20:28                   ` Lee Schermerhorn
  2012-03-19 21:21                     ` Peter Zijlstra
  2 siblings, 1 reply; 153+ messages in thread
From: Lee Schermerhorn @ 2012-03-19 20:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	linux-kernel, linux-mm

On Mon, 2012-03-19 at 18:09 +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 10:31 -0500, Christoph Lameter wrote:
> 
> > MPOL_DEFAULT is a certain type of behavior right now that applications
> > rely on. If you change that then these applications will no longer work as
> > expected.
> > 
> > MPOL_DEFAULT is currently set to be the default policy on bootup. You can
> > change that of course and allow setting MPOL_DEFAULT manually for
> > applications that rely on old behavor. Instead set the default behavior on
> > bootup for MPOL_HOME_NODE.
> > 
> > So the default system behavior would be MPOL_HOME_NODE but it could be
> > overriding by numactl to allow old apps to run as they are used to run.
> 
> Ah, OK. Although that's a mightily confusing usage of the word DEFAULT.
> How about instead we make MPOL_LOCAL a real policy and allow explicitly
> setting that?
> 

Maybe it's less confusing if you don't think of MPOL_DEFAULT as a real
mempolicy.  As the value of the mode parameter to mbind(2) and
internally, it indicates that "default behavior" is requested or being
used.   It's not stored in the mode member of a mempolicy structure like
MPOL_BIND and others.  Nor is it used in the page allocation path.  The
actual implementation is the absence of a non-default mempolicy -- i.e.,
a NULL task or vma/shared policy pointer.

Because default behavior for task policy is local allocation,
MPOL_DEFAULT does sometimes get confused with local allocation. The
NOTES section and the description of MPOL_DEFAULT in the mbind(2) man
page attempt to clarify this.
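
To make that concrete, here is a deliberately simplified, self-contained
sketch of the fall-through (the struct and field names are invented for
illustration; this is not the kernel's code):

#include <stdio.h>
#include <stddef.h>

struct mempolicy { const char *mode; };		/* toy stand-in */
struct vma	 { struct mempolicy *policy; };
struct task	 { struct mempolicy *mempolicy; };

/* "default" is simply the absence of a policy object anywhere */
static const char *effective_mode(struct vma *vma, struct task *tsk)
{
	struct mempolicy *pol = vma ? vma->policy : NULL;

	if (!pol)
		pol = tsk->mempolicy;
	if (!pol)
		return "default: allocate on the local node";
	return pol->mode;
}

int main(void)
{
	struct task tsk = { .mempolicy = NULL };
	struct vma  vma = { .policy = NULL };

	printf("%s\n", effective_mode(&vma, &tsk));
	return 0;
}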





^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 20:17           ` Christoph Lameter
@ 2012-03-19 20:28             ` Ingo Molnar
  2012-03-19 20:43               ` Christoph Lameter
  2012-03-20  0:05               ` Linus Torvalds
  0 siblings, 2 replies; 153+ messages in thread
From: Ingo Molnar @ 2012-03-19 20:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Peter Zijlstra, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm


* Christoph Lameter <cl@linux.com> wrote:

> On Mon, 19 Mar 2012, Andrea Arcangeli wrote:
> 
> > Yeah I'll try to fix that but it's massively complex and 
> > frankly benchmarking wise it won't help much fixing that... 
> > so it's beyond the end of my todo list.
> 
> Well a word of caution here: SGI tried to implement automatic 
> migration schemes back in the 90's but they were never able to 
> show a general benefit of migration. The overhead added 
> because of auto migration often was not made up by true 
> acceleration of the applications running on the system. They 
> were able to tune the automatic migration to work on 
> particular classes of applications but it never turned out to 
> be generally advantageous.

Obviously any such scheme must be a win in general for it to be 
default on. We don't have the numbers to justify that - and I'm 
sceptical whether it will be possible, but I'm willing to be 
surprised.

I'm especially sceptical since most mainstream NUMA systems tend 
to have a low NUMA factor. Thus the actual cost of being NUMA is 
pretty low.

That having said PeterZ's numbers showed some pretty good 
improvement for the streams workload:

 before: 512.8M
  after: 615.7M

i.e. a +20% improvement on a not very heavily NUMA box.

That kind of raw speedup of a CPU execution workload like 
streams is definitely not something to ignore out of hand. *IF* 
there is a good automatism that can activate it for the apps 
that are very likely to benefit from it then we can possibly do 
it.

But a lot more measurements have to be done, and I'd be also 
very interested in the areas that regress.

Otherwise, if no robust automation is possible, it will have to 
be opt-in, on a per-app basis, with both programmatic and 
sysadmin knobs available (which will hopefully get used...).

That's the best we can do I think.

> I wonder how we can verify that the automatic migration 
> schemes are a real benefit to the application? We have a 
> history of developing a kernel that decreases in performance 
> as development proceeds. How can we make sure that these 
> schemes are actually beneficial overall for all loads and do 
> not cause regressions elsewhere? [...]

The usual way?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 20:28             ` Ingo Molnar
@ 2012-03-19 20:43               ` Christoph Lameter
  2012-03-19 21:34                 ` Ingo Molnar
  2012-03-20  0:05               ` Linus Torvalds
  1 sibling, 1 reply; 153+ messages in thread
From: Christoph Lameter @ 2012-03-19 20:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Peter Zijlstra, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm

On Mon, 19 Mar 2012, Ingo Molnar wrote:

> > I wonder how we can verify that the automatic migration
> > schemes are a real benefit to the application? We have a
> > history of developing a kernel that decreases in performance
> > as development proceeds. How can we make sure that these
> > schemes are actually beneficial overall for all loads and do
> > not cause regressions elsewhere? [...]
>
> The usual way?

Which is merge after a couple of benchmarks and then deal with the
regressions for a couple of years?

Patch verification occurs in an artificial bubble of software run/known by
kernel developers. It can take years before the code is exposed to
real life situations.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware
  2012-03-19 20:28                   ` Lee Schermerhorn
@ 2012-03-19 21:21                     ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-19 21:21 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	linux-kernel, linux-mm

On Mon, 2012-03-19 at 16:28 -0400, Lee Schermerhorn wrote:
> Because default behavior for task policy is local allocation,
> MPOL_DEFAULT does sometimes get confused with local allocation. 

Right, its this confusion I wanted to avoid.


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 20:43               ` Christoph Lameter
@ 2012-03-19 21:34                 ` Ingo Molnar
  0 siblings, 0 replies; 153+ messages in thread
From: Ingo Molnar @ 2012-03-19 21:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Peter Zijlstra, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm


* Christoph Lameter <cl@linux.com> wrote:

> On Mon, 19 Mar 2012, Ingo Molnar wrote:
> 
> > > I wonder how we can verify that the automatic migration 
> > > schemes are a real benefit to the application? We have a 
> > > history of developing a kernel that decreases in 
> > > performance as development proceeds. How can we make sure 
> > > that these schemes are actually beneficial overall for all 
> > > loads and do not cause regressions elsewhere? [...]
> >
> > The usual way?
> 
> Which is merge after a couple of benchmarks and then deal with 
> the regressions for a couple of years?
>
> [...]

No, and I gave you my answer:

> Obviously any such scheme must be a win in general for it to be 
> default on. We don't have the numbers to justify that - and I'm 
> sceptical whether it will be possible, but I'm willing to be 
> surprised.
> 
> I'm especially sceptical since most mainstream NUMA systems tend 
> to have a low NUMA factor. Thus the actual cost of being NUMA is 
> pretty low.
> 
> That having said PeterZ's numbers showed some pretty good 
> improvement for the streams workload:
> 
>  before: 512.8M
>   after: 615.7M
> 
> i.e. a +20% improvement on a not very heavily NUMA box.
> 
> That kind of raw speedup of a CPU execution workload like 
> streams is definitely not something to ignore out of hand. *IF* 
> there is a good automatism that can activate it for the apps 
> that are very likely to benefit from it then we can possibly do 
> it.
> 
> But a lot more measurements have to be done, and I'd be also 
> very interested in the areas that regress.
> 
> Otherwise, if no robust automation is possible, it will have to 
> be opt-in, on a per app basis, with both programmatic and 
> sysadmin knobs available. (who will hopefully make use if it...)
> 
> That's the best we can do I think.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 20:28             ` Ingo Molnar
  2012-03-19 20:43               ` Christoph Lameter
@ 2012-03-20  0:05               ` Linus Torvalds
  2012-03-20  7:31                 ` Ingo Molnar
  1 sibling, 1 reply; 153+ messages in thread
From: Linus Torvalds @ 2012-03-20  0:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Andrea Arcangeli, Peter Zijlstra, Avi Kivity,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm

On Mon, Mar 19, 2012 at 1:28 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> That having said PeterZ's numbers showed some pretty good
> improvement for the streams workload:
>
>  before: 512.8M
>  after: 615.7M
>
> i.e. a +20% improvement on a not very heavily NUMA box.

Well, streams really isn't a very interesting benchmark. It's the
traditional single-threaded cpu-only thing that just accesses things
linearly, and I'm not convinced the numbers should be taken to mean
anything at all.

The HPC people want to multi-thread things these days, and "cpu/memory
affinity" is a lot less clear then.

So I can easily imagine that the performance improvement is real, but
I really don't think "streams improves by X %" is all that
interesting. Are there any more relevant loads that actually matter to
people that we could show improvement on?

                     Linus

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20  0:05               ` Linus Torvalds
@ 2012-03-20  7:31                 ` Ingo Molnar
  0 siblings, 0 replies; 153+ messages in thread
From: Ingo Molnar @ 2012-03-20  7:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Andrea Arcangeli, Peter Zijlstra, Avi Kivity,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Mar 19, 2012 at 1:28 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > That having said PeterZ's numbers showed some pretty good
> > improvement for the streams workload:
> >
> >  before: 512.8M
> >  after: 615.7M
> >
> > i.e. a +20% improvement on a not very heavily NUMA box.
> 
> Well, streams really isn't a very interesting benchmark. It's 
> the traditional single-threaded cpu-only thing that just 
> accesses things linearly, and I'm not convinced the numbers 
> should be taken to mean anything at all.

Yeah, I considered it the 'ideal improvement' for memory-bound, 
private-working-set workloads on commodity hardware - i.e. the 
upper envelope of anything that might matter. We don't know the 
worst-case regression percentage, nor the median improvement - 
which might very well be a negative number.

More fundamentally we don't even know whether such access 
patterns matter at all.

> The HPC people want to multi-thread things these days, and 
> "cpu/memory affinity" is a lot less clear then.
> 
> So I can easily imagine that the performance improvement is 
> real, but I really don't think "streams improves by X %" is 
> all that interesting. Are there any more relevant loads that 
> actually matter to people that we could show improvement on?

That would be interesting to see.

I could queue this up in a topical branch in a pure opt-in 
fashion, to make it easier to test.

Assuming there will be real improvements on real workloads, do 
you have any fundamental objections against the 'home node' 
concept itself and its placement into mm_struct? I think it 
makes sense and mm_struct is the most logical place to host it.

The rest looks rather non-controversial to me, apps that want 
more memory affinity should get it and both the VM and the 
scheduler should help achieve that goal, within memory and CPU 
allocation constraints.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 20:03           ` Peter Zijlstra
@ 2012-03-20 10:18             ` Avi Kivity
  2012-03-20 10:48               ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-20 10:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 10:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 14:16 +0200, Avi Kivity wrote:
> > > Afaik we do not use dma engines for memory migration. 
> > 
> > We don't, but I think we should. 
>
> ISTR we (the community) had this discussion once. I also seem to
> remember the general consensus being that DMA engines would mostly
> likely not be worth the effort, although I can't really recall the
> specifics.
>
> Esp. for 4k pages the setup of the offload will likely be more expensive
> than actually doing the memcpy.

If you're copying a page, yes.  If you're copying a large vma, the
per-page setup cost is likely to be very low.

Especially if you're copying across nodes.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 10:18             ` Avi Kivity
@ 2012-03-20 10:48               ` Peter Zijlstra
  2012-03-20 10:52                 ` Avi Kivity
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-20 10:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Tue, 2012-03-20 at 12:18 +0200, Avi Kivity wrote:
> On 03/19/2012 10:03 PM, Peter Zijlstra wrote:
> > On Mon, 2012-03-19 at 14:16 +0200, Avi Kivity wrote:
> > > > Afaik we do not use dma engines for memory migration. 
> > > 
> > > We don't, but I think we should. 
> >
> > ISTR we (the community) had this discussion once. I also seem to
> > remember the general consensus being that DMA engines would mostly
> > likely not be worth the effort, although I can't really recall the
> > specifics.
> >
> > Esp. for 4k pages the setup of the offload will likely be more expensive
> > than actually doing the memcpy.
> 
> If you're copying a page, yes.  If you're copying a large vma, the
> per-page setup cost is likely to be very low.
> 
> Especially if you're copying across nodes.

But wouldn't you then have to wait for the entire copy to complete
before accessing any of the memory? That sounds like a way worse latency
hit than the per-page lazy-migrate.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 10:48               ` Peter Zijlstra
@ 2012-03-20 10:52                 ` Avi Kivity
  2012-03-20 11:07                   ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Avi Kivity @ 2012-03-20 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/20/2012 12:48 PM, Peter Zijlstra wrote:
> On Tue, 2012-03-20 at 12:18 +0200, Avi Kivity wrote:
> > On 03/19/2012 10:03 PM, Peter Zijlstra wrote:
> > > On Mon, 2012-03-19 at 14:16 +0200, Avi Kivity wrote:
> > > > > Afaik we do not use dma engines for memory migration. 
> > > > 
> > > > We don't, but I think we should. 
> > >
> > > ISTR we (the community) had this discussion once. I also seem to
> > > remember the general consensus being that DMA engines would mostly
> > > likely not be worth the effort, although I can't really recall the
> > > specifics.
> > >
> > > Esp. for 4k pages the setup of the offload will likely be more expensive
> > > than actually doing the memcpy.
> > 
> > If you're copying a page, yes.  If you're copying a large vma, the
> > per-page setup cost is likely to be very low.
> > 
> > Especially if you're copying across nodes.
>
> But wouldn't you then have to wait for the entire copy to complete
> before accessing any of the memory? That sounds like a way worse latency
> hit than the per-page lazy-migrate.

You use the dma engine for eager copying, not on demand.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 10:52                 ` Avi Kivity
@ 2012-03-20 11:07                   ` Peter Zijlstra
  2012-03-20 11:48                     ` Avi Kivity
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-20 11:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Tue, 2012-03-20 at 12:52 +0200, Avi Kivity wrote:

> You use the dma engine for eager copying, not on demand.

Sure, but during that time no access to that entire vma is allowed, so
you have to unmap it, and any fault in there will have to wait for the
entire copy to complete.

Or am I misunderstanding how things would work?

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 11:07                   ` Peter Zijlstra
@ 2012-03-20 11:48                     ` Avi Kivity
  0 siblings, 0 replies; 153+ messages in thread
From: Avi Kivity @ 2012-03-20 11:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/20/2012 01:07 PM, Peter Zijlstra wrote:
> On Tue, 2012-03-20 at 12:52 +0200, Avi Kivity wrote:
>
> > You use the dma engine for eager copying, not on demand.
>
> Sure, but during that time no access to that entire vma is allowed, so
> you have to unmap it, and any fault in there will have to wait for the
> entire copy to complete.
>
> Or am I misunderstanding how things would work?

Option 1: write-protect the area you are migrating; on write fault,
allow write access and discard the migration target (marking the page
for migration later).

Option 2: clear the dirty bits on the area you are migrating; after
migration completes, examine the dirty bit and, if dirty, discard the
migration target.
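
A toy rendering of option 2's decision logic, just to make it concrete
(plain userspace C; the names are invented and no real page tables or
A/D bits are involved):

#include <stdbool.h>
#include <stdio.h>

struct page_state {
	bool dirty;	/* stands in for the hardware dirty bit */
	int  node;	/* node the authoritative copy lives on */
};

/* eagerly copy to dst_node; keep the copy only if the page stayed clean */
static void migrate_eagerly(struct page_state *pg, int dst_node,
			    bool written_during_copy)
{
	pg->dirty = false;		/* clear the dirty bit, start the copy */
	/* ... eager (DMA) copy of the page to dst_node runs here ... */
	if (written_during_copy)
		pg->dirty = true;	/* a write raced with the copy */

	if (pg->dirty) {
		printf("written during copy: discard target, re-queue page\n");
		return;
	}
	pg->node = dst_node;		/* commit: map the new copy, drop the old one */
	printf("page now on node %d\n", pg->node);
}

int main(void)
{
	struct page_state pg = { .dirty = true, .node = 0 };

	migrate_eagerly(&pg, 1, false);	/* clean during copy -> page moves */
	migrate_eagerly(&pg, 0, true);	/* racing write -> target discarded */
	return 0;
}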

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-19 18:42               ` Peter Zijlstra
@ 2012-03-20 22:18                 ` Rik van Riel
  2012-03-21 16:50                   ` Andrea Arcangeli
  2012-04-02 16:34                   ` Pekka Enberg
  0 siblings, 2 replies; 153+ messages in thread
From: Rik van Riel @ 2012-03-20 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner, linux-kernel,
	linux-mm

On 03/19/2012 02:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-19 at 15:30 +0100, Andrea Arcangeli wrote:
>
>> I agree for qemu those soft bindings are fine.
>
> So for what exact program(s) are you working? Your solution seems purely
> focused on the hard case of a multi-threaded application that's larger
> than a single node.
>
> While I'm sure such applications exist, how realistic is it that they're
> the majority?

I suspect Java and other runtimes may have issues where
they simply do not know which thread will end up using
which objects from the heap heavily.

However, even for those, migrate-on-fault could be a
reasonable alternative to scanning + proactive migration.

It is really too early to tell which of the approaches is
going to work best.

>> When you focus only on the cost of collecting the information and no
>> actual discussion was spent yet on how to compute or react to it,
>> something's wrong... as that's the really interesting part of the code.
>
> Yeah, the thing that's wrong is you dumping a ~2300 line patch of dense
> code over the wall without any high-level explanation.
>
> I just about got to the policy parts but its not like its easy reading.
>
> Also, you giving clues but not really saying what you mean attitude
> combined with your tendency to write books instead of emails isn't
> really conductive to me wanting to ask for any explanation either.

Getting high level documentation of the ideas behind both
of the NUMA implementations could really help smooth out
the debate.

The more specifics on the ideas and designs we have, the
easier it will be to look past small details in the code
and debate the merits of the design (before we get to
cleaning up whichever bits of code we end up choosing).

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
  2012-03-19 18:47   ` Peter Zijlstra
@ 2012-03-20 23:41   ` Dan Smith
  2012-03-21  1:00     ` Andrea Arcangeli
                       ` (2 more replies)
  1 sibling, 3 replies; 153+ messages in thread
From: Dan Smith @ 2012-03-20 23:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA> Could you try my two trivial benchmarks I sent on lkml too?

I just got around to running your numa01 test on mainline, autonuma, and
numasched.  This is on a 2-socket, 6-cores-per-socket,
2-threads-per-core machine, with your test configured to run 24
threads. I also ran Peter's modified stream_d on all three as well, with
24 instances in parallel. I know it's already been pointed out that it's
not the ideal or end-all benchmark, but I figured it was still
worthwhile to see if the trend continued.

On your numa01 test:

  Autonuma is 22% faster than mainline
  Numasched is 42% faster than mainline

On Peter's modified stream_d test:

  Autonuma is 35% *slower* than mainline
  Numasched is 55% faster than mainline

I know that the "real" performance guys here are going to be posting
some numbers from more interesting benchmarks soon, but since nobody
had answered Andrea's question, I figured I'd do it.

-- 
Dan Smith
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-20 23:41   ` Dan Smith
@ 2012-03-21  1:00     ` Andrea Arcangeli
  2012-03-21  2:12     ` Andrea Arcangeli
  2012-03-21  7:53     ` Ingo Molnar
  2 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21  1:00 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Tue, Mar 20, 2012 at 04:41:06PM -0700, Dan Smith wrote:
> AA> Could you try my two trivial benchmarks I sent on lkml too?
> 
> I just got around to running your numa01 test on mainline, autonuma, and
> numasched.  This is on a 2-socket, 6-cores-per-socket,
> 2-threads-per-core machine, with your test configured to run 24
> threads. I also ran Peter's modified stream_d on all three as well, with
> 24 instances in parallel. I know it's already been pointed out that it's
> not the ideal or end-all benchmark, but I figured it was still
> worthwhile to see if the trend continued.
> 
> On your numa01 test:
> 
>   Autonuma is 22% faster than mainline
>   Numasched is 42% faster than mainline

Can you please disable THP for the benchmarks? Until native THP
migration is available, it tends to skew the results because the
migrated memory is not backed by THP.

Or if you prefer not to disable THP, just set
khugepaged/scan_sleep_millisecs to 10.
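
Something like this should do it (assuming the usual
transparent_hugepage sysfs paths):

  # disable THP entirely:
  echo never > /sys/kernel/mm/transparent_hugepage/enabled

  # ...or keep THP enabled but make khugepaged rescan quickly:
  echo 10 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs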

Can you also build it with:

gcc -DNO_BIND_FORCE_SAME_NODE -O2 -o numa01 kernel/proggy/numa01.c -lnuma -lpthread

If you can run numa02 as well (no special -D flags there), that would
be interesting.

You could report the results of -DHARD_BIND and -DINVERSE_BIND too as
a sanity check.

My raw numbers are:

numa01 -DNO_BIND_FORCE_SAME_NODE (12 threads per process, 2 processes)
threads use shared memory

upstream 3.2	bind	reverse bind	autonuma
305.36	        196.07	378.34	        207.47

What's the percentage if you calculate the same way you did on your
numbers?

(I don't know how you calculated it)
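
For instance, computed as relative savings over upstream 3.2 from my
raw numbers above (just one possible way to calculate it):

  (305.36 - 207.47) / 305.36 = ~32%  autonuma vs upstream
  (305.36 - 196.07) / 305.36 = ~36%  hard bind vs upstream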

Maybe it was the lack of -DNO_BIND_FORCE_SAME_NODE that reduced the
difference, but it shouldn't have been so different. Maybe it was a THP
effect, I don't know.

> On Peter's modified stream_d test:
> 
>   Autonuma is 35% *slower* than mainline
>   Numasched is 55% faster than mainline

Is the modified stream_d posted somewhere? I missed it. How long does
it take to run? What's its measurement error? In my tests the
measurement error is within 2%.

In the meantime I've more benchmark data too (including a worst case
kernel build benchmark with autonuma on and off with THP on) and
specjbb (with THP on).

You can jump to slide 8 if you already read the previous pdf:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120321.pdf

If numbers like the above are confirmed across the board, including
specjbb and pretty much everything else, I'll happily "rm -r autonuma"
:). For now your numa01 numbers are so out of sync with mine that I
wouldn't draw too many conclusions from them; I'm already so close to
the hard bindings in numa01 that there's no way anything else can
perform 20% faster than AutoNUMA.

The numbers in slide 10 of the pdf were provided to me by a
professional, I didn't measure it myself.

And about measurement errors: numa01 is 100% reproducible here; I've
run it in a loop for months and not once has it deviated more than
10sec from the average.

About stream_d, the only way autonuma could underperform by -35% is if
you get a massive amount of migration going to the wrong place in a
thrashing way. I've never seen that happen here since I started running
these algorithms, but it's hopefully fixable if it really did happen...
So I'm very interested in gaining access to the source of the modified
stream_d.

Enjoy,
Andrea

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-20 23:41   ` Dan Smith
  2012-03-21  1:00     ` Andrea Arcangeli
@ 2012-03-21  2:12     ` Andrea Arcangeli
  2012-03-21  4:01       ` Dan Smith
  2012-03-21  7:12       ` Ingo Molnar
  2012-03-21  7:53     ` Ingo Molnar
  2 siblings, 2 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21  2:12 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Tue, Mar 20, 2012 at 04:41:06PM -0700, Dan Smith wrote:
> AA> Could you try my two trivial benchmarks I sent on lkml too?
> 
> I just got around to running your numa01 test on mainline, autonuma, and
> numasched.  This is on a 2-socket, 6-cores-per-socket,
> 2-threads-per-core machine, with your test configured to run 24
> threads. I also ran Peter's modified stream_d on all three as well, with
> 24 instances in parallel. I know it's already been pointed out that it's
> not the ideal or end-all benchmark, but I figured it was still
> worthwhile to see if the trend continued.
> 
> On your numa01 test:
> 
>   Autonuma is 22% faster than mainline
>   Numasched is 42% faster than mainline
> 
> On Peter's modified stream_d test:
> 
>   Autonuma is 35% *slower* than mainline
>   Numasched is 55% faster than mainline

I repeated the benchmark here after applying all Peter's patches in
the same setup where I run this loop of benchmarks on the AutoNUMA
code 24/7 for the last 3 months. So it was pretty quick to do it for
me.

THP was disabled as the only kernel tuning tweak, to compare apples to
apples (it was already disabled in all my measurements with
numa01/numa02).

        upstream autonuma numasched hard inverse
numa02  64       45       66        42   81
numa01  491      328      607       321  623 -D THREAD_ALLOC
numa01  305      207      338       196  378 -D NO_BIND_FORCE_SAME_NODE

So give me a break... you must have made a real mess in your
benchmarking. numasched is always doing worse than upstream here, in
fact two times massively worse. Almost as bad as the inverse binds.

Maybe you've got more than 16G? I've got 16G and that leaves 1G free
on both nodes at peak load with AutoNUMA. That should be enough for
numasched too (Peter complained to me that I waste 80MB on a 16G
system, so he can't possibly be intentionally wasting 2GB on me).

In any case your results were already _obviously_ broken without me
having to benchmark numasched to verify, because it's impossible
numasched could be 20% faster than autonuma on numa01, because
otherwise it would mean that numasched is like 18% faster than hard
bindings which is mathematically impossible unless your hardware is
not NUMA or superNUMAbroken.

Also note that I even had to "reboot -f" after the first run of
-DNO_BIND_FORCE_SAME_NODE because otherwise it would never end, and it
had already gone 3G into swap when I rebooted. Maybe a memleak from
previous runs? No idea. After the fresh reboot I ran numa01
-DNO_BIND_FORCE_SAME_NODE again and it didn't appear to be in swap. I
just did an "ssh host vmstat 1" to check whether it was swapping again
and never ending, and I killed vmstat after a second; otherwise the
systems are totally undisturbed and there's no cron or anything, so the
results are reliable.

I'll repeat the benchmarks for numasched tomorrow with lockdep
disabled (lockdep on or off won't alter the autonuma runtime) and also
run the last numa01+numa02 test. Then I'll update the pdf and overwrite
it so that pages 3-6 of the pdf will include a 5th column showing the
numasched results.

Note that I didn't alter my .config; I just checked out origin/master,
git am'd the patchset and ran make oldconfig (after fixing one trivial
reject in the syscall registration).

Maybe there's a slight chance I won't have to throw autonuma into the
trash after all considering how staggering the difference is.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21  2:12     ` Andrea Arcangeli
@ 2012-03-21  4:01       ` Dan Smith
  2012-03-21 12:49         ` Andrea Arcangeli
  2012-03-21  7:12       ` Ingo Molnar
  1 sibling, 1 reply; 153+ messages in thread
From: Dan Smith @ 2012-03-21  4:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA>         upstream autonuma numasched hard inverse
AA> numa02  64       45       66        42   81
AA> numa01  491      328      607       321  623 -D THREAD_ALLOC
AA> numa01  305      207      338       196  378 -D NO_BIND_FORCE_SAME_NODE

AA> So give me a break... you must have made a real mess in your
AA> benchmarking.

I'm just running what you posted, dude :)

AA> numasched is always doing worse than upstream here, in fact two
AA> times massively worse. Almost as bad as the inverse binds.

Well, something clearly isn't right, because my numbers don't match
yours at all. This time with THP disabled, and compared to the rest of
the numbers from my previous runs:

            autonuma   HARD   INVERSE   NO_BIND_FORCE_SAME_NODE

numa01      366        335    356       377
numa01THP   388        336    353       399

That shows that autonuma is worse than inverse binds here. If I'm
running your stuff incorrectly, please tell me and I'll correct
it. However, I've now compiled the binary exactly as you asked, with THP
disabled, and am seeing surprisingly consistent results.

AA> Maybe you've got more than 16G? I've got 16G and that leaves 1G free
AA> on both nodes at peak load with AutoNUMA. That should be enough for
AA> numasched too (Peter complained to me that I waste 80MB on a 16G
AA> system, so he can't possibly be intentionally wasting 2GB on me).

Yep, 24G here. Do I need to tweak the test?

AA> In any case your results were already _obviously_ broken without me
AA> having to benchmark numasched to verify, because it's impossible
AA> numasched could be 20% faster than autonuma on numa01, because
AA> otherwise it would mean that numasched is like 18% faster than hard
AA> bindings which is mathematically impossible unless your hardware is
AA> not NUMA or superNUMAbroken.

How do you figure? I didn't post any hard binding numbers. In fact,
numasched performed about equal to hard binding...definitely within your
stated 2% error interval. That was with THP enabled, tomorrow I'll be
glad to run them all again without THP.

-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21  2:12     ` Andrea Arcangeli
  2012-03-21  4:01       ` Dan Smith
@ 2012-03-21  7:12       ` Ingo Molnar
  2012-03-21 12:08         ` Andrea Arcangeli
  1 sibling, 1 reply; 153+ messages in thread
From: Ingo Molnar @ 2012-03-21  7:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> [...]
> 
> So give me a break... you must have made a real mess in your 
> benchmarking. numasched is always doing worse than upstream 
> here, in fact two times massively worse. Almost as bad as the 
> inverse binds.

Andrea, please stop attacking the messenger.

We wanted and needed more testing, and I'm glad that we got it.

Can we please figure out all the details *without* accusing 
anyone of having made a mess? It is quite possible as well that 
*you* made a mess of it somewhere, either at the conceptual 
stage or at the implementational stage, right?

numasched getting close to the hard binding numbers is pretty 
much what I'd expect to see from it: it is an 
automatic/intelligent CPU and memory affinity (and migration) 
method to approximate the results of manual hard binding of 
threads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-20 23:41   ` Dan Smith
  2012-03-21  1:00     ` Andrea Arcangeli
  2012-03-21  2:12     ` Andrea Arcangeli
@ 2012-03-21  7:53     ` Ingo Molnar
  2012-03-21 12:17       ` Andrea Arcangeli
  2 siblings, 1 reply; 153+ messages in thread
From: Ingo Molnar @ 2012-03-21  7:53 UTC (permalink / raw)
  To: Dan Smith
  Cc: Andrea Arcangeli, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm


* Dan Smith <danms@us.ibm.com> wrote:

> On your numa01 test:
> 
>   Autonuma is 22% faster than mainline
>   Numasched is 42% faster than mainline
> 
> On Peter's modified stream_d test:
> 
>   Autonuma is 35% *slower* than mainline
>   Numasched is 55% faster than mainline
> 
> I know that the "real" performance guys here are going to be 
> posting some numbers from more interesting benchmarks soon, 
> but since nobody had answered Andrea's question, I figured I'd 
> do it.

It would also be nice to find and run *real* HPC workloads that 
were not written by Andrea or Peter and which compute something 
non-trivial and real - and then compare the various methods.

Ideally we'd like to measure the two conceptual working set 
corner cases:

  - global working set HPC with a large shared working set:

      - Many types of Monte-Carlo optimizations tend to be
        like this - they have a large shared time series and
        threads compute on those with comparatively little
        private state.

      - 3D rendering with physical modelling: a large, complex
        3D scene set with private worker threads. (much of this 
        tends to be done in GPUs these days though.)

  - private working set HPC with little shared/global working 
    set and lots of per process/thread private memory 
    allocations:

      - Quantum chemistry optimization runs tend to be like this
        with their often gigabytes large matrices.

      - Gas, fluid, solid state and gravitational particle
        simulations - most ab initio methods tend to have very
        little global shared state, each thread iterates its own
        version of the universe.

      - More complex runs of ray tracing as well IIRC.

My impression is that while threading is on the rise due to its 
ease of use, many threaded HPC workloads still fall into the 
second category.

In fact they are often explicitly *turned* into the second 
category at the application level by duplicating shared global 
data explicitly and turning it into per thread local data.

So we need to cover these major HPC usecases - we won't merge 
any of this based on just synthetic benchmarks.

And to default-enable any of this on stock kernels we'd need even 
more testing and widespread, feel-good speedups in almost 
every key Linux workload... I don't see that happening though, 
so the best we can get are probably some easy and flexible knobs 
for HPC.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21  7:12       ` Ingo Molnar
@ 2012-03-21 12:08         ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 12:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Wed, Mar 21, 2012 at 08:12:58AM +0100, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > [...]
> > 
> > So give me a break... you must have made a real mess in your 
> > benchmarking. numasched is always doing worse than upstream 
> > here, in fact two times massively worse. Almost as bad as the 
> > inverse binds.
> 
> Andrea, please stop attacking the messenger.

I am simply informing him. Why shouldn't I inform him that the way he
performed the benchmark wasn't the best way?

I informed him because it wasn't entirely documented how to properly
run my benchmark set. I would have expected people to read the pdf I
already posted 2 months ago, which explains it:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/
http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120126.pdf

Jump to page 7.

Two modes:

numa01 -DNO_BIND_FORCE_SAME_NODE
numa01 -DTHREAD_ALLOC

As a last thing, I recommend Dan now repeat the numasched benchmark
with numa01 built with -DNO_BIND_FORCE_SAME_NODE.

For me, with numasched, neither -DNO_BIND_FORCE_SAME_NODE nor
-DTHREAD_ALLOC nor numa02 performs; in fact numa01 tends to hang and
they never end.

> We wanted and needed more testing, and I'm glad that we got it.

Yes, I also posted the specjbb results and did a kernel build as a
measurement of the worst-case overhead of the numa hinting page faults.

You can see it here:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120321.pdf

> Can we please figure out all the details *without* accusing 
> anyone of having made a mess? It is quite possible as well that 
> *you* made a mess of it somewhere, either at the conceptual 
> stage or at the implementational stage, right?

I didn't make a mess. I also repeated it without lockdep, still the
same thing; in fact now it never ends. I'll have to reboot a few more
times to see if I can get at least some numbers out.

Maybe it takes -DNO_BIND_FORCE_SAME_NODE to show the brokenness; I'll
wait for Dan to repeat the numasched test with either
-DNO_BIND_FORCE_SAME_NODE or -DTHREAD_ALLOC.

Or maybe the larger RAM (24G vs my 16G) could have played a role.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21  7:53     ` Ingo Molnar
@ 2012-03-21 12:17       ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 12:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Wed, Mar 21, 2012 at 08:53:49AM +0100, Ingo Molnar wrote:
> My impression is that while threading is on the rise due to its 
> ease of use, many threaded HPC workloads still fall into the 
> second category.

This is why, after Peter's initial complaints that a threaded
application had to be handled perfectly by AutoNUMA even if it had
more threads than CPUs in a node, I had to take a break and rewrite
part of AutoNUMA to handle this scenario automatically, by introducing
the numa hinting page faults. Before Peter's complaints I only had the
pagetable scanner. So I appreciate his criticism for having convinced
me that AutoNUMA had to have this working immediately.

Perhaps somebody remembers what I said on stage at KVMForum about
this; back then I was planning to automatically handle only processes
that fit in a node. So the talk with Peter was fundamental in adding
one more gear to the design, or I wouldn't be able to compete with his
syscalls.

> In fact they are often explicitly *turned* into the second 
> category at the application level by duplicating shared global 
> data explicitly and turning it into per thread local data.

Per-thread local data is the best case for AutoNUMA. AutoNUMA already
detects and reacts to false sharing, statistically putting all
false-sharing threads in the same node over time. It also cancels
queued pending page migrations, and requires two more consecutive hits
from threads in the same node before re-allowing migration. I did quite
a bit of work to make false sharing handled properly. But the absolute
best case is per-thread local storage (both numa01 -DTHREAD_ALLOC and
numa02; numa02 spans the whole system with the same process, numa01 has
two processes, each fitting in a node, with local thread storage).

> And to default-enable any of this on stock kernels we'd need even 
> more testing and widespread, feel-good speedups in almost 
> every key Linux workload... I don't see that happening though, 
> so the best we can get are probably some easy and flexible knobs 
> for HPC.

This is a very good point. We can merge AutoNUMA in a disabled state.
It won't ever do anything unless explicitly enabled, and even more
importantly, if you disable it (echo 0 >enabled) it will deactivate
completely and everything will settle down as if it had never run; it
will leave zero signs in the VM and scheduler.

There are three gears: if the pagetable scanner never runs (first
gear), the other gears never activate and it is a complete bypass (noop).

There are environments like virt that are quite memory-static and
predictable, so if it is demonstrated to work for them, it would be
really easy for a virt admin to echo 1 >/sys/kernel/mm/autonuma/enabled.
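
To make it concrete, the switch is just this (the only knob assumed
here is the "enabled" file above):

  echo 0 > /sys/kernel/mm/autonuma/enabled   # first gear stops, full bypass
  echo 1 > /sys/kernel/mm/autonuma/enabled   # pagetable scanner runs, other gears follow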

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21  4:01       ` Dan Smith
@ 2012-03-21 12:49         ` Andrea Arcangeli
  2012-03-21 22:05           ` Dan Smith
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 12:49 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

Hi Dan,

On Tue, Mar 20, 2012 at 09:01:58PM -0700, Dan Smith wrote:
> AA>         upstream autonuma numasched hard inverse
> AA> numa02  64       45       66        42   81
> AA> numa01  491      328      607       321  623 -D THREAD_ALLOC
> AA> numa01  305      207      338       196  378 -D NO_BIND_FORCE_SAME_NODE
> 
> AA> So give me a break... you must have made a real mess in your
> AA> benchmarking.
> 
> I'm just running what you posted, dude :)

Apologies if it felt like I was attacking you, that wasn't my
intention, I actually appreciate your effort!

My exclamation was because I was shocked by the staggering difference
in results, nothing else.

Here I still get the results I posted above from numasched. In fact
it's even worse now: even -DTHREAD_ALLOC wouldn't end (and I disabled
lockdep just in case). I'll try rebooting a few more times to see if I
can get some numbers out of it again.

numa02 at least repeats at 66 sec reproducibly with numasched with or
without lockdep.

> AA> numasched is always doing worse than upstream here, in fact two
> AA> times massively worse. Almost as bad as the inverse binds.
> 
> Well, something clearly isn't right, because my numbers don't match
> yours at all. This time with THP disabled, and compared to the rest of
> the numbers from my previous runs:
> 
>             autonuma   HARD   INVERSE   NO_BIND_FORCE_SAME_NODE
> 
> numa01      366        335    356       377
> numa01THP   388        336    353       399
> 
> That shows that autonuma is worse than inverse binds here. If I'm
> running your stuff incorrectly, please tell me and I'll correct
> it. However, I've now compiled the binary exactly as you asked, with THP
> disabled, and am seeing surprisingly consistent results.

HARD and INVERSE should be the min and max you get.

I would ask you before you test AutoNUMA again, or numasched again, to
repeat this "HARD" vs "INVERSE" vs "NO_BIND_FORCE_SAME_MODE"
benchmark and be sure the above numbers are correct for the above
three cases.

On my hardware you can see on page 7 of my pdf what I get:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120321.pdf

numa01 -DHARD_BIND | -DNO_BIND_FORCE_SAME_NODE | -DINVERSE_BIND
             196      305                        378

You can do this benchmark on an upstream 3.3-rc kernel; no patch is
needed to collect the above three numbers.

For me this is always true: HARD_BIND <= NO_BIND_FORCE_SAME_NODE <= INVERSE_BIND.

Checking whether the numa01 HARD_BIND and INVERSE_BIND cases set up
the bindings correctly for your hardware topology may be a good idea too.

If it's not a benchmarking error or a topology error in
HARD_BIND/INVERSE_BIND, it may be that the hardware you're using is very
different. That would be bad news though; I thought you were using the
same common 2-socket hexacore setup that I'm using, and I wouldn't have
expected such a staggering difference in results (even for HARD vs
INVERSE vs NO_BIND_FORCE_SAME_NODE, even before we put autonuma or
numasched into the equation).

> AA> Maybe you've got more than 16G? I've got 16G and that leaves 1G free
> AA> on both nodes at peak load with AutoNUMA. That should be enough for
> AA> numasched too (Peter complained to me that I waste 80MB on a 16G
> AA> system, so he can't possibly be intentionally wasting 2GB on me).
> 
> Yep, 24G here. Do I need to tweak the test?

Well, maybe you could try to repeat it at 16G if you still see
numasched performing great after running it with
-DNO_BIND_FORCE_SAME_NODE.

What -DNO_BIND_FORCE_SAME_NODE is meant to do is start the "NUMA
migration" race from the worst possible condition.

Imagine it like always starting a hiking race from the _bottom_ of the
mountain, instead of randomly from the middle as would happen without
-DNO_BIND_FORCE_SAME_NODE.

> How do you figure? I didn't post any hard binding numbers. In fact,
> numasched performed about equal to hard binding...definitely within your
> stated 2% error interval. That was with THP enabled, tomorrow I'll be
> glad to run them all again without THP.

Again, thanks so much for your effort. I hope others will run more
benchmarks on both solutions too. And I repeat what I said yesterday,
clear and straight: if numasched is shown to have the lead on the
vast majority of workloads, I will be happy to "rm -r autonuma" to
stop wasting time on an inferior dead project, and work on something
else entirely or contribute to numasched in case they need help with
something.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 22:18                 ` Rik van Riel
@ 2012-03-21 16:50                   ` Andrea Arcangeli
  2012-04-02 16:34                   ` Pekka Enberg
  1 sibling, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 16:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Avi Kivity, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner, linux-kernel,
	linux-mm

Hi,

On Tue, Mar 20, 2012 at 06:18:21PM -0400, Rik van Riel wrote:
> Getting high level documentation of the ideas behind both
> of the NUMA implementations could really help smooth out
> the debate.

If I haven't explained it in detail so far, it's been because of lack
of time. Plus nobody has asked specific questions about the internals
yet. And I do my best to avoid writing books through email, even if I
fail at that sometimes.

I wanted to do a few more code changes to make it faster first (like
the badly needed THP migration).

For sure, I'll prepare some initial high-level documentation on the
AutoNUMA algorithms for the MM summit (and I will of course publish the
resulting pdf), in addition to more benchmark data.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21 12:49         ` Andrea Arcangeli
@ 2012-03-21 22:05           ` Dan Smith
  2012-03-21 22:52             ` Andrea Arcangeli
  0 siblings, 1 reply; 153+ messages in thread
From: Dan Smith @ 2012-03-21 22:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA> HARD and INVERSE should be the min and max you get.

AA> I would ask you before you test AutoNUMA again, or numasched again,
AA> to repeat this "HARD" vs "INVERSE" vs "NO_BIND_FORCE_SAME_NODE"
AA> benchmark and be sure the above numbers are correct for the above
AA> three cases.

I've always been running all three, knowing that hard and inverse should
be the bounds. Not knowing (until today) what the third was, I wasn't
sure where it was supposed to lie. However, I've yet to see the spread
that you describe, regardless of the configuration. If that means
something isn't right about my setup, point it out. I've even gone so
far as to print debug from inside numa01 and numa02 to make sure the
-DFOO's are working.

Re-running all the configurations with THP disabled seems to yield very
similar results to what I reported before:

        mainline autonuma numasched hard inverse same_node
numa01  483      366      335       335  483     483

The inverse and same_node numbers above are on mainline, and both are
lower on autonuma and numasched:

           numa01_hard numa01_inverse numa01_same_node
mainline   335         483            483
autonuma   335         356            377
numasched  335         375            491

I also ran your numa02, which seems to correlate to your findings:

        mainline autonuma numasched hard inverse
numa02  54       42       55        37   53

So, I'm not seeing the twofold penalty of running with numasched, and in
fact, it seems to basically do no worse than current mainline (within
the error interval). However, I hope the matching trend somewhat
validates the fact that I'm running your stuff correctly.

I also ran your numa01 with my system clamped to 16G and saw no change
in the positioning of the metrics (i.e. same_node was still higher than
inverse and everything was shifted slightly up linearly).

AA> If it's not a benchmarking error or a topology error in
AA> HARD_BIND/INVERSE_BIND, it may be that the hardware you're using is very
AA> different. That would be bad news though; I thought you were using
AA> the same common 2-socket hexacore setup that I'm using, and I wouldn't
AA> have expected such a staggering difference in results (even for HARD
AA> vs INVERSE vs NO_BIND_FORCE_SAME_NODE, even before we put autonuma
AA> or numasched into the equation).

Well, it's bad in either case, because it means either it's too
temperamental to behave the same on two similar but differently-sized
machines, or that it doesn't properly balance the load for machines with
differing topologies.

I'll be glad to post details of the topology if you tell me specifically
what you want (above and beyond what I've already posted).

AA> I hope others will run more benchmarks too on both solution.

Me too. Unless you have specific things for me to try, it's probably
best to let someone else step in with more interesting and
representative benchmarks, as all of my numbers seem to continue to
point in the same direction...

Thanks!

-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21 22:05           ` Dan Smith
@ 2012-03-21 22:52             ` Andrea Arcangeli
  2012-03-21 23:13               ` Dan Smith
  2012-03-22  0:17               ` Andrea Arcangeli
  0 siblings, 2 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 22:52 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Wed, Mar 21, 2012 at 03:05:30PM -0700, Dan Smith wrote:
> something isn't right about my setup, point it out. I've even gone so
> far as to print debug from inside numa01 and numa02 to make sure the
> -DFOO's are working.

That's a good check indeed.

> Re-running all the configurations with THP disabled seems to yield very
> similar results to what I reported before:
> 
>         mainline autonuma numasched hard inverse same_node
> numa01  483      366      335       335  483     483

I assume you didn't run the numa01_same_node on the "numasched" kernel
here.

Now if you want, I can fix this and boost autonuma for the numa01 run
without any -D parameters.

For the first 5 sec of runtime, I thought I'd be ok leaving the
MPOL_DEFAULT behavior unchanged (where autonuma behaves as a bypass
for those initial seconds).

Now if we're going to measure who places memory better within the
first 10 seconds of startup, I may have to resurrect
autonuma_balance_blind. I disabled that function because I didn't want
blind heuristics that may backfire for some apps.

It's really numa01_same_node that is the interesting benchmark: it is
meant to start from a fixed position, and it's the one that really
exercises the algorithm's ability to converge.

> The inverse and same_node numbers above are on mainline, and both are
> lower on autonuma and numasched:
> 
>            numa01_hard numa01_inverse numa01_same_node
> mainline   335         483            483
> autonuma   335         356            377
> numasched  335         375            491

In these numbers the numa01_inverse column is suspect for
autonuma/numasched.

You should duplicate the numa01_inverse and numa01_hard numbers from
mainline to be sure. Those are a "hardware", not software, measurement.

The exact numbers should look like this:

            numa01_hard numa01_inverse numa01_same_node
 mainline   335         483            483
 autonuma   335         483            377
 numasched  335         483            491



And it pretty much matches what I get. Well, I tried many times again
but I couldn't complete any more numa01 runs with numasched; I was
really lucky last night. It never ends... it becomes incredibly slow and
misbehaves until it's almost unusable and I reboot it. So I stopped
worrying about benchmarking numasched, as it's too unstable for that.

> I also ran your numa02, which seems to correlate to your findings:
> 
>         mainline autonuma numasched hard inverse
> numa02  54       42       55        37   53
> 
> So, I'm not seeing the twofold penalty of running with numasched, and in
> fact, it seems to basically do no worse than current mainline (within
> the error interval). However, I hope the matching trend somewhat
> validates the fact that I'm running your stuff correctly.

I still see it even in your numbers:

numasched 55
mainline  54
autonuma  42
hard      37

numasched 491
mainline  483
autonuma  377
hard      335

Yes I think you're running everything correctly.

I'm only wondering why numa01_inverse is faster than on upstream when
run on autonuma (and numasched), I'll try to reproduce it. I thought I
wasn't messing with anything except MPOL_DEFAULT but I'll have to
re-check that.

> I also ran your numa01 with my system clamped to 16G and saw no change
> in the positioning of the metrics (i.e. same_node was still higher than
> inverse and everything was shifted slightly up linearly).

Yes, it should run fine on all kernels. But for me, running that on
numasched (and only on numasched) never ends.

> Well, it's bad in either case, because it means either it's too
> temperamental to behave the same on two similar but differently-sized
> machines, or that it doesn't properly balance the load for machines with
> differing topologies.

Your three numbers of mainline looked ok, it's still strange that
numa01_same_node is identical to numa01_inverse_bind though. It
shouldn't. same_node uses 1 numa node. inverse uses both nodes but
always with remote memory. It's surprising to see an identical value
there.

> I'll be glad to post details of the topology if you tell me specifically
> what you want (above and beyond what I've already posted).

It should look like this for my -DHARD_BIND and -DINVERSE_BIND to
work as intended:

numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23

If your topology is different from the above, then updates to numa*.c
are required.
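
As an external sanity check of the NUMA hardware itself, independent of
numa*.c, plain numactl can compare local vs remote placement of the
default (no -D) build; on a healthy 2-node box the second run should be
clearly slower:

  numactl --cpunodebind=0 --membind=0 ./numa01   # CPUs and memory on node 0
  numactl --cpunodebind=0 --membind=1 ./numa01   # CPUs on node 0, memory on node 1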

> Me too. Unless you have specific things for me to try, it's probably
> best to let someone else step in with more interesting and
> representative benchmarks, as all of my numbers seem to continue to
> point in the same direction...

It's all good! Thanks for the help.

If you want to keep benchmarking, I'm about to upload the autonuma-dev
branch (same git tree) with alpha8, based on the post-3.3 scheduler
codebase and with some more fixes.

Andrea

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
                   ` (27 preceding siblings ...)
  2012-03-19  9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
@ 2012-03-21 22:53 ` Nish Aravamudan
  2012-03-22  9:45   ` Peter Zijlstra
  28 siblings, 1 reply; 153+ messages in thread
From: Nish Aravamudan @ 2012-03-21 22:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

Hi Peter,

Sorry if this has already been reported, but

On Fri, Mar 16, 2012 at 7:40 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Hi All,
>
> While the current scheduler has knowledge of the machine topology, including
> NUMA (although there's room for improvement there as well [1]), it is
> completely insensitive to which nodes a task's memory actually is on.
>
> Current upstream task memory allocation prefers to use the node the task is
> currently running on (unless explicitly told otherwise, see
> mbind()/set_mempolicy()), and with the scheduler free to move the task about at
> will, the task's memory can end up being spread all over the machine's nodes.
>
> While the scheduler does a reasonable job of keeping short running tasks on a
> single node (by means of simply not doing the cross-node migration very often),
> it completely blows for long-running processes with a large memory footprint.
>
> This patch-set aims at improving this situation. It does so by assigning a
> preferred, or home, node to every process/thread_group. Memory allocation is
> then directed by this preference instead of the node the task might actually be
> running on momentarily. The load-balancer is also modified to prefer running
> the task on its home-node, although not at the cost of letting CPUs go idle or
> at the cost of execution fairness.
<snip>

>  [24/26] mm, mpol: Implement numa_group RSS accounting

I was going to try and test this on power, but it fails to build:

  mm/filemap_xip.c: In function ‘__xip_unmap’:
  mm/filemap_xip.c:199: error: implicit declaration of function
‘numa_add_vma_counter’

and I think


>  [26/26] sched, numa: A few debug bits

introduced a new warning:

  kernel/sched/numa.c: In function ‘process_cpu_runtime’:
  kernel/sched/numa.c:210: warning: format ‘%lu’ expects type ‘long
unsigned int’, but argument 3 has type ‘u64’
  kernel/sched/numa.c:210: warning: format ‘%lu’ expects type ‘long
unsigned int’, but argument 4 has type ‘u64’

Thanks,
Nish

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21 22:52             ` Andrea Arcangeli
@ 2012-03-21 23:13               ` Dan Smith
  2012-03-21 23:41                 ` Andrea Arcangeli
  2012-03-22  0:17               ` Andrea Arcangeli
  1 sibling, 1 reply; 153+ messages in thread
From: Dan Smith @ 2012-03-21 23:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA> available: 2 nodes (0-1)
AA> node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
AA> node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 12276 MB
node 0 free: 11769 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 12288 MB
node 1 free: 11727 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

Same enough?

-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21 23:13               ` Dan Smith
@ 2012-03-21 23:41                 ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-21 23:41 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

Hi Dan,

On Wed, Mar 21, 2012 at 04:13:45PM -0700, Dan Smith wrote:
> AA> available: 2 nodes (0-1)
> AA> node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
> AA> node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
> node 0 size: 12276 MB
> node 0 free: 11769 MB
> node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
> node 1 size: 12288 MB
> node 1 free: 11727 MB
> node distances:
> node   0   1 
>   0:  10  21 
>   1:  21  10 
> 
> Same enough?

Yes. Just more RAM than me.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-21 22:52             ` Andrea Arcangeli
  2012-03-21 23:13               ` Dan Smith
@ 2012-03-22  0:17               ` Andrea Arcangeli
  2012-03-22 13:58                 ` Dan Smith
  1 sibling, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-22  0:17 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Wed, Mar 21, 2012 at 11:52:42PM +0100, Andrea Arcangeli wrote:
> Your three numbers of mainline looked ok, it's still strange that
> numa01_same_node is identical to numa01_inverse_bind though. It
> shouldn't. same_node uses 1 numa node. inverse uses both nodes but

The only reasonable explanation I can imagine for the weird stuff
going on with "numa01_inverse" is that maybe it was compiled without
-DHARD_BIND? I forgot to specify -DINVERSE_BIND is a noop unless
-DHARD_BIND is specified too at the same time. -DINVERSE_BIND alone
results in the default build without -D parameters.
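
Concretely, the three builds I had in mind look something like this
(same compile line as in my earlier mail; the output names are just
illustrative):

  gcc -DHARD_BIND -O2 -o numa01_hard kernel/proggy/numa01.c -lnuma -lpthread
  gcc -DHARD_BIND -DINVERSE_BIND -O2 -o numa01_inverse kernel/proggy/numa01.c -lnuma -lpthread
  gcc -DNO_BIND_FORCE_SAME_NODE -O2 -o numa01_same_node kernel/proggy/numa01.c -lnuma -lpthread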

Now, AutoNUMA has a bug and "optimizes" a real inverse bind too; I need to fix that.

In the meantime this is possible:

echo 0 >/sys/kernel/mm/autonuma/enabled
run numa01_inverse
echo 1 >/sys/kernel/mm/autonuma/enabled

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-21 22:53 ` Nish Aravamudan
@ 2012-03-22  9:45   ` Peter Zijlstra
  2012-03-22 10:34     ` Ingo Molnar
  2012-03-24  1:41     ` Nish Aravamudan
  0 siblings, 2 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-22  9:45 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm


> I was going to try and test this on power, but it fails to build:
> 
>   mm/filemap_xip.c: In function ‘__xip_unmap’:
>   mm/filemap_xip.c:199: error: implicit declaration of function
> ‘numa_add_vma_counter’

Add: 

#include <linux/mempolicy.h>

to that file and it should build.

> >  [26/26] sched, numa: A few debug bits
> 
> introduced a new warning:
> 
>   kernel/sched/numa.c: In function ‘process_cpu_runtime’:
>   kernel/sched/numa.c:210: warning: format ‘%lu’ expects type ‘long
> unsigned int’, but argument 3 has type ‘u64’
>   kernel/sched/numa.c:210: warning: format ‘%lu’ expects type ‘long
> unsigned int’, but argument 4 has type ‘u64’

Yeah, that's a known trainwreck: some archs define u64 as unsigned long,
others as unsigned long long, so whichever you write, %lu or %llu, is
wrong somewhere, and I can't be arsed to add an explicit cast since it's
all debug bits that won't ever make it in anyway.
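
For reference, the usual way out is to cast to unsigned long long and
always use %llu; a sketch with placeholder variable names, not the
actual debug code:

  printk(KERN_DEBUG "runtime %llu vruntime %llu\n",
         (unsigned long long)runtime,
         (unsigned long long)vruntime);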

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-22  9:45   ` Peter Zijlstra
@ 2012-03-22 10:34     ` Ingo Molnar
  2012-03-24  1:41     ` Nish Aravamudan
  1 sibling, 0 replies; 153+ messages in thread
From: Ingo Molnar @ 2012-03-22 10:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nish Aravamudan, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, linux-kernel, linux-mm


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> 
> > I was going to try and test this on power, but it fails to build:
> > 
> >   mm/filemap_xip.c: In function ‘__xip_unmap’:
> >   mm/filemap_xip.c:199: error: implicit declaration of function
> > ‘numa_add_vma_counter’
> 
> Add: 
> 
> #include <linux/mempolicy.h>
> 
> to that file and it should build.

I could stick your patches into tip:sched/numa (rebasing branch 
for now) to make it pullable and testable?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22  0:17               ` Andrea Arcangeli
@ 2012-03-22 13:58                 ` Dan Smith
  2012-03-22 14:27                   ` Andrea Arcangeli
  0 siblings, 1 reply; 153+ messages in thread
From: Dan Smith @ 2012-03-22 13:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA> The only reasonable explanation I can imagine for the weird stuff
AA> going on with "numa01_inverse" is that maybe it was compiled without
AA> -DHARD_BIND? I forgot to specify -DINVERSE_BIND is a noop unless
AA> -DHARD_BIND is specified too at the same time. -DINVERSE_BIND alone
AA> results in the default build without -D parameters.

Ah, yeah, that's probably it. Later I'll try re-running some of the
cases to verify.

-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 13:58                 ` Dan Smith
@ 2012-03-22 14:27                   ` Andrea Arcangeli
  2012-03-22 18:49                     ` Andrea Arcangeli
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-22 14:27 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2341 bytes --]

On Thu, Mar 22, 2012 at 06:58:29AM -0700, Dan Smith wrote:
> AA> The only reasonable explanation I can imagine for the weird stuff
> AA> going on with "numa01_inverse" is that maybe it was compiled without
> AA> -DHARD_BIND? I forgot to specify -DINVERSE_BIND is a noop unless
> AA> -DHARD_BIND is specified too at the same time. -DINVERSE_BIND alone
> AA> results in the default build without -D parameters.
> 
> Ah, yeah, that's probably it. Later I'll try re-running some of the
> cases to verify.

Ok! If you re-run on autonuma, please also check out the autonuma
branch again, or autonuma-dev, because I had a bug where autonuma would
optimize away the hard binds too :). Now they're fully obeyed (I already
honored vma_policy(vma), but I forgot to add a check that
current->mempolicy is null).
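
In other words, the missing check was along these lines (just a sketch
of the idea, not the actual diff):

  /* don't auto-migrate when an explicit policy is in place */
  if (current->mempolicy || vma_policy(vma))
          return;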

BTW, in the meantime I have some virt benchmarks. Attached is a vnc
screenshot: the first run is with autonuma off, the second and third
runs are with autonuma on. full_scan is incremented every 10 sec with
autonuma on, so the scanning overhead is being measured. With autonuma
off the wrong node is picked roughly 50% of the time, and you can see
the difference in elapsed time when that happens. AutoNUMA gets it
right 100% of the time thanks to autonuma_balance (always "16 sec" vs
"16 sec or 26 sec" is a great improvement).

I also tried to measure a kernel build in a VM that fits in one node
(in CPU and RAM), but I got badly bitten by HT effects: I should
basically notice in autonuma_balance that it's better to spread the
load to remote nodes if the remote nodes have full cores idle while
the local node has only HT siblings idle. What a mess. Anyway, the
current code would perform optimally if all nodes are busy and there
are no idle cores (or only idle siblings). I guess I'll leave the HT
optimizations for later. I should probably measure this again with HT off.

Running the kernel build in a VM that spans the whole system won't be
ok until I run autonuma in the guest too, but to do that I need a
vtopology and I haven't figured out yet how to tell qemu to set one up.
Without autonuma in the guest as well, the guest scheduler will freely
move guest gcc tasks from vCPU0 to vCPU1, and maybe those two are on
threads that live in different nodes on the host, potentially
triggering spurious memory migrations because the guest scheduler isn't
aware of the host topology.
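
If I recall correctly, qemu can already expose a guest NUMA topology
with its -numa option; something along these lines should work (the
sizes and cpu ranges here are just illustrative, and the exact syntax
may differ between qemu versions):

  qemu-kvm -smp 12 \
       -numa node,cpus=0-5,mem=6144 \
       -numa node,cpus=6-11,mem=6144 \
       ...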

[-- Attachment #2: snapshot11.png --]
[-- Type: image/png, Size: 29424 bytes --]

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 14:27                   ` Andrea Arcangeli
@ 2012-03-22 18:49                     ` Andrea Arcangeli
  2012-03-22 18:56                       ` Dan Smith
  0 siblings, 1 reply; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-22 18:49 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

Hi Dan,

On Thu, Mar 22, 2012 at 03:27:35PM +0100, Andrea Arcangeli wrote:
> current code would optimally perform, if all nodes are busy and there
> aren't idle cores (or only idle siblings). I guess I'll leave the HT
> optimizations for later. I probably shall measure this again with HT off.

I added the latest virt measurements with KVM for the kernel build and
memhog. I also measured how much I'd save by increasing the
knuma_scand pass interval (scan_sleep_pass_millisecs) from the 10sec
default (value 5000) to 30sec. I also tried 1min but it was within the
error range of 30sec. 10sec -> 30sec is also almost within the error
range, showing the cost is really tiny. Luckily the numbers were
totally stable when running a -j16 loop on both VMs (each VM had 12
vcpus on a host with 24 CPUs), and the error was less than 1sec for
each kernel build (on tmpfs obviously, with a totally stripped down
userland in both guest and host).
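
The build loop itself is nothing fancy, roughly this, with the tree
sitting on tmpfs (paths are illustrative):

  cd /tmp/linux
  while :; do make clean >/dev/null; time make -j16 >/dev/null; done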

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120322.pdf

slide 11 and 12.

This is with THP on; with THP off things would likely be different, but
hey, THP off is like 20% slower or more on a kernel build in a guest in
the first place.

I'm satisfied with the benchmark results so far and more will come
soon, but now it's time to go back coding and add THP native
migration. That will benefit everyone, from cpuset in userland to
numa/sched.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 18:49                     ` Andrea Arcangeli
@ 2012-03-22 18:56                       ` Dan Smith
  2012-03-22 19:11                         ` Andrea Arcangeli
                                           ` (2 more replies)
  0 siblings, 3 replies; 153+ messages in thread
From: Dan Smith @ 2012-03-22 18:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

AA> but now it's time to go back coding and add THP native
AA> migration. That will benefit everyone, from cpuset in userland to
AA> numa/sched.

I dunno about everyone else, but I think the thing I'd like to see most
(other than more interesting benchmarks) is a broken out and documented
set of patches instead of the monolithic commit you have now. I know you
weren't probably planning to do that until numasched came along, but it
sure would help me digest the differences in the two approaches.

-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 18:56                       ` Dan Smith
@ 2012-03-22 19:11                         ` Andrea Arcangeli
  2012-03-23 14:15                         ` Andrew Theurer
  2012-03-25 13:30                         ` Andrea Arcangeli
  2 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-22 19:11 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

On Thu, Mar 22, 2012 at 11:56:37AM -0700, Dan Smith wrote:
> AA> but now it's time to go back coding and add THP native
> AA> migration. That will benefit everyone, from cpuset in userland to
> AA> numa/sched.
> 
> I dunno about everyone else, but I think the thing I'd like to see most
> (other than more interesting benchmarks) is a broken out and documented
> set of patches instead of the monolithic commit you have now. I know you
> weren't probably planning to do that until numasched came along, but it
> sure would help me digest the differences in the two approaches.

I uploaded AutoNUMA publicly to the autonuma branch of my git tree the
day before numa/sched was posted, to allow people to start testing it.
I didn't announce it yet because I wasn't sure it was worth posting
until I had the time to split the patches. Then I changed my mind and
posted it as the monolith that it was.

I think I'll try to attack the native THP migration; if it looks like
it will take more than one or two days, I'll abort it and do the
patch-splitting/cleanup/documentation work first so you can review
the code better ASAP.

The advantage of doing it sooner is that it gets more of the testing
that is going on right now from you and everyone else; plus I dislike
leaving that important feature missing while many benchmarks are being
run, as it's certainly going to be measurable when the workload
changes massively and lots of hugepages are moved around by
knuma_migrated. Boosting khugepaged tends to hide it for now though
(as shown by specjbb).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
@ 2012-03-23 11:50   ` Mel Gorman
  2012-07-06 16:38     ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Mel Gorman @ 2012-03-23 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, Mar 16, 2012 at 03:40:31PM +0100, Peter Zijlstra wrote:
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> 
> This patch adds another mbind() flag to request "lazy migration".
> The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> pages are simply unmapped from the calling task's page table ['_MOVE]
> or from all referencing page tables [_MOVE_ALL].  Anon pages will first
> be added to the swap [or migration?] cache, if necessary.  The pages
> will be migrated in the fault path on "first touch", if the policy
> dictates at that time.
> 
> <SNIP>
>
> @@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
>  }
>  
>  /*
> + * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
> + * necessary.  Migration will occur, if policy dictates, when a task faults
> + * an unmapped page back into its page table--i.e., on "first touch" after
> + * unmapping.  Note that migrate-on-fault only migrates pages whose mapping
> + * [e.g., file system] supplies a migratepage op, so we skip pages that
> + * wouldn't migrate on fault.
> + *
> + * Pages are placed back on the lru whether or not they were successfully
> + * unmapped.  Like migrate_pages().
> + *
> + * Unlike migrate_pages(), this function is only called in the context of
> + * a task that is unmapping its own pages while holding its map semaphore
> + * for write.
> + */
> +int migrate_pages_unmap_only(struct list_head *pagelist)

I'm not properly reviewing these patches at the moment but am taking a
quick look as I play some catch up on linux-mm.

I think it's worth pointing out that this will potentially confuse
reclaim. Let's say a process is being migrated to another node and it
gets unmapped like this; then some heuristics will change.

1. If the page was referenced prior to the unmapping then it should be
   activated if the page reached the end of the LRU due to the checks
   in page_check_references(). If the process has been unmapped for
   migrate-on-fault, the pages will instead be reclaimed.

2. The heuristic that applies pressure to slab pages if pages are mapped
   is changed. Prior to migrate-on-fault sc->nr_scanned is incremented
   for mapped pages to increase the number of slab pages scanned to
   avoid swapping. During migrate-on-fault, this pressure is relieved

3. zone_reclaim_mode in default mode will reclaim pages it would
   previously have skipped over. It potentially will call shrink_zone more
   for the local node than falling back to other nodes because it thinks
   most pages are unmapped. This could lead to some thrashing.

It may not even be a major problem but it's worth thinking about. If it
is a problem, it will be necessary to account for migrate-on-fault pages
similar to mapped pages during reclaim.
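
To make that last point concrete, a minimal sketch of what such
accounting could look like -- purely illustrative, and
PageMigrateOnFault() is a hypothetical test that does not exist in this
patch set:

#include <linux/mm.h>

/*
 * Hypothetical helper: treat pages that were unmapped only for lazy
 * migration the same way as mapped pages when reclaim decides how much
 * pressure to apply.  PageMigrateOnFault() is an assumed page flag
 * test, not something these patches provide.
 */
static inline bool page_counts_as_mapped(struct page *page)
{
	return page_mapped(page) || PageMigrateOnFault(page);
}

The idea would be for the reclaim heuristics above to consult something
like this where they currently key off page_mapped() alone.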

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 18:56                       ` Dan Smith
  2012-03-22 19:11                         ` Andrea Arcangeli
@ 2012-03-23 14:15                         ` Andrew Theurer
  2012-03-23 16:01                           ` Andrea Arcangeli
  2012-03-25 13:30                         ` Andrea Arcangeli
  2 siblings, 1 reply; 153+ messages in thread
From: Andrew Theurer @ 2012-03-23 14:15 UTC (permalink / raw)
  To: Dan Smith
  Cc: Andrea Arcangeli, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On 03/22/2012 01:56 PM, Dan Smith wrote:
> AA>  but now it's time to go back coding and add THP native
> AA>  migration. That will benefit everyone, from cpuset in userland to
> AA>  numa/sched.
>
> I dunno about everyone else, but I think the thing I'd like to see most
> (other than more interesting benchmarks)

We are working on the "more interesting benchmarks", starting with KVM 
workloads.  However, I must warn you all, more interesting = a lot more 
time to run.  These are a lot more complex in that they have real I/O, 
and they can be a lot more challenging because there are response time 
requirements (so fairness is an absolute requirement).  We are getting a 
baseline right now and re-running with our user-space VM-to-numa-node 
placement program, which in the past achieved manual binding performance 
or just slightly lower.  We can then compare to these two solutions.  If 
there's something specific to collect (perhaps you have a lot of stats 
or data in debugfs, etc) please let me know.

-Andrew Theurer
>   is a broken out and documented
> set of patches instead of the monolithic commit you have now. I know you
> weren't probably planning to do that until numasched came along, but it
> sure would help me digest the differences in the two approaches.
>


^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-23 14:15                         ` Andrew Theurer
@ 2012-03-23 16:01                           ` Andrea Arcangeli
  0 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-23 16:01 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, Mar 23, 2012 at 09:15:22AM -0500, Andrew Theurer wrote:
> We are working on the "more interesting benchmarks", starting with KVM 
> workloads.  However, I must warn you all, more interesting = a lot more 
> time to run.  These are a lot more complex in that they have real I/O, 
> and they can be a lot more challenging because there are response time 
> requirements (so fairness is an absolute requirement).  We are getting a 

Awesome effort!

The reason I intended to get THP native migration in ASAP was exactly to
avoid having to repeat the complex "long" benchmarks later in order to
get a more reliable figure of what is possible to achieve in the long term.

For both KVM and even for AutoNUMA internals, it's very beneficial to
run with THP on, so please keep it on at all times. Very important: you
should also make sure /sys/kernel/debug/kvm/largepages is increasing
along with `grep Anon /proc/meminfo` while KVM allocates anonymous
memory (the official qemu binary is still not patched to align the
guest physical address space, I'm afraid).

I changed plans and I'm doing the cleanups and documentation first,
because that seems to be the bigger obstacle now, as Dan also pointed
out. I'll submit a more documented and split-up version of AutoNUMA
(autonuma-dev branch) by early next week.

> baseline right now and re-running with our user-space VM-to-numa-node 
> placement program, which in the past achieved manual binding performance 
> or just slightly lower.  We can then compare to these two solutions.  If 
> there's something specific to collect (perhaps you have a lot of stats 
> or data in debugfs, etc) please let me know.

If you get bad performance you can log debug info with:

echo 1 >/sys/kernel/mm/autonuma/debug

Other than that, the only tweak I would suggest for virt usage is:

echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs

and if you notice that the THP numbers in `grep Anon /proc/meminfo` are
too low during the benchmark, you can use:

echo 10 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

With current autonuma and autonuma-dev branches, I already set the
latter to 100 on NUMA hardware (upstream default was an unconditional
10000), but 10 would make khugepaged even faster at rebuilding
THP. I'm not sure whether going as low as 10 is needed, but I mention it
because 10 was used during specjbb and worked great. I would try 100
first and lower it to 10 only as a last resort. The workload changes for
virt should not be as fast as with normal host workloads, so a value of
100 should be enough. Once we get THP native migration, this value can
return to 10000.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-22  9:45   ` Peter Zijlstra
  2012-03-22 10:34     ` Ingo Molnar
@ 2012-03-24  1:41     ` Nish Aravamudan
  2012-03-26 11:42       ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Nish Aravamudan @ 2012-03-24  1:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

Hi Peter,

On Thu, Mar 22, 2012 at 2:45 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> I was going to try and test this on power, but it fails to build:
>>
>>   mm/filemap_xip.c: In function ‘__xip_unmap’:
>>   mm/filemap_xip.c:199: error: implicit declaration of function
>> ‘numa_add_vma_counter’
>
> Add:
>
> #include <linux/mempolicy.h>
>
> to that file and it should build.

Thanks, I was able to get it to build, but it panic'd on one of my
machines. Full dmesg is below...

[2012-03-20 00:46:24]	boot: autotest
[2012-03-20 00:46:24]	Please wait, loading kernel...
[2012-03-20 00:46:28]	   Elf64 kernel loaded...
[2012-03-20 00:46:28]	Loading ramdisk...
[2012-03-20 00:46:32]	ramdisk loaded at 03180000, size: 11950 Kbytes
[2012-03-20 00:46:32]	OF stdout device is: /vdevice/vty@30000000
[2012-03-20 00:46:32]	Preparing to boot Linux version 3.3.0-rc7
(root@fbird-lp8.austin.ibm.com) (gcc version 4.6.2 20111027 (Red Hat
4.6.2-1) (GCC) ) #1 SMP Fri Mar 23 20:15:37 EDT 2012
[2012-03-20 00:46:32]	Detected machine type: 0000000000000101
[2012-03-20 00:46:32]	Max number of cores passed to firmware: 8 (NR_CPUS = 32)
[2012-03-20 00:46:32]	Calling ibm,client-architecture-support... done
[2012-03-20 00:46:32]	command line:
root=/dev/mapper/vg_fbirdlp8-lv_root console=hvc0 IDENT=1332548257
[2012-03-20 00:46:32]	memory layout at init:
[2012-03-20 00:46:32]	  memory_limit : 0000000000000000 (16 MB aligned)
[2012-03-20 00:46:32]	  alloc_bottom : 0000000003d2c000
[2012-03-20 00:46:32]	  alloc_top    : 0000000010000000
[2012-03-20 00:46:32]	  alloc_top_hi : 0000000010000000
[2012-03-20 00:46:32]	  rmo_top      : 0000000010000000
[2012-03-20 00:46:32]	  ram_top      : 0000000010000000
[2012-03-20 00:46:32]	instantiating rtas at 0x000000000ed90000... done
[2012-03-20 00:46:32]	Querying for OPAL presence... not there.
[2012-03-20 00:46:32]	boot cpu hw idx 0
[2012-03-20 00:46:32]	starting cpu hw idx 4... done
[2012-03-20 00:46:32]	starting cpu hw idx 8... done
[2012-03-20 00:46:33]	starting cpu hw idx 12... done
[2012-03-20 00:46:33]	copying OF device tree...
[2012-03-20 00:46:33]	Building dt strings...
[2012-03-20 00:46:33]	Building dt structure...
[2012-03-20 00:46:33]	Device tree strings 0x0000000003e2d000 ->
0x0000000003e2e5af
[2012-03-20 00:46:33]	Device tree struct  0x0000000003e2f000 ->
0x0000000003e44000
[2012-03-20 00:46:33]	Calling quiesce...
[2012-03-20 00:46:33]	returning from prom_init
[2012-03-20 00:46:33]	Using pSeries machine description
[2012-03-20 00:46:33]	Using 1TB segments
[2012-03-20 00:46:33]	Found initrd at 0xc000000003180000:0xc000000003d2b800
[2012-03-20 00:46:33]	bootconsole [udbg0] enabled
[2012-03-20 00:46:33]	Partition configured for 16 cpus.
[2012-03-20 00:46:33]	CPU maps initialized for 4 threads per core
[2012-03-20 00:46:33]	Starting Linux PPC64 #1 SMP Fri Mar 23 20:15:37 EDT 2012
[2012-03-20 00:46:33]	-----------------------------------------------------
[2012-03-20 00:46:33]	ppc64_pft_size                = 0x1c
[2012-03-20 00:46:33]	physicalMemorySize            = 0x500000000
[2012-03-20 00:46:33]	htab_hash_mask                = 0x1fffff
[2012-03-20 00:46:33]	-----------------------------------------------------
[2012-03-20 00:46:33]	Initializing cgroup subsys cpuset
[2012-03-20 00:46:33]	Linux version 3.3.0-rc7
(root@fbird-lp8.austin.ibm.com) (gcc version 4.6.2 20111027 (Red Hat
4.6.2-1) (GCC) ) #1 SMP Fri Mar 23 20:15:37 EDT 2012
[2012-03-20 00:46:33]	[boot]0012 Setup Arch
[2012-03-20 00:46:33]	EEH: No capable adapters found
[2012-03-20 00:46:33]	PPC64 nvram contains 15360 bytes
[2012-03-20 00:46:33]	Zone PFN ranges:
[2012-03-20 00:46:33]	  DMA      0x00000000 -> 0x00500000
[2012-03-20 00:46:33]	  Normal   empty
[2012-03-20 00:46:33]	Movable zone start PFN for each node
[2012-03-20 00:46:33]	Early memory PFN ranges
[2012-03-20 00:46:33]	    1: 0x00000000 -> 0x00140000
[2012-03-20 00:46:33]	    3: 0x00140000 -> 0x00500000
[2012-03-20 00:46:33]	Could not find start_pfn for node 0
[2012-03-20 00:46:33]	[boot]0015 Setup Done
[2012-03-20 00:46:33]	PERCPU: Embedded 13 pages/cpu @c000000000f00000
s20864 r0 d32384 u262144
[2012-03-20 00:46:33]	Built 3 zonelists in Node order, mobility
grouping on.  Total pages: 5171200
[2012-03-20 00:46:33]	Policy zone: DMA
[2012-03-20 00:46:33]	Kernel command line:
root=/dev/mapper/vg_fbirdlp8-lv_root console=hvc0 IDENT=1332548257
[2012-03-20 00:46:33]	PID hash table entries: 4096 (order: 3, 32768 bytes)
[2012-03-20 00:46:33]	freeing bootmem node 1
[2012-03-20 00:46:33]	freeing bootmem node 3
[2012-03-20 00:46:33]	Memory: 20628296k/20971520k available (12100k
kernel code, 343224k reserved, 1324k data, 948k bss, 468k init)
[2012-03-20 00:46:33]	SLUB: Genslabs=15, HWalign=128, Order=0-3,
MinObjects=0, CPUs=16, Nodes=256
[2012-03-20 00:46:33]	Hierarchical RCU implementation.
[2012-03-20 00:46:33]	NR_IRQS:512
[2012-03-20 00:46:33]	clocksource: timebase mult[1f40000] shift[24] registered
[2012-03-20 00:46:33]	Console: colour dummy device 80x25
[2012-03-20 00:46:33]	console [hvc0] enabled, bootconsole disabled
[2012-03-20 00:46:33]	console [hvc0] enabled, bootconsole disabled
[2012-03-20 00:46:33]	pid_max: default: 32768 minimum: 301
[2012-03-20 00:46:33]	Dentry cache hash table entries: 4194304 (order:
13, 33554432 bytes)
[2012-03-20 00:46:33]	Inode-cache hash table entries: 2097152 (order:
12, 16777216 bytes)
[2012-03-20 00:46:33]	Mount-cache hash table entries: 256
[2012-03-20 00:46:33]	POWER7 performance monitor hardware support registered
[2012-03-20 00:46:33]	Unable to handle kernel paging request for data
at address 0x00001688
[2012-03-20 00:46:33]	Faulting instruction address: 0xc000000000168338
[2012-03-20 00:46:33]	Oops: Kernel access of bad area, sig: 11 [#1]
[2012-03-20 00:46:33]	SMP NR_CPUS=32 NUMA pSeries
[2012-03-20 00:46:33]	Modules linked in:
[2012-03-20 00:46:33]	NIP: c000000000168338 LR: c0000000001b523c CTR:
0000000000000000
[2012-03-20 00:46:33]	REGS: c00000013d887700 TRAP: 0300   Not tainted
(3.3.0-rc7)
[2012-03-20 00:46:33]	MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR:
24004022  XER: 00000008
[2012-03-20 00:46:33]	CFAR: 0000000000005374
[2012-03-20 00:46:33]	DAR: 0000000000001688, DSISR: 40000000
[2012-03-20 00:46:33]	TASK = c00000013d888000[1] 'swapper/0' THREAD:
c00000013d884000 CPU: 0
[2012-03-20 00:46:33]	GPR00: 0000000000000000 c00000013d887980
c000000000ce7990 00000000000012d0
[2012-03-20 00:46:33]	GPR04: 0000000000000000 0000000000001680
0000000000000000 0003005500000001
[2012-03-20 00:46:33]	GPR08: 0000000000000001 0000000000000000
c000000000d25000 0000000000000010
[2012-03-20 00:46:33]	GPR12: 0000000044004024 c00000000fffa000
0000000000000000 0000000000000060
[2012-03-20 00:46:33]	GPR16: c000000000a69040 c000000000a66828
0000000002e317f0 0000000001a3f930
[2012-03-20 00:46:33]	GPR20: 0000000000000000 0000000000001680
0000000000000001 0000000000210d00
[2012-03-20 00:46:33]	GPR24: c000000000d193a0 0000000000000000
0000000000001680 00000000000012d0
[2012-03-20 00:46:33]	GPR28: 0000000000000000 0000000000000000
c000000000c5d6e8 c00000013e009200
[2012-03-20 00:46:33]	NIP [c000000000168338] .__alloc_pages_nodemask+0xb8/0x860
[2012-03-20 00:46:33]	LR [c0000000001b523c] .new_slab+0xcc/0x3d0
[2012-03-20 00:46:33]	Call Trace:
[2012-03-20 00:46:33]	[c00000013d887980] [c0000000001683dc]
.__alloc_pages_nodemask+0x15c/0x860 (unreliable)
[2012-03-20 00:46:33]	[c00000013d887b00] [c0000000001b523c] .new_slab+0xcc/0x3d0
[2012-03-20 00:46:33]	[c00000013d887bb0] [c0000000007fc780]
.__slab_alloc+0x388/0x4e0
[2012-03-20 00:46:33]	[c00000013d887cd0] [c0000000001b5af8]
.kmem_cache_alloc_node_trace+0x98/0x230
[2012-03-20 00:46:33]	[c00000013d887d90] [c000000000b83ed0]
.numa_init+0x90/0x1d0
[2012-03-20 00:46:33]	[c00000013d887e20] [c00000000000ab60]
.do_one_initcall+0x60/0x1e0
[2012-03-20 00:46:33]	[c00000013d887ee0] [c000000000b5cad4]
.kernel_init+0xf0/0x1e0
[2012-03-20 00:46:33]	[c00000013d887f90] [c000000000021e14]
.kernel_thread+0x54/0x70
[2012-03-20 00:46:33]	Instruction dump:
[2012-03-20 00:46:33]	0b000000 eb1e8000 3ba00000 801800a8 2f800000
409e001c 7860efe3 38000000
[2012-03-20 00:46:33]	41820008 38000002 7b7d6fe2 7fbd0378 <e81a0008>
827800a4 3be00000 2fa00000
[2012-03-20 00:46:33]	---[ end trace 31fd0ba7d8756001 ]---
[2012-03-20 00:46:33]	
[2012-03-20 00:46:35]	swapper/0 used greatest stack depth: 10832 bytes left
[2012-03-20 00:46:35]	Kernel panic - not syncing: Attempted to kill init!
[2012-03-20 00:46:48]	Rebooting in 10 seconds..

I can debug more next week; let me know if there is something specific
you want me to look at.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC] AutoNUMA alpha6
  2012-03-22 18:56                       ` Dan Smith
  2012-03-22 19:11                         ` Andrea Arcangeli
  2012-03-23 14:15                         ` Andrew Theurer
@ 2012-03-25 13:30                         ` Andrea Arcangeli
  2 siblings, 0 replies; 153+ messages in thread
From: Andrea Arcangeli @ 2012-03-25 13:30 UTC (permalink / raw)
  To: Dan Smith
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm,
	Hillf Danton

On Thu, Mar 22, 2012 at 11:56:37AM -0700, Dan Smith wrote:
> I dunno about everyone else, but I think the thing I'd like to see most
> (other than more interesting benchmarks) is a broken out and documented
> set of patches instead of the monolithic commit you have now. I know you
> weren't probably planning to do that until numasched came along, but it
> sure would help me digest the differences in the two approaches.

Ok, this is a start. I'll have to review it again tomorrow and add more
docs before I can do a proper submission by email. If you're willing to
contribute, you can already review it using "git format-patch 9ca11f1"
after fetching the repo. Comments welcome!

git clone --reference linux -b autonuma-dev-smt git://git.kernel.org/pub/scm/linux/kernel/git/andaa.git

The last patch in that branch is the last feature I worked on
yesterday. It fixes the SMT load with numa02.c modified to use only
1 thread per core, which means changing THREADS from 24 to 12 at the
top of the numa02.c source (and then building it again in the
-DHARD_BIND and -DHARD_BIND -DINVERSE_BIND versions to compare with
autonuma on and off). It also fixes building the kernel in a loop in
KVM with 12 vcpus (now the load spreads over the two nodes).
echo 0 >/sys/kernel/mm/autonuma/scheduler/smt disables the SMT awareness.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-24  1:41     ` Nish Aravamudan
@ 2012-03-26 11:42       ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-03-26 11:42 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 2012-03-23 at 18:41 -0700, Nish Aravamudan wrote:
> [2012-03-20 00:46:33]   Unable to handle kernel paging request for data at address 0x00001688
> [2012-03-20 00:46:33]   Faulting instruction address: 0xc000000000168338
> [2012-03-20 00:46:33]   Oops: Kernel access of bad area, sig: 11 [#1]
> [2012-03-20 00:46:33]   SMP NR_CPUS=32 NUMA pSeries
> [2012-03-20 00:46:33]   Modules linked in:
> [2012-03-20 00:46:33]   NIP: c000000000168338 LR: c0000000001b523c CTR: 0000000000000000
> [2012-03-20 00:46:33]   REGS: c00000013d887700 TRAP: 0300   Not tainted (3.3.0-rc7)
> [2012-03-20 00:46:33]   MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 24004022  XER: 00000008
> [2012-03-20 00:46:33]   CFAR: 0000000000005374
> [2012-03-20 00:46:33]   DAR: 0000000000001688, DSISR: 40000000
> [2012-03-20 00:46:33]   TASK = c00000013d888000[1] 'swapper/0' THREAD: c00000013d884000 CPU: 0
> [2012-03-20 00:46:33]   GPR00: 0000000000000000 c00000013d887980 c000000000ce7990 00000000000012d0
> [2012-03-20 00:46:33]   GPR04: 0000000000000000 0000000000001680 0000000000000000 0003005500000001
> [2012-03-20 00:46:33]   GPR08: 0000000000000001 0000000000000000 c000000000d25000 0000000000000010
> [2012-03-20 00:46:33]   GPR12: 0000000044004024 c00000000fffa000 0000000000000000 0000000000000060
> [2012-03-20 00:46:33]   GPR16: c000000000a69040 c000000000a66828 0000000002e317f0 0000000001a3f930
> [2012-03-20 00:46:33]   GPR20: 0000000000000000 0000000000001680 0000000000000001 0000000000210d00
> [2012-03-20 00:46:33]   GPR24: c000000000d193a0 0000000000000000 0000000000001680 00000000000012d0
> [2012-03-20 00:46:33]   GPR28: 0000000000000000 0000000000000000 c000000000c5d6e8 c00000013e009200
> [2012-03-20 00:46:33]   NIP [c000000000168338] .__alloc_pages_nodemask+0xb8/0x860
> [2012-03-20 00:46:33]   LR [c0000000001b523c] .new_slab+0xcc/0x3d0
> [2012-03-20 00:46:33]   Call Trace:
> [2012-03-20 00:46:33]   [c00000013d887980] [c0000000001683dc] .__alloc_pages_nodemask+0x15c/0x860 (unreliable)
> [2012-03-20 00:46:33]   [c00000013d887b00] [c0000000001b523c] .new_slab+0xcc/0x3d0
> [2012-03-20 00:46:33]   [c00000013d887bb0] [c0000000007fc780] .__slab_alloc+0x388/0x4e0
> [2012-03-20 00:46:33]   [c00000013d887cd0] [c0000000001b5af8] .kmem_cache_alloc_node_trace+0x98/0x230
> [2012-03-20 00:46:33]   [c00000013d887d90] [c000000000b83ed0] .numa_init+0x90/0x1d0
> [2012-03-20 00:46:33]   [c00000013d887e20] [c00000000000ab60] .do_one_initcall+0x60/0x1e0
> [2012-03-20 00:46:33]   [c00000013d887ee0] [c000000000b5cad4] .kernel_init+0xf0/0x1e0
> [2012-03-20 00:46:33]   [c00000013d887f90] [c000000000021e14] .kernel_thread+0x54/0x70
> [2012-03-20 00:46:33]   Instruction dump:
> [2012-03-20 00:46:33]   0b000000 eb1e8000 3ba00000 801800a8 2f800000 409e001c 7860efe3 38000000
> [2012-03-20 00:46:33]   41820008 38000002 7b7d6fe2 7fbd0378 <e81a0008> 827800a4 3be00000 2fa00000
> [2012-03-20 00:46:33]   ---[ end trace 31fd0ba7d8756001 ]--- 

Can't say I've ever seen that one before.. that looks to be the
kzalloc() in numa_init(), which is run as an early_initcall(), way
after mm_init() and numa_policy_init() in init/main.c.

Where exactly in __alloc_pages_nodemask() is this?

The only thing I can think of is that the policy returned by
get_task_policy() is wonky and we get a weird zone_list, but that would
mean this is the first kmalloc() ever.. also all that should be set up
by now.

Hmm..
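
If that hunch is anywhere near right, a debugging aid along these lines
could help confirm it -- purely an illustrative sketch to drop locally
into alloc_pages_current()/alloc_pages_vma(), not part of the posted
series:

	/*
	 * Debugging sketch only: catch a bogus zonelist coming out of
	 * policy_zonelist() before __alloc_pages_nodemask() dereferences
	 * it.  Not from any posted patch.
	 */
	zl = policy_zonelist(gfp, pol, numa_node_id());
	WARN_ON_ONCE(!zl || !zl->_zonerefs[0].zone);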

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-03-20 22:18                 ` Rik van Riel
  2012-03-21 16:50                   ` Andrea Arcangeli
@ 2012-04-02 16:34                   ` Pekka Enberg
  2012-04-02 16:55                     ` Rik van Riel
  1 sibling, 1 reply; 153+ messages in thread
From: Pekka Enberg @ 2012-04-02 16:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	linux-kernel, linux-mm

On Wed, Mar 21, 2012 at 12:18 AM, Rik van Riel <riel@redhat.com> wrote:
> I suspect Java and other runtimes may have issues where
> they simply do not know which thread will end up using
> which objects from the heap heavily.

What kind of JVM workloads are you thinking of? Modern GCs use
thread-local allocation for performance reasons so I'd assume that
most of accesses are on local node.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-04-02 16:55                     ` Rik van Riel
@ 2012-04-02 16:54                       ` Pekka Enberg
  2012-04-02 17:12                         ` Pekka Enberg
  0 siblings, 1 reply; 153+ messages in thread
From: Pekka Enberg @ 2012-04-02 16:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	linux-kernel, linux-mm

Hi Rik,

On Wed, Mar 21, 2012 at 12:18 AM, Rik van Riel<riel@redhat.com>  wrote:
>>> I suspect Java and other runtimes may have issues where
>>> they simply do not know which thread will end up using
>>> which objects from the heap heavily.

On 04/02/2012 12:34 PM, Pekka Enberg wrote:
>> What kind of JVM workloads are you thinking of? Modern GCs use
>> thread-local allocation for performance reasons so I'd assume that
>> most of accesses are on local node.

On Mon, Apr 2, 2012 at 7:55 PM, Rik van Riel <riel@redhat.com> wrote:
> Yes, they use thread-local allocation.
>
> However, I suspect that after the memory has been allocated
> locally, it may quite often end up with another thread for
> further processing...

Do you have any specific workloads in mind? My experience makes me
assume the opposite for common JVM server workloads. (And yes, I'm
hand-waving here, I have no data to back that up.)

On Mon, Apr 2, 2012 at 7:55 PM, Rik van Riel <riel@redhat.com> wrote:
> The JVM doing the right thing only helps so much, when the
> Java program has no way to know about underlying things,
> or influence how the threads get scheduled on the JVM.
>
> Allowing us to discover which threads are accessing the
> same data, and figuring out what data each thread uses,
> could be useful for NUMA placement...

Sure, it's probably going to help for the kinds of workloads you're
describing. I'm just wondering how typical they are in the real world.

                        Pekka

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-04-02 16:34                   ` Pekka Enberg
@ 2012-04-02 16:55                     ` Rik van Riel
  2012-04-02 16:54                       ` Pekka Enberg
  0 siblings, 1 reply; 153+ messages in thread
From: Rik van Riel @ 2012-04-02 16:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	linux-kernel, linux-mm

On 04/02/2012 12:34 PM, Pekka Enberg wrote:
> On Wed, Mar 21, 2012 at 12:18 AM, Rik van Riel<riel@redhat.com>  wrote:
>> I suspect Java and other runtimes may have issues where
>> they simply do not know which thread will end up using
>> which objects from the heap heavily.
>
> What kind of JVM workloads are you thinking of? Modern GCs use
> thread-local allocation for performance reasons so I'd assume that
> most of accesses are on local node.

Yes, they use thread-local allocation.

However, I suspect that after the memory has been allocated
locally, it may quite often end up with another thread for
further processing...

The JVM doing the right thing only helps so much, when the
Java program has no way to know about underlying things,
or influence how the threads get scheduled on the JVM.

Allowing us to discover which threads are accessing the
same data, and figuring out what data each thread uses,
could be useful for NUMA placement...

I have some ideas on how to gather the information that
Andrea is gathering, with less space overhead. I will
try to present that idea today...

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-04-02 16:54                       ` Pekka Enberg
@ 2012-04-02 17:12                         ` Pekka Enberg
  2012-04-02 17:23                           ` Pekka Enberg
  0 siblings, 1 reply; 153+ messages in thread
From: Pekka Enberg @ 2012-04-02 17:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	linux-kernel, linux-mm

On Mon, Apr 2, 2012 at 7:54 PM, Pekka Enberg <penberg@kernel.org> wrote:
> Sure, it's probably going to help for the kinds of workloads you're
> describing. I'm just wondering how typical they are in the real world.

I don't have a NUMA machine to test this with but it'd be interesting
to see how AutoNUMA and sched/numa affect DaCapo benchmarks:

http://dacapobench.org/

I guess benchmarks that represent typical JVM server workloads are
tomcat and tradesoap. You can run them easily with this small shell
script:

#!/bin/sh

JAR=dacapo-9.12-bach.jar

if [ ! -f $JAR ];
then
  wget http://sourceforge.net/projects/dacapobench/files/9.12-bach/$JAR/download
fi

java -jar $JAR tomcat tradesoap | grep PASSED

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 00/26] sched/numa
  2012-04-02 17:12                         ` Pekka Enberg
@ 2012-04-02 17:23                           ` Pekka Enberg
  0 siblings, 0 replies; 153+ messages in thread
From: Pekka Enberg @ 2012-04-02 17:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Dan Smith, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	linux-kernel, linux-mm

On Mon, Apr 2, 2012 at 8:12 PM, Pekka Enberg <penberg@kernel.org> wrote:
> I guess benchmarks that represent typical JVM server workloads are
> tomcat and tradesoap. You can run them easily with this small shell
> script:

The default configuration for tomcat is too small to be interesting.
Here's a fixed-up version which uses a larger data set and 2 threads
per CPU.

#!/bin/sh

JAR=dacapo-9.12-bach.jar

if [ ! -f $JAR ];
then
  wget http://sourceforge.net/projects/dacapobench/files/9.12-bach/$JAR/download
fi

java -jar $JAR -k 2 -s large tomcat | grep PASSED

java -jar $JAR tradesoap | grep PASSED

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 06/26] mm: Migrate misplaced page
  2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
@ 2012-04-03 17:32   ` Dan Smith
  0 siblings, 0 replies; 153+ messages in thread
From: Dan Smith @ 2012-04-03 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Andrea Arcangeli,
	Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

PZ> XXX: hnaz, dansmith saw some bad_page() reports when using memcg, I
PZ> could not reproduce -- is there something funny with the mem_cgroup
PZ> calls in the below patch?

I think the problem stems from the final put_page() on the old page
being called before the charge commit. I think something like the
following should do the trick (and appears to work for me):

diff --git a/mm/migrate.c b/mm/migrate.c
index b7fa472..fd88f4b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1590,7 +1590,6 @@ migrate_misplaced_page(struct page *page, struct mm_struct
                put_page(page);         /* drop       "          "  */
 
                unlock_page(page);
-               put_page(page);         /* drop fault path ref & free */
 
                page = newpage;
        }
@@ -1599,6 +1598,9 @@ out:
        if (!charge)
                mem_cgroup_end_migration(mcg, oldpage, newpage, !rc);
 
+       if (oldpage != page)
+               put_page(oldpage);
+
        if (rc) {
                unlock_page(newpage);
                __free_page(newpage);


-- 
Dan Smith
IBM Linux Technology Center

^ permalink raw reply related	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment
  2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
@ 2012-06-15 18:16   ` Tony Luck
  2012-06-20 19:12     ` [PATCH] sched: Fix build problems when CONFIG_NUMA=y and CONFIG_SMP=n Luck, Tony
  0 siblings, 1 reply; 153+ messages in thread
From: Tony Luck @ 2012-06-15 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, Mar 16, 2012 at 7:40 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Rework the scheduler fork,exec hooks to allow home-node assignment.

Some compile errors on the (somewhat bizarre) CONFIG_SMP=n,
CONFIG_NUMA=y case:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> +
> +       select_task_node(p, p->mm, SD_BALANCE_FORK);
kernel/sched/core.c: In function ‘sched_fork’:
kernel/sched/core.c:1802: error: ‘SD_BALANCE_FORK’ undeclared (first
use in this function)

Also (from an earlier patch?)

In file included from kernel/sched/core.c:84:
kernel/sched/sched.h: In function ‘offnode_tasks’:
kernel/sched/sched.h:477: error: ‘struct rq’ has no member named ‘offnode_tasks’

-Tony

^ permalink raw reply	[flat|nested] 153+ messages in thread

* [PATCH] sched: Fix build problems when CONFIG_NUMA=y and CONFIG_SMP=n
  2012-06-15 18:16   ` Tony Luck
@ 2012-06-20 19:12     ` Luck, Tony
  0 siblings, 0 replies; 153+ messages in thread
From: Luck, Tony @ 2012-06-20 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

It is possible to have a single cpu system with both local
and remote memory.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Broken in linux-next for the past couple of days. Perhaps
we need some more stubs though - sched_fork() seems to need
#ifdef CONFIG_SMP around every other line ... not pretty.

Another approach would be to outlaw such strange configurations
and make sure that CONFIG_SMP is set whenever CONFIG_NUMA is set.
We had such a discussion a long time ago, and at that time
decided to keep supporting it. But with multi-core cpus now
the norm - perhaps it is time to change our minds.

 kernel/sched/core.c  |  2 ++
 kernel/sched/numa.c  | 16 ++++++++++++++++
 kernel/sched/sched.h |  2 +-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46460ac..f261599 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1799,7 +1799,9 @@ void sched_fork(struct task_struct *p)
 #endif
 	put_cpu();
 
+#ifdef CONFIG_SMP
 	select_task_node(p, p->mm, SD_BALANCE_FORK);
+#endif
 }
 
 /*
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
index 002f71c..4ff3b7c 100644
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -18,6 +18,21 @@
 #include "sched.h"
 
 
+#ifndef CONFIG_SMP
+void mm_init_numa(struct mm_struct *mm)
+{
+}
+void exit_numa(struct mm_struct *mm)
+{
+}
+void account_numa_dequeue(struct task_struct *p)
+{
+}
+__init void init_sched_numa(void)
+{
+}
+#else
+
 static const int numa_balance_interval = 2 * HZ; /* 2 seconds */
 
 struct numa_ops {
@@ -853,3 +868,4 @@ static __init int numa_init(void)
 	return 0;
 }
 early_initcall(numa_init);
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4134d37..9bf5ba8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -473,7 +473,7 @@ struct rq {
 
 static inline struct list_head *offnode_tasks(struct rq *rq)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) && defined(CONFIG_SMP)
 	return &rq->offnode_tasks;
 #else
 	return NULL;
-- 
1.7.10.2.552.gaa3bb87


^ permalink raw reply related	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-07-06 10:32   ` Johannes Weiner
  2012-07-06 13:46     ` [tip:sched/core] mm: Fix vmstat names-values off-by-one tip-bot for Johannes Weiner
  2012-07-06 14:48     ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Minchan Kim
  2012-07-06 14:54   ` Kyungmin Park
  1 sibling, 2 replies; 153+ messages in thread
From: Johannes Weiner @ 2012-07-06 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, linux-kernel, linux-mm

Hi Peter,

On Fri, Mar 16, 2012 at 03:40:30PM +0100, Peter Zijlstra wrote:
> Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
> to be compared to either a total of interleave allocations or to a miss
> count, remove it.
> 
> Fixing it would be possible, but since we've gone years without these
> statistics I figure we can continue that way.
> 
> This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---

> @@ -111,7 +111,6 @@ enum zone_stat_item {
>  	NUMA_HIT,		/* allocated in intended node */
>  	NUMA_MISS,		/* allocated in non intended node */
>  	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif

Can you guys include/fold this?

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: fix vmstat names-values off-by-one

"mm/mpol: Remove NUMA_INTERLEAVE_HIT" removed the NUMA_INTERLEAVE_HIT
item from the zone_stat_item enum, but left the corresponding name
string for it in the vmstat_text array.  As a result, all counters
that follow it have their name offset by one from their value.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmstat.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..e4db312 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -717,7 +717,6 @@ const char * const vmstat_text[] = {
 	"numa_hit",
 	"numa_miss",
 	"numa_foreign",
-	"numa_interleave",
 	"numa_local",
 	"numa_other",
 #endif
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 153+ messages in thread

* [tip:sched/core] mm: Fix vmstat names-values off-by-one
  2012-07-06 10:32   ` Johannes Weiner
@ 2012-07-06 13:46     ` tip-bot for Johannes Weiner
  2012-07-06 14:48     ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Minchan Kim
  1 sibling, 0 replies; 153+ messages in thread
From: tip-bot for Johannes Weiner @ 2012-07-06 13:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, a.p.zijlstra, efault, riel, aarcange,
	suresh.b.siddha, tglx, laijs, linux-kernel, hpa, pjt, hannes,
	paulmck, bharata.rao, Lee.Schermerhorn, danms

Commit-ID:  1d349dc80b3191fb654a04e3648e025258e80d46
Gitweb:     http://git.kernel.org/tip/1d349dc80b3191fb654a04e3648e025258e80d46
Author:     Johannes Weiner <hannes@cmpxchg.org>
AuthorDate: Fri, 6 Jul 2012 12:32:55 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 6 Jul 2012 12:52:09 +0200

mm: Fix vmstat names-values off-by-one

Commit e975d6ac08f3 ("mm/mpol: Remove NUMA_INTERLEAVE_HIT") removed
the NUMA_INTERLEAVE_HIT item from the zone_stat_item enum, but left
the corresponding name string for it in the vmstat_text array.

As a result, all counters that follow it have their name offset
by one from their value.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Dan Smith <danms@us.ibm.com>
Cc: Bharata B Rao <bharata.rao@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20120706103255.GA23680@cmpxchg.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/vmstat.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..e4db312 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -717,7 +717,6 @@ const char * const vmstat_text[] = {
 	"numa_hit",
 	"numa_miss",
 	"numa_foreign",
-	"numa_interleave",
 	"numa_local",
 	"numa_other",
 #endif

^ permalink raw reply related	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-06 10:32   ` Johannes Weiner
  2012-07-06 13:46     ` [tip:sched/core] mm: Fix vmstat names-values off-by-one tip-bot for Johannes Weiner
@ 2012-07-06 14:48     ` Minchan Kim
  2012-07-06 15:02       ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Minchan Kim @ 2012-07-06 14:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Rik van Riel, linux-kernel,
	linux-mm

Hi Hannes,

I already sent a patch about that but didn't get a reply from
Peter/Ingo.

https://lkml.org/lkml/2012/7/3/477

On Fri, Jul 06, 2012 at 12:32:55PM +0200, Johannes Weiner wrote:
> Hi Peter,
> 
> On Fri, Mar 16, 2012 at 03:40:30PM +0100, Peter Zijlstra wrote:
> > Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
> > to be compared to either a total of interleave allocations or to a miss
> > count, remove it.
> > 
> > Fixing it would be possible, but since we've gone years without these
> > statistics I figure we can continue that way.
> > 
> > This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> 
> > @@ -111,7 +111,6 @@ enum zone_stat_item {
> >  	NUMA_HIT,		/* allocated in intended node */
> >  	NUMA_MISS,		/* allocated in non intended node */
> >  	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> > -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> >  	NUMA_LOCAL,		/* allocation from local node */
> >  	NUMA_OTHER,		/* allocation from other node */
> >  #endif
> 
> Can you guys include/fold this?
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: fix vmstat names-values off-by-one
> 
> "mm/mpol: Remove NUMA_INTERLEAVE_HIT" removed the NUMA_INTERLEAVE_HIT
> item from the zone_stat_item enum, but left the corresponding name
> string for it in the vmstat_text array.  As a result, all counters
> that follow it have their name offset by one from their value.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmstat.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1bbbbd9..e4db312 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -717,7 +717,6 @@ const char * const vmstat_text[] = {
>  	"numa_hit",
>  	"numa_miss",
>  	"numa_foreign",
> -	"numa_interleave",
>  	"numa_local",
>  	"numa_other",
>  #endif
> -- 
> 1.7.7.6
> 

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
  2012-07-06 10:32   ` Johannes Weiner
@ 2012-07-06 14:54   ` Kyungmin Park
  2012-07-06 15:00     ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Kyungmin Park @ 2012-07-06 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, Mar 16, 2012 at 11:40 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
> to be compared to either a total of interleave allocations or to a miss
> count, remove it.
>
> Fixing it would be possible, but since we've gone years without these
> statistics I figure we can continue that way.
>
> This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  drivers/base/node.c    |    2 +-
>  include/linux/mmzone.h |    1 -
>  mm/mempolicy.c         |   66 +++++++++++++++--------------------------------
>  3 files changed, 22 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 5693ece..942cdbc 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -172,7 +172,7 @@ static ssize_t node_read_numastat(struct sys_device * dev,
>                        node_page_state(dev->id, NUMA_HIT),
>                        node_page_state(dev->id, NUMA_MISS),
>                        node_page_state(dev->id, NUMA_FOREIGN),
> -                      node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
> +                      0UL,
>                        node_page_state(dev->id, NUMA_LOCAL),
>                        node_page_state(dev->id, NUMA_OTHER));
>  }
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3ac040f..3a3be81 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,7 +111,6 @@ enum zone_stat_item {
>         NUMA_HIT,               /* allocated in intended node */
>         NUMA_MISS,              /* allocated in non intended node */
>         NUMA_FOREIGN,           /* was intended here, hit elsewhere */
> -       NUMA_INTERLEAVE_HIT,    /* interleaver preferred this zone */
>         NUMA_LOCAL,             /* allocation from local node */
>         NUMA_OTHER,             /* allocation from other node */
>  #endif
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index c3fdbcb..2c48c45 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1530,11 +1530,29 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>         return NULL;
>  }
>
> +/* Do dynamic interleaving for a process */
> +static unsigned interleave_nodes(struct mempolicy *policy)
> +{
> +       unsigned nid, next;
> +       struct task_struct *me = current;
> +
> +       nid = me->il_next;
> +       next = next_node(nid, policy->v.nodes);
> +       if (next >= MAX_NUMNODES)
> +               next = first_node(policy->v.nodes);
> +       if (next < MAX_NUMNODES)
> +               me->il_next = next;
> +       return nid;
> +}
> +
>  /* Return a zonelist indicated by gfp for node representing a mempolicy */
>  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>         int nd)
>  {
>         switch (policy->mode) {
> +       case MPOL_INTERLEAVE:
> +               nd = interleave_nodes(policy);
Just a nitpick: the original code also used 'unsigned nid', but now the
unsigned nid is assigned to 'int nd' here. Is that right?

Thank you,
Kyungmin Park
> +               break;
>         case MPOL_PREFERRED:
>                 if (!(policy->flags & MPOL_F_LOCAL))
>                         nd = policy->v.preferred_node;
> @@ -1556,21 +1574,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>         return node_zonelist(nd, gfp);
>  }
>
> -/* Do dynamic interleaving for a process */
> -static unsigned interleave_nodes(struct mempolicy *policy)
> -{
> -       unsigned nid, next;
> -       struct task_struct *me = current;
> -
> -       nid = me->il_next;
> -       next = next_node(nid, policy->v.nodes);
> -       if (next >= MAX_NUMNODES)
> -               next = first_node(policy->v.nodes);
> -       if (next < MAX_NUMNODES)
> -               me->il_next = next;
> -       return nid;
> -}
> -
>  /*
>   * Depending on the memory policy provide a node from which to allocate the
>   * next slab entry.
> @@ -1801,21 +1804,6 @@ out:
>         return ret;
>  }
>
> -/* Allocate a page in interleaved policy.
> -   Own path because it needs to do special accounting. */
> -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> -                                       unsigned nid)
> -{
> -       struct zonelist *zl;
> -       struct page *page;
> -
> -       zl = node_zonelist(nid, gfp);
> -       page = __alloc_pages(gfp, order, zl);
> -       if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
> -               inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
> -       return page;
> -}
> -
>  /**
>   *     alloc_pages_vma - Allocate a page for a VMA.
>   *
> @@ -1848,15 +1836,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>         struct page *page;
>
>         get_mems_allowed();
> -       if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> -               unsigned nid;
> -
> -               nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
> -               mpol_cond_put(pol);
> -               page = alloc_page_interleave(gfp, order, nid);
> -               put_mems_allowed();
> -               return page;
> -       }
>         zl = policy_zonelist(gfp, pol, node);
>         if (unlikely(mpol_needs_cond_ref(pol))) {
>                 /*
> @@ -1909,12 +1888,9 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>          * No reference counting needed for current->mempolicy
>          * nor system default_policy
>          */
> -       if (pol->mode == MPOL_INTERLEAVE)
> -               page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> -       else
> -               page = __alloc_pages_nodemask(gfp, order,
> -                               policy_zonelist(gfp, pol, numa_node_id()),
> -                               policy_nodemask(gfp, pol));
> +       page = __alloc_pages_nodemask(gfp, order,
> +                       policy_zonelist(gfp, pol, numa_node_id()),
> +                       policy_nodemask(gfp, pol));
>         put_mems_allowed();
>         return page;
>  }
>
>
>

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-06 14:54   ` Kyungmin Park
@ 2012-07-06 15:00     ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-06 15:00 UTC (permalink / raw)
  To: Kyungmin Park
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 2012-07-06 at 23:54 +0900, Kyungmin Park wrote:
> >  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
> >         int nd)
> >  {
> >         switch (policy->mode) {
> > +       case MPOL_INTERLEAVE:
> > +               nd = interleave_nodes(policy);
> Jut nitpick,
> Original code also uses the 'unsigned nid' but now it assigned
> 'unsigned nid' to 'int nd' at here. does it right? 

Node ids are generally signed; we use -1 as a special value indicating
no node preference in a number of places. Not sure why it was unsigned
here. Also, I think even SGI isn't anywhere near 2^31 nodes.

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-06 14:48     ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Minchan Kim
@ 2012-07-06 15:02       ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-06 15:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Rik van Riel, linux-kernel,
	linux-mm

On Fri, 2012-07-06 at 23:48 +0900, Minchan Kim wrote:
> 
> I alreay sent a patch about that but didn't have a reply from
> Peter/Ingo.
> 
> https://lkml.org/lkml/2012/7/3/477 

Yeah sorry for that.. it looks like Ingo picked up the fix from hnaz
though.

Thanks both!

^ permalink raw reply	[flat|nested] 153+ messages in thread

* Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-03-23 11:50   ` Mel Gorman
@ 2012-07-06 16:38     ` Rik van Riel
  2012-07-06 20:04       ` Lee Schermerhorn
  2012-07-09 11:48       ` Peter Zijlstra
  0 siblings, 2 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-06 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Johannes Weiner,
	linux-kernel, linux-mm

On 03/23/2012 07:50 AM, Mel Gorman wrote:
> On Fri, Mar 16, 2012 at 03:40:31PM +0100, Peter Zijlstra wrote:
>> From: Lee Schermerhorn<Lee.Schermerhorn@hp.com>
>>
>> This patch adds another mbind() flag to request "lazy migration".
>> The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
>> pages are simply unmapped from the calling task's page table ['_MOVE]
>> or from all referencing page tables [_MOVE_ALL].  Anon pages will first
>> be added to the swap [or migration?] cache, if necessary.  The pages
>> will be migrated in the fault path on "first touch", if the policy
>> dictates at that time.
>>
>> <SNIP>
>>
>> @@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
>>   }
>>
>>   /*
>> + * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
>> + * necessary.  Migration will occur, if policy dictates, when a task faults
>> + * an unmapped page back into its page table--i.e., on "first touch" after
>> + * unmapping.  Note that migrate-on-fault only migrates pages whose mapping
>> + * [e.g., file system] supplies a migratepage op, so we skip pages that
>> + * wouldn't migrate on fault.
>> + *
>> + * Pages are placed back on the lru whether or not they were successfully
>> + * unmapped.  Like migrate_pages().
>> + *
>> + * Unlike migrate_pages(), this function is only called in the context of
>> + * a task that is unmapping its own pages while holding its map semaphore
>> + * for write.
>> + */
>> +int migrate_pages_unmap_only(struct list_head *pagelist)
>
> I'm not properly reviewing these patches at the moment but am taking a
> quick look as I play some catch up on linux-mm.
>
> I think it's worth pointing out that this potentially will confuse
> reclaim. Let's say a process is being migrated to another node and its
> pages get unmapped like this; then some heuristics will change.
>
> 1. If the page was referenced prior to the unmapping then it should be
>     activated if the page reached the end of the LRU due to the checks
>     in page_check_references(). If the process has been unmapped for
>     migrate-on-fault, the pages will instead be reclaimed.
>
> 2. The heuristic that applies pressure to slab pages if pages are mapped
>     is changed. Prior to migrate-on-fault sc->nr_scanned is incremented
>     for mapped pages to increase the number of slab pages scanned to
>     avoid swapping. During migrate-on-fault, this pressure is relieved
>
> 3. zone_reclaim_mode in default mode will reclaim pages it would
>     previously have skipped over. It potentially will call shrink_zone more
>     for the local node than falling back to other nodes because it thinks
>     most pages are unmapped. This could lead to some thrashing.
>
> It may not even be a major problem but it's worth thinking about. If it
> is a problem, it will be necessary to account for migrate-on-fault pages
> similar to mapped pages during reclaim.

I can see other serious issues with this approach:

4. Putting a lot of pages in the swap cache ends up allocating
    swap space. This means this NUMA migration scheme will only
    work on systems that have a substantial amount of memory
    represented by swap space. This is highly unlikely on systems
    with memory in the TB range. On smaller systems, it could drive the
    system out of memory (to the OOM killer) by "filling up" the swap
    space, which is meant as overflow, with migration pages instead.

5. In the long run, we want the ability to migrate transparent
    huge pages as one unit.  The reason is simple: the performance
    penalty for running on the wrong NUMA node (10-20%) is on the
    same order of magnitude as the performance penalty for running
    with 4kB pages instead of 2MB pages (5-15%).

    Breaking up large pages into small ones, and having khugepaged
    reconstitute them on a random NUMA node later on, will negate
    the performance benefits of both NUMA placement and THP.

In short, while this approach made sense when Lee first proposed
it several years ago (with smaller memory systems, and before Linux
had transparent huge pages), I do not believe it is an acceptable
approach to NUMA migration any more.

We really want something like PROT_NONE or PTE_NUMA page table
(and page directory) entries, so we can avoid filling up swap
space with migration pages and have the possibility of migrating
transparent huge pages in one piece at some point.

In other words, NAK to this patch

-- 
All rights reversed


* Re: [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP
  2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
@ 2012-07-06 18:40   ` Rik van Riel
  0 siblings, 0 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-06 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 03/16/2012 10:40 AM, Peter Zijlstra wrote:

Reasonable idea, but we need something other than a blind
unmap and add to swap space, which requires people to run
with gigantic amounts of swap space they will likely never
use.

I suspect that Andrea's _PAGE_NUMA stuff could be implemented
using _PAGE_PROTNONE, and then we can simply call the NUMA
faulting/migration handler whenever we run into a _PAGE_PROTNONE
page in handle_mm_fault / handle_pte_fault.

This overloading of _PAGE_PROTNONE should work fine, because
do_page_fault will never call handle_mm_fault if the fault is
happening on a PROT_NONE VMA. Only if we have the correct VMA
permission will handle_mm_fault be called, at which point we
can fix the pte (and maybe migrate the page).

The same trick can be done at the pmd level for transparent
hugepages, allowing the entire THP to be migrated in one shot,
with just one fault.

Is there any reason why _PAGE_PROTNONE could not work instead
of _PAGE_NUMA or the swap cache thing?
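
For illustration only, a rough sketch of that scheme (none of this is in
the posted series; pte_numa_marker() and do_numa_hinting_fault() are
made-up names, and the flag handling shown is x86-specific):

#include <linux/mm.h>
#include <asm/pgtable.h>	/* pte_flags(), _PAGE_PRESENT, _PAGE_PROTNONE */

/*
 * Sketch: a pte used as a NUMA hinting marker has _PAGE_PRESENT cleared
 * (so the hardware faults) and _PAGE_PROTNONE set.  Since do_page_fault()
 * never calls handle_mm_fault() for an access to a real PROT_NONE VMA,
 * seeing this combination under a VMA that does allow the access can only
 * mean "migrate-on-fault candidate".
 */
static inline bool pte_numa_marker(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE)) ==
							_PAGE_PROTNONE;
}

/*
 * Would be called from handle_pte_fault() when pte_numa_marker() is true:
 * restore the original protections and, if the placement policy says the
 * page is on the wrong node, hand it to the lazy migration path instead
 * of the swap cache.  Locking (pte lock) is omitted in this sketch.
 */
static int do_numa_hinting_fault(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 unsigned long addr, pte_t *ptep, pte_t orig)
{
	pte_t pte = pte_set_flags(pte_clear_flags(orig, _PAGE_PROTNONE),
				  _PAGE_PRESENT);

	set_pte_at(mm, addr, ptep, pte);
	update_mmu_cache(vma, addr, ptep);

	/* check the mempolicy here and migrate the page if it is remote */
	return 0;
}

The same marker at the pmd level would let a THP fault be taken and the
whole 2MB page migrated as one unit, as suggested above.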

-- 
All rights reversed


* Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-07-06 16:38     ` Rik van Riel
@ 2012-07-06 20:04       ` Lee Schermerhorn
  2012-07-06 20:27         ` Rik van Riel
  2012-07-09 11:48       ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Lee Schermerhorn @ 2012-07-06 20:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Andrea Arcangeli, Johannes Weiner, linux-kernel,
	linux-mm

On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:
> On 03/23/2012 07:50 AM, Mel Gorman wrote:
> > On Fri, Mar 16, 2012 at 03:40:31PM +0100, Peter Zijlstra wrote:
> >> From: Lee Schermerhorn<Lee.Schermerhorn@hp.com>
> >>
> >> This patch adds another mbind() flag to request "lazy migration".
> >> The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> >> pages are simply unmapped from the calling task's page table ['_MOVE]
> >> or from all referencing page tables [_MOVE_ALL].  Anon pages will first
> >> be added to the swap [or migration?] cache, if necessary.  The pages
> >> will be migrated in the fault path on "first touch", if the policy
> >> dictates at that time.
> >>
> >> <SNIP>
> >>
> >> @@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
> >>   }
> >>
> >>   /*
> >> + * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
> >> + * necessary.  Migration will occur, if policy dictates, when a task faults
> >> + * an unmapped page back into its page table--i.e., on "first touch" after
> >> + * unmapping.  Note that migrate-on-fault only migrates pages whose mapping
> >> + * [e.g., file system] supplies a migratepage op, so we skip pages that
> >> + * wouldn't migrate on fault.
> >> + *
> >> + * Pages are placed back on the lru whether or not they were successfully
> >> + * unmapped.  Like migrate_pages().
> >> + *
> >> + * Unlike migrate_pages(), this function is only called in the context of
> >> + * a task that is unmapping its own pages while holding its map semaphore
> >> + * for write.
> >> + */
> >> +int migrate_pages_unmap_only(struct list_head *pagelist)
> >
> > I'm not properly reviewing these patches at the moment but am taking a
> > quick look as I play some catch up on linux-mm.
> >
> > I think it's worth pointing out that this potentially will confuse
> > reclaim. Let's say a process is being migrated to another node and it
> > gets unmapped like this; then some heuristics will change.
> >
> > 1. If the page was referenced prior to the unmapping then it should be
> >     activated if the page reached the end of the LRU due to the checks
> >     in page_check_references(). If the process has been unmapped for
> >     migrate-on-fault, the pages will instead be reclaimed.
> >
> > 2. The heuristic that applies pressure to slab pages if pages are mapped
> >     is changed. Prior to migrate-on-fault sc->nr_scanned is incremented
> >     for mapped pages to increase the number of slab pages scanned to
> >     avoid swapping. During migrate-on-fault, this pressure is relieved
> >
> > 3. zone_reclaim_mode in default mode will reclaim pages it would
> >     previously have skipped over. It potentially will call shrink_zone more
> >     for the local node than falling back to other nodes because it thinks
> >     most pages are unmapped. This could lead to some thrashing.
> >
> > It may not even be a major problem but it's worth thinking about. If it
> > is a problem, it will be necessary to account for migrate-on-fault pages
> > similar to mapped pages during reclaim.
> 
> I can see other serious issues with this approach:
> 
> 4. Putting a lot of pages in the swap cache ends up allocating
>     swap space. This means this NUMA migration scheme will only
>     work on systems that have a substantial amount of memory
>     represented by swap space. This is highly unlikely on systems
>     with memory in the TB range. On smaller systems, it could drive
>     the system out of memory (to the OOM killer), by "filling up"
>     the overflow swap with migration pages instead.
> 5. In the long run, we want the ability to migrate transparent
>     huge pages as one unit.  The reason is simple, the performance
>     penalty for running on the wrong NUMA node (10-20%) is on the
>     same order of magnitude as the performance penalty for running
>     with 4kB pages instead of 2MB pages (5-15%).
> 
>     Breaking up large pages into small ones, and having khugepaged
>     reconstitute them on a random NUMA node later on, will negate
>     the performance benefits of both NUMA placement and THP.
> 
> In short, while this approach made sense when Lee first proposed
> it several years ago (with smaller memory systems, and before Linux
> had transparent huge pages), I do not believe it is an acceptable
> approach to NUMA migration any more.
> 
> We really want something like PROT_NONE or PTE_NUMA page table
> (and page directory) entries, so we can avoid filling up swap
> space with migration pages and have the possibility of migrating
> transparent huge pages in one piece at some point.
> 
> In other words, NAK to this patch
> 

When I originally posted the "migrate on fault" series, I posted a
separate series with a "migration cache" to avoid the use of swap space
for lazy migration: http://markmail.org/message/xgvvrnn2nk4nsn2e.

The migration cache was originally implemented by Marcelo Tosatti for
the old memory hotplug project:
http://marc.info/?l=linux-mm&m=109779128211239&w=4.

The idea is that you don't need swap space for lazy migration, just an
"address_space" where you can park an anon VMA's pte's while they're
"unmapped" to cause migration faults.  Based on a suggestion from
Christoph Lameter, I had tried to hide the migration cache behind the
swap cache interface to minimize changes mainly in do_swap_page and
vmscan/reclaim.  It seemed to work, but the difference in reference
count semantics for the mig cache -- entry removed when last pte
migrated/mapped -- makes coordination with exit teardown, uh, tricky.

Regards,
Lee





* Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-07-06 20:04       ` Lee Schermerhorn
@ 2012-07-06 20:27         ` Rik van Riel
  0 siblings, 0 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-06 20:27 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Dan Smith,
	Bharata B Rao, Andrea Arcangeli, Johannes Weiner, linux-kernel,
	linux-mm

On 07/06/2012 04:04 PM, Lee Schermerhorn wrote:
> On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:

>> 4. Putting a lot of pages in the swap cache ends up allocating
>>      swap space. This means this NUMA migration scheme will only
>>      work on systems that have a substantial amount of memory
>>      represented by swap space. This is highly unlikely on systems
>>      with memory in the TB range. On smaller systems, it could drive
>>      the system out of memory (to the OOM killer), by "filling up"
>>      the overflow swap with migration pages instead.
>> 5. In the long run, we want the ability to migrate transparent
>>      huge pages as one unit.  The reason is simple, the performance
>>      penalty for running on the wrong NUMA node (10-20%) is on the
>>      same order of magnitude as the performance penalty for running
>>      with 4kB pages instead of 2MB pages (5-15%).
>>
>>      Breaking up large pages into small ones, and having khugepaged
>>      reconstitute them on a random NUMA node later on, will negate
>>      the performance benefits of both NUMA placement and THP.

> When I originally posted the "migrate on fault" series, I posted a
> separate series with a "migration cache" to avoid the use of swap space
> for lazy migration: http://markmail.org/message/xgvvrnn2nk4nsn2e.
>
> The migration cache was originally implemented by Marcello Tosatti for
> the old memory hotplug project:
> http://marc.info/?l=linux-mm&m=109779128211239&w=4.
>
> The idea is that you don't need swap space for lazy migration, just an
> "address_space" where you can park an anon VMA's pte's while they're
> "unmapped" to cause migration faults.  Based on a suggestion from
> Christoph Lameter, I had tried to hide the migration cache behind the
> swap cache interface to minimize changes mainly in do_swap_page and
> vmscan/reclaim.  It seemed to work, but the difference in reference
> count semantics for the mig cache -- entry removed when last pte
> migrated/mapped -- makes coordination with exit teardown, uh, tricky.

That fixes one of the two problems, but using _PTE_NUMA
or _PAGE_PROTNONE looks like it would be both easier,
and solve both.

-- 
All rights reversed


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
@ 2012-07-07 18:26   ` Rik van Riel
  2012-07-09 12:05     ` Peter Zijlstra
  2012-07-09 12:23     ` Peter Zijlstra
  2012-07-08 18:35   ` Rik van Riel
  2012-07-12 22:02   ` Rik van Riel
  2 siblings, 2 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-07 18:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 03/16/2012 10:40 AM, Peter Zijlstra wrote:

> +/*
> + * Assumes symmetric NUMA -- that is, each node is of equal size.
> + */
> +static void set_max_mem_load(unsigned long load)
> +{
> +	unsigned long old_load;
> +
> +	spin_lock(&max_mem_load.lock);
> +	old_load = max_mem_load.load;
> +	if (!old_load)
> +		old_load = load;
> +	max_mem_load.load = (old_load + load) >> 1;
> +	spin_unlock(&max_mem_load.lock);
> +}

The above in your patch kind of conflicts with this bit
from patch 6/26:

+	/*
+	 * Migration allocates pages in the highest zone. If we cannot
+	 * do so then migration (at least from node to node) is not
+	 * possible.
+	 */
+	if (vma->vm_file &&
+		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
+								< policy_zone)
+			return 0;

Looking at how the memory load code is used, I wonder
if it would make sense to count "zone size - free - inactive
file" pages instead?

> +			/*
> +			 * Avoid migrating ne's when we'll know we'll push our
> +			 * node over the memory limit.
> +			 */
> +			if (max_mem_load &&
> +			    imb->mem_load + mem_moved + ne_mem > max_mem_load)
> +				goto next;

> +static void numa_balance(struct node_queue *this_nq)
> +{
> +	struct numa_imbalance imb;
> +	int busiest;
> +
> +	busiest = find_busiest_node(this_nq->node, &imb);
> +	if (busiest == -1)
> +		return;
> +
> +	if (imb.cpu <= 0 && imb.mem <= 0)
> +		return;
> +
> +	move_processes(nq_of(busiest), this_nq, &imb);
> +}

You asked how and why Andrea's algorithm converges.
After looking at both patch sets for a while, and asking
for clarification, I think I can see how his code converges.

It is not yet clear to me how and why your code converges.

I see some dual bin packing (CPU & memory) heuristics, but
it is not at all clear to me how they interact, especially
when workloads are going active and idle on a regular basis.

-- 
All rights reversed


* Re: [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities
  2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
@ 2012-07-08 18:34   ` Rik van Riel
  2012-07-09 12:26     ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Rik van Riel @ 2012-07-08 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 03/16/2012 10:40 AM, Peter Zijlstra wrote:

> +static u64 process_cpu_runtime(struct numa_entity *ne)
> +{
> +	struct task_struct *p, *t;
> +	u64 runtime = 0;
> +
> +	rcu_read_lock();
> +	t = p = ne_owner(ne);
> +	if (p) do {
> +		runtime += t->se.sum_exec_runtime; // @#$#@ 32bit
> +	} while ((t = next_thread(t)) != p);
> +	rcu_read_unlock();
> +
> +	return runtime;
> +}

> +	/*
> +	 * Don't bother migrating memory if there's less than 1 second
> +	 * of runtime on the tasks.
> +	 */
> +	if (ne->nops->cpu_runtime(ne) < NSEC_PER_SEC)
> +		return false;

Do we really want to calculate the amount of CPU time used
by a process, and start migrating after just one second?

Or would it be ok to start migrating once a process has
been scanned once or twice by the NUMA code?
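
For illustration, a sketch of that alternative (the nr_balance_passes
counter is invented for this sketch and does not exist in the posted
patches):

/*
 * Sketch only: assumes a nr_balance_passes counter added to struct
 * numa_entity, incremented under the node_queue lock each time the
 * balancer looks at this entity.
 */
static bool ne_seen_often_enough(struct numa_entity *ne)
{
	/*
	 * Instead of summing sum_exec_runtime over every thread in the
	 * process, only migrate once the balancer has considered this
	 * entity a couple of times.
	 */
	return ++ne->nr_balance_passes >= 2;
}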

-- 
All rights reversed


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
  2012-07-07 18:26   ` Rik van Riel
@ 2012-07-08 18:35   ` Rik van Riel
  2012-07-09 12:25     ` Peter Zijlstra
  2012-07-12 22:02   ` Rik van Riel
  2 siblings, 1 reply; 153+ messages in thread
From: Rik van Riel @ 2012-07-08 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 03/16/2012 10:40 AM, Peter Zijlstra wrote:

+static bool can_move_ne(struct numa_entity *ne)
+{
+	/*
+	 * XXX: consider mems_allowed, stinking cpusets has mems_allowed
+	 * per task and it can actually differ over a whole process, la-la-la.
+	 */
+	return true;
+}

This looks like something that should be fixed before the
code is submitted for merging upstream.

-- 
All rights reversed


* Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
  2012-07-06 16:38     ` Rik van Riel
  2012-07-06 20:04       ` Lee Schermerhorn
@ 2012-07-09 11:48       ` Peter Zijlstra
  1 sibling, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 11:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Johannes Weiner,
	linux-kernel, linux-mm

On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:
> We really want something like PROT_NONE

Yeah, that makes sense, I'll have a look at PROT_NONE.


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-07 18:26   ` Rik van Riel
@ 2012-07-09 12:05     ` Peter Zijlstra
  2012-07-09 12:23     ` Peter Zijlstra
  1 sibling, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 12:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Sat, 2012-07-07 at 14:26 -0400, Rik van Riel wrote:
> > +/*
> > + * Assumes symmetric NUMA -- that is, each node is of equal size.
> > + */
> > +static void set_max_mem_load(unsigned long load)
> > +{
> > +     unsigned long old_load;
> > +
> > +     spin_lock(&max_mem_load.lock);
> > +     old_load = max_mem_load.load;
> > +     if (!old_load)
> > +             old_load = load;
> > +     max_mem_load.load = (old_load + load) >> 1;
> > +     spin_unlock(&max_mem_load.lock);
> > +}
> 
> The above in your patch kind of conflicts with this bit
> from patch 6/26:

Yeah,.. it's pretty broken. It's also effectively disabled, but yeah.


> Looking at how the memory load code is used, I wonder
> if it would make sense to count "zone size - free - inactive
> file" pages instead? 

Something like that, although I guess we'd want a sum over all zones in
a node for that.
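
Something like the below, perhaps (untested sketch only; whether to use
present_pages or spanned_pages, and how to guard against underflow, is
left open):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/*
 * Sketch: per-node memory load as "zone size - free - inactive file",
 * summed over all populated zones of the node.
 */
static unsigned long node_mem_load(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	unsigned long load = 0;
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];
		unsigned long unused;

		if (!populated_zone(zone))
			continue;

		unused = zone_page_state(zone, NR_FREE_PAGES) +
			 zone_page_state(zone, NR_INACTIVE_FILE);
		if (zone->present_pages > unused)
			load += zone->present_pages - unused;
	}
	return load;
}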


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-07 18:26   ` Rik van Riel
  2012-07-09 12:05     ` Peter Zijlstra
@ 2012-07-09 12:23     ` Peter Zijlstra
  2012-07-09 12:40       ` Peter Zijlstra
  1 sibling, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 12:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Sat, 2012-07-07 at 14:26 -0400, Rik van Riel wrote:
> 
> You asked how and why Andrea's algorithm converges.
> After looking at both patch sets for a while, and asking
> for clarification, I think I can see how his code converges.

Do share.. what does it balance on and where does it converge to?

> It is not yet clear to me how and why your code converges.

I don't think it does.. but since the scheduler interaction is fairly
weak it doesn't matter too much from that pov.

> I see some dual bin packing (CPU & memory) heuristics, but
> it is not at all clear to me how they interact, especially
> when workloads are going active and idle on a regular basis.
> 
Right, this is the bit I wanted discussion on most.. it is not at all
clear to me what one would want it to do. Given sufficient memory you'd
want it to slowly follow the cpu load. However, under memory pressure
you can't do that.

Spreading memory evenly across nodes doesn't make much sense if the
compute time and capacity aren't matched either.

Also, a pond will never settle if you keep throwing rocks in; you need
semi-stable operating conditions for anything to make sense. So the only
thing to consider for the wildly dynamic case is not going bananas along
with it.


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-08 18:35   ` Rik van Riel
@ 2012-07-09 12:25     ` Peter Zijlstra
  2012-07-09 14:54       ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 12:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Sun, 2012-07-08 at 14:35 -0400, Rik van Riel wrote:
> 
> This looks like something that should be fixed before the
> code is submitted for merging upstream. 

static bool __task_can_migrate(struct task_struct *t, u64 *runtime, int node)
{
#ifdef CONFIG_CPUSETS
        if (!node_isset(node, t->mems_allowed))
                return false;
#endif

        if (!cpumask_intersects(cpumask_of_node(node), tsk_cpus_allowed(t)))
                return false;

        *runtime += t->se.sum_exec_runtime; // @#$#@ 32bit

        return true;
}

static bool process_can_migrate(struct numa_entity *ne, int node)
{
        struct task_struct *p, *t;
        bool allowed = false;
        u64 runtime = 0;

        rcu_read_lock();
        t = p = ne_owner(ne);
        if (p) do {
                allowed = __task_can_migrate(t, &runtime, node);
                if (!allowed)
                        break;
        } while ((t = next_thread(t)) != p);
        rcu_read_unlock();

        /*
         * Don't bother migrating memory if there's less than 1 second
         * of runtime on the tasks.
         */
        return allowed && runtime > NSEC_PER_SEC;
}

is what it looks like..


* Re: [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities
  2012-07-08 18:34   ` Rik van Riel
@ 2012-07-09 12:26     ` Peter Zijlstra
  2012-07-09 14:53       ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 12:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Sun, 2012-07-08 at 14:34 -0400, Rik van Riel wrote:
> On 03/16/2012 10:40 AM, Peter Zijlstra wrote:
> 
> > +static u64 process_cpu_runtime(struct numa_entity *ne)
> > +{
> > +	struct task_struct *p, *t;
> > +	u64 runtime = 0;
> > +
> > +	rcu_read_lock();
> > +	t = p = ne_owner(ne);
> > +	if (p) do {
> > +		runtime += t->se.sum_exec_runtime; // @#$#@ 32bit
> > +	} while ((t = next_thread(t)) != p);
> > +	rcu_read_unlock();
> > +
> > +	return runtime;
> > +}
> 
> > +	/*
> > +	 * Don't bother migrating memory if there's less than 1 second
> > +	 * of runtime on the tasks.
> > +	 */
> > +	if (ne->nops->cpu_runtime(ne) < NSEC_PER_SEC)
> > +		return false;
> 
> Do we really want to calculate the amount of CPU time used
> by a process, and start migrating after just one second?
> 
> Or would it be ok to start migrating once a process has
> been scanned once or twice by the NUMA code?

You mean, the 2-3rd time we try and migrate this task, not the memory
scanning thing as per Andrea, right?

Yeah, that might work too.. 


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-09 12:23     ` Peter Zijlstra
@ 2012-07-09 12:40       ` Peter Zijlstra
  2012-07-09 14:50         ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 12:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Mon, 2012-07-09 at 14:23 +0200, Peter Zijlstra wrote:
> > It is not yet clear to me how and why your code converges.
> 
> I don't think it does.. but since the scheduler interaction is fairly
> weak it doesn't matter too much from that pov.
> 
That is,.. it slowly moves along with the cpu usage; only if there's a
lot of remote memory allocations (memory pressure) do things get funny.

It'll try and rotate all tasks around a bit, but there's no good
solution for a memory hole on one node and a cpu hole on another; you're
going to have to take the remote hits.

Again.. what do we want it to do?


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-09 12:40       ` Peter Zijlstra
@ 2012-07-09 14:50         ` Rik van Riel
  0 siblings, 0 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-09 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 07/09/2012 08:40 AM, Peter Zijlstra wrote:
> On Mon, 2012-07-09 at 14:23 +0200, Peter Zijlstra wrote:
>>> It is not yet clear to me how and why your code converges.
>>
>> I don't think it does.. but since the scheduler interaction is fairly
>> weak it doesn't matter too much from that pov.

Fair enough. It is just that you asked this same question
about Andrea's code, and I was asking myself that question
while reading your code (and failing to figure it out).

> That is,.. it slowly moves along with the cpu usage, only if there's a
> lot of remote memory allocations (memory pressure) things get funny.
>
> It'll try and rotate all tasks around a bit trying, but there's no good
> solution for a memory hole on one node and a cpu hole on another, you're
> going to have to take the remote hits.

Agreed, I suspect both your code and Andrea's code will
end up behaving fairly similarly in that situation.

> Again.. what do we want it to do?

That is a good question.

We can have various situations to deal with:

1) tasks fit nicely inside NUMA nodes
2) some tasks have more memory than what fits
    in a NUMA node
3) some tasks have more threads than what fits
    in a NUMA node
4) a combination of the above

I guess what we want the NUMA code to do is increase
the number of local memory accesses for each thread,
and to do so in a relatively lightweight way.

-- 
All rights reversed


* Re: [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities
  2012-07-09 12:26     ` Peter Zijlstra
@ 2012-07-09 14:53       ` Rik van Riel
  2012-07-09 14:55         ` Peter Zijlstra
  0 siblings, 1 reply; 153+ messages in thread
From: Rik van Riel @ 2012-07-09 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 07/09/2012 08:26 AM, Peter Zijlstra wrote:
> On Sun, 2012-07-08 at 14:34 -0400, Rik van Riel wrote:

>> Do we really want to calculate the amount of CPU time used
>> by a process, and start migrating after just one second?
>>
>> Or would it be ok to start migrating once a process has
>> been scanned once or twice by the NUMA code?
>
> You mean, the 2-3rd time we try and migrate this task, not the memory
> scanning thing as per Andrea, right?

Indeed.  That way we can simply keep a flag somewhere,
instead of iterating over the threads in a process.

-- 
All rights reversed


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-09 12:25     ` Peter Zijlstra
@ 2012-07-09 14:54       ` Rik van Riel
  0 siblings, 0 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-09 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 07/09/2012 08:25 AM, Peter Zijlstra wrote:
> On Sun, 2012-07-08 at 14:35 -0400, Rik van Riel wrote:
>>
>> This looks like something that should be fixed before the
>> code is submitted for merging upstream.
>
> static bool __task_can_migrate(struct task_struct *t, u64 *runtime, int node)
> {
...
> is what it looks like..

Looks good to me.

-- 
All rights reversed


* Re: [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities
  2012-07-09 14:53       ` Rik van Riel
@ 2012-07-09 14:55         ` Peter Zijlstra
  0 siblings, 0 replies; 153+ messages in thread
From: Peter Zijlstra @ 2012-07-09 14:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On Mon, 2012-07-09 at 10:53 -0400, Rik van Riel wrote:
> On 07/09/2012 08:26 AM, Peter Zijlstra wrote:
> > On Sun, 2012-07-08 at 14:34 -0400, Rik van Riel wrote:
> 
> >> Do we really want to calculate the amount of CPU time used
> >> by a process, and start migrating after just one second?
> >>
> >> Or would it be ok to start migrating once a process has
> >> been scanned once or twice by the NUMA code?
> >
> > You mean, the 2-3rd time we try and migrate this task, not the memory
> > scanning thing as per Andrea, right?
> 
> Indeed.  That way we can simply keep a flag somewhere,
> instead of iterating over the threads in a process.

Note that the code in -tip needs to iterate over all tasks in order to
test all cpus_allowed and mems_allowed masks. But we could keep a
process-wide intersection of those masks around as well, I guess;
updating them is a slow path anyway.
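
A sketch of what that cached intersection could look like (the
numa_cpus_allowed/numa_mems_allowed fields on signal_struct are invented
here; the recomputation would hook into the slow paths that change a
thread's masks, e.g. set_cpus_allowed_ptr() and cpuset updates):

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>

/*
 * Sketch: cache the intersection of every thread's masks on the shared
 * signal_struct, recomputed from the (slow) paths that change them.
 */
static void update_numa_allowed_masks(struct task_struct *p)
{
	struct signal_struct *sig = p->signal;
	struct task_struct *t = p;

	cpumask_setall(&sig->numa_cpus_allowed);
	nodes_setall(sig->numa_mems_allowed);

	rcu_read_lock();
	do {
		cpumask_and(&sig->numa_cpus_allowed,
			    &sig->numa_cpus_allowed, tsk_cpus_allowed(t));
		nodes_and(sig->numa_mems_allowed,
			  sig->numa_mems_allowed, t->mems_allowed);
	} while_each_thread(p, t);
	rcu_read_unlock();
}

/* The balancer then avoids the per-thread loop entirely: */
static bool process_allows_node(struct task_struct *p, int node)
{
	return node_isset(node, p->signal->numa_mems_allowed) &&
	       cpumask_intersects(cpumask_of_node(node),
				  &p->signal->numa_cpus_allowed);
}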


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
  2012-07-07 18:26   ` Rik van Riel
  2012-07-08 18:35   ` Rik van Riel
@ 2012-07-12 22:02   ` Rik van Riel
  2012-07-13 14:45     ` Don Morris
  2 siblings, 1 reply; 153+ messages in thread
From: Rik van Riel @ 2012-07-12 22:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Johannes Weiner, linux-kernel, linux-mm

On 03/16/2012 10:40 AM, Peter Zijlstra wrote:

At LSF/MM, there was a presentation comparing Peter's
NUMA code with Andrea's NUMA code. I believe this is
the main reason why Andrea's code performed better in
that particular test...

> +		if (sched_feat(NUMA_BALANCE_FILTER)) {
> +			/*
> +			 * Avoid moving ne's when we create a larger imbalance
> +			 * on the other end.
> +			 */
> +			if ((imb->type & NUMA_BALANCE_CPU) &&
> +			    imb->cpu - cpu_moved < ne_cpu / 2)
> +				goto next;
> +
> +			/*
> +			 * Avoid migrating ne's when we know we'll push our
> +			 * node over the memory limit.
> +			 */
> +			if (max_mem_load &&
> +			    imb->mem_load + mem_moved + ne_mem > max_mem_load)
> +				goto next;
> +		}

IIRC the test consisted of a 16GB NUMA system with two 8GB nodes.
It was running 3 KVM guests, two guests of 3GB memory each, and
one guest of 6GB.

With autonuma, the 6GB guest ended up on one node, and the
3GB guests on the other.

With sched numa, each node had a 3GB guest, and part of the 6GB guest.

There is a fundamental difference in the balancing between autonuma
and sched numa.

In sched numa, a process is moved over to the current node only if
the current node has space for it.

Autonuma, on the other hand, operates more on a "hostage exchange"
policy, where a thread on one node is exchanged with a thread on
another node, if it looks like that will reduce the overall number
of cross-node NUMA faults in the system.

I am not sure how to do a "hostage exchange" algorithm with
sched numa, but it would seem like it could be necessary in order
for some workloads to converge on a sane configuration.

After all, with only about 2GB free on each node, you will never
get to move either a 3GB guest, or parts of a 6GB guest...

Any ideas?
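
Not an answer, but to make the "hostage exchange" idea concrete in the
terms of this patch set, a rough sketch (all of the helpers below are
invented for the sketch; ne_remote_cost() stands in for whatever
misplacement metric one would use, e.g. remote RSS or remote fault
counts):

/*
 * Sketch only: pick one entity on each side and swap them iff the
 * combined misplacement goes down, so a full node can still trade a
 * 3GB guest for part of a 6GB one even with no free space.
 */
static void numa_exchange(struct node_queue *this_nq,
			  struct node_queue *busiest_nq)
{
	struct numa_entity *here, *there;
	long before, after;

	here  = pick_exchange_candidate(this_nq, busiest_nq->node);
	there = pick_exchange_candidate(busiest_nq, this_nq->node);
	if (!here || !there)
		return;

	before = ne_remote_cost(here,  this_nq->node) +
		 ne_remote_cost(there, busiest_nq->node);
	after  = ne_remote_cost(here,  busiest_nq->node) +
		 ne_remote_cost(there, this_nq->node);

	if (after < before) {
		migrate_entity(here,  busiest_nq);	/* hypothetical */
		migrate_entity(there, this_nq);		/* hypothetical */
	}
}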


* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-12 22:02   ` Rik van Riel
@ 2012-07-13 14:45     ` Don Morris
  2012-07-14 16:20       ` Rik van Riel
  0 siblings, 1 reply; 153+ messages in thread
From: Don Morris @ 2012-07-13 14:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Johannes Weiner,
	linux-kernel, linux-mm

On 07/12/2012 03:02 PM, Rik van Riel wrote:
> On 03/16/2012 10:40 AM, Peter Zijlstra wrote:
> 
> At LSF/MM, there was a presentation comparing Peter's
> NUMA code with Andrea's NUMA code. I believe this is
> the main reason why Andrea's code performed better in
> that particular test...
> 
>> +        if (sched_feat(NUMA_BALANCE_FILTER)) {
>> +            /*
>> +             * Avoid moving ne's when we create a larger imbalance
>> +             * on the other end.
>> +             */
>> +            if ((imb->type & NUMA_BALANCE_CPU) &&
>> +                imb->cpu - cpu_moved < ne_cpu / 2)
>> +                goto next;
>> +
>> +            /*
>> +             * Avoid migrating ne's when we know we'll push our
>> +             * node over the memory limit.
>> +             */
>> +            if (max_mem_load &&
>> +                imb->mem_load + mem_moved + ne_mem > max_mem_load)
>> +                goto next;
>> +        }
> 
> IIRC the test consisted of a 16GB NUMA system with two 8GB nodes.
> It was running 3 KVM guests, two guests of 3GB memory each, and
> one guest of 6GB.

How many cpus per guest (host threads) and how many physical/logical
cpus per node on the host? Any comparisons with a situation where
the memory would fit within nodes but the scheduling load would
be too high?

Don

> 
> With autonuma, the 6GB guest ended up on one node, and the
> 3GB guests on the other.
> 
> With sched numa, each node had a 3GB guest, and part of the 6GB guest.
> 
> There is a fundamental difference in the balancing between autonuma
> and sched numa.
> 
> In sched numa, a process is moved over to the current node only if
> the current node has space for it.
> 
> Autonuma, on the other hand, operates more on a "hostage exchange"
> policy, where a thread on one node is exchanged with a thread on
> another node, if it looks like that will reduce the overall number
> of cross-node NUMA faults in the system.
> 
> I am not sure how to do a "hostage exchange" algorithm with
> sched numa, but it would seem like it could be necessary in order
> for some workloads to converge on a sane configuration.
> 
> After all, with only about 2GB free on each node, you will never
> get to move either a 3GB guest, or parts of a 6GB guest...
> 
> Any ideas?
> 
> 




* Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
  2012-07-13 14:45     ` Don Morris
@ 2012-07-14 16:20       ` Rik van Riel
  0 siblings, 0 replies; 153+ messages in thread
From: Rik van Riel @ 2012-07-14 16:20 UTC (permalink / raw)
  To: Don Morris
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Dan Smith, Bharata B Rao,
	Lee Schermerhorn, Andrea Arcangeli, Johannes Weiner,
	linux-kernel, linux-mm

On 07/13/2012 10:45 AM, Don Morris wrote:

>> IIRC the test consisted of a 16GB NUMA system with two 8GB nodes.
>> It was running 3 KVM guests, two guests of 3GB memory each, and
>> one guest of 6GB.
>
> How many cpus per guest (host threads) and how many physical/logical
> cpus per node on the host? Any comparisons with a situation where
> the memory would fit within nodes but the scheduling load would
> be too high?

IIRC this particular test was constructed to have guests
A and B fit in one NUMA node, with guest C in the other
NUMA node.

With schednuma, guest A ended up on one NUMA node, guest
B on the other, and guest C was spread between both nodes.

Only migrating when there is plenty of free space available
means you can end up not doing the right thing when running
a few large workloads on the system.

-- 
All rights reversed


end of thread, other threads:[~2012-07-14 16:22 UTC | newest]

Thread overview: 153+ messages
2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-07-06 10:32   ` Johannes Weiner
2012-07-06 13:46     ` [tip:sched/core] mm: Fix vmstat names-values off-by-one tip-bot for Johannes Weiner
2012-07-06 14:48     ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Minchan Kim
2012-07-06 15:02       ` Peter Zijlstra
2012-07-06 14:54   ` Kyungmin Park
2012-07-06 15:00     ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
2012-03-23 11:50   ` Mel Gorman
2012-07-06 16:38     ` Rik van Riel
2012-07-06 20:04       ` Lee Schermerhorn
2012-07-06 20:27         ` Rik van Riel
2012-07-09 11:48       ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
2012-07-06 18:40   ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 05/26] mm, mpol: Check for misplaced page Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
2012-04-03 17:32   ` Dan Smith
2012-03-16 14:40 ` [RFC][PATCH 07/26] mm: Handle misplaced anon pages Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
2012-03-16 18:34   ` Christoph Lameter
2012-03-16 21:12     ` Peter Zijlstra
2012-03-19 13:53       ` Christoph Lameter
2012-03-19 14:05         ` Peter Zijlstra
2012-03-19 15:16           ` Christoph Lameter
2012-03-19 15:23             ` Peter Zijlstra
2012-03-19 15:31               ` Christoph Lameter
2012-03-19 17:09                 ` Peter Zijlstra
2012-03-19 17:28                   ` Peter Zijlstra
2012-03-19 19:06                   ` Christoph Lameter
2012-03-19 20:28                   ` Lee Schermerhorn
2012-03-19 21:21                     ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
2012-06-15 18:16   ` Tony Luck
2012-06-20 19:12     ` [PATCH] sched: Fix build problems when CONFIG_NUMA=y and CONFIG_SMP=n Luck, Tony
2012-03-16 14:40 ` [RFC][PATCH 13/26] sched: Implement home-node awareness Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
2012-07-07 18:26   ` Rik van Riel
2012-07-09 12:05     ` Peter Zijlstra
2012-07-09 12:23     ` Peter Zijlstra
2012-07-09 12:40       ` Peter Zijlstra
2012-07-09 14:50         ` Rik van Riel
2012-07-08 18:35   ` Rik van Riel
2012-07-09 12:25     ` Peter Zijlstra
2012-07-09 14:54       ` Rik van Riel
2012-07-12 22:02   ` Rik van Riel
2012-07-13 14:45     ` Don Morris
2012-07-14 16:20       ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
2012-03-19 12:16   ` Srivatsa S. Bhat
2012-03-19 12:19     ` Peter Zijlstra
2012-03-19 12:27       ` Srivatsa S. Bhat
2012-03-16 14:40 ` [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 17/26] srcu: revert1 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 18/26] srcu: revert2 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 19/26] srcu: Implement call_srcu() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 22/26] mm, mpol: Split and explose some mempolicy functions Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
2012-07-08 18:34   ` Rik van Riel
2012-07-09 12:26     ` Peter Zijlstra
2012-07-09 14:53       ` Rik van Riel
2012-07-09 14:55         ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 26/26] sched, numa: A few debug bits Peter Zijlstra
2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
2012-03-19 18:47   ` Peter Zijlstra
2012-03-19 19:02     ` Andrea Arcangeli
2012-03-20 23:41   ` Dan Smith
2012-03-21  1:00     ` Andrea Arcangeli
2012-03-21  2:12     ` Andrea Arcangeli
2012-03-21  4:01       ` Dan Smith
2012-03-21 12:49         ` Andrea Arcangeli
2012-03-21 22:05           ` Dan Smith
2012-03-21 22:52             ` Andrea Arcangeli
2012-03-21 23:13               ` Dan Smith
2012-03-21 23:41                 ` Andrea Arcangeli
2012-03-22  0:17               ` Andrea Arcangeli
2012-03-22 13:58                 ` Dan Smith
2012-03-22 14:27                   ` Andrea Arcangeli
2012-03-22 18:49                     ` Andrea Arcangeli
2012-03-22 18:56                       ` Dan Smith
2012-03-22 19:11                         ` Andrea Arcangeli
2012-03-23 14:15                         ` Andrew Theurer
2012-03-23 16:01                           ` Andrea Arcangeli
2012-03-25 13:30                         ` Andrea Arcangeli
2012-03-21  7:12       ` Ingo Molnar
2012-03-21 12:08         ` Andrea Arcangeli
2012-03-21  7:53     ` Ingo Molnar
2012-03-21 12:17       ` Andrea Arcangeli
2012-03-19  9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
2012-03-19 11:12   ` Peter Zijlstra
2012-03-19 11:30     ` Peter Zijlstra
2012-03-19 11:39     ` Peter Zijlstra
2012-03-19 11:42     ` Avi Kivity
2012-03-19 11:59       ` Peter Zijlstra
2012-03-19 12:07         ` Avi Kivity
2012-03-19 12:09       ` Peter Zijlstra
2012-03-19 12:16         ` Avi Kivity
2012-03-19 20:03           ` Peter Zijlstra
2012-03-20 10:18             ` Avi Kivity
2012-03-20 10:48               ` Peter Zijlstra
2012-03-20 10:52                 ` Avi Kivity
2012-03-20 11:07                   ` Peter Zijlstra
2012-03-20 11:48                     ` Avi Kivity
2012-03-19 12:20       ` Peter Zijlstra
2012-03-19 12:24         ` Avi Kivity
2012-03-19 15:44           ` Avi Kivity
2012-03-19 13:40       ` Andrea Arcangeli
2012-03-19 20:06         ` Peter Zijlstra
2012-03-19 13:04     ` Andrea Arcangeli
2012-03-19 13:26       ` Peter Zijlstra
2012-03-19 13:57         ` Andrea Arcangeli
2012-03-19 14:06           ` Avi Kivity
2012-03-19 14:30             ` Andrea Arcangeli
2012-03-19 18:42               ` Peter Zijlstra
2012-03-20 22:18                 ` Rik van Riel
2012-03-21 16:50                   ` Andrea Arcangeli
2012-04-02 16:34                   ` Pekka Enberg
2012-04-02 16:55                     ` Rik van Riel
2012-04-02 16:54                       ` Pekka Enberg
2012-04-02 17:12                         ` Pekka Enberg
2012-04-02 17:23                           ` Pekka Enberg
2012-03-19 14:07           ` Peter Zijlstra
2012-03-19 14:34             ` Andrea Arcangeli
2012-03-19 18:41               ` Peter Zijlstra
2012-03-19 19:13           ` Peter Zijlstra
2012-03-19 14:07         ` Andrea Arcangeli
2012-03-19 19:05           ` Peter Zijlstra
2012-03-19 13:26       ` Peter Zijlstra
2012-03-19 14:16         ` Andrea Arcangeli
2012-03-19 13:29       ` Peter Zijlstra
2012-03-19 14:19         ` Andrea Arcangeli
2012-03-19 13:39       ` Peter Zijlstra
2012-03-19 14:20         ` Andrea Arcangeli
2012-03-19 20:17           ` Christoph Lameter
2012-03-19 20:28             ` Ingo Molnar
2012-03-19 20:43               ` Christoph Lameter
2012-03-19 21:34                 ` Ingo Molnar
2012-03-20  0:05               ` Linus Torvalds
2012-03-20  7:31                 ` Ingo Molnar
2012-03-21 22:53 ` Nish Aravamudan
2012-03-22  9:45   ` Peter Zijlstra
2012-03-22 10:34     ` Ingo Molnar
2012-03-24  1:41     ` Nish Aravamudan
2012-03-26 11:42       ` Peter Zijlstra
