* [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration
@ 2013-05-03  0:00 Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 01/31] rbtree: add postorder iteration functions Cody P Schafer
                   ` (30 more replies)
  0 siblings, 31 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

There is still some cleanup to do, so I'm just posting these early and often. I
would appreciate some review to make sure that the basic approaches are sound,
and especially suggestions regarding TODO #1 below. If everything goes well,
I'd like to see them merged circa 3.12.

--

These patches allow the NUMA memory layout (the mapping from physical pages to
NUMA nodes, i.e. which node each physical page belongs to) to be changed at
runtime, in place (without hotplugging).

TODO:

 1) Currently, I use pageflag setters without "owning" the pages, which could
   cause pageflag updates to be lost when combined with non-atomic pageflag
   users in mm/*. Some options for solving this: (a) make all pageflag accesses
   atomic, (b) use pageblock flags, (c) use bits in a new bitmap, or (d) attempt
   to work around the races in a similar way to memory-failure.

 2) Fix inaccurate accounting in some cases. Example:
	## "5 MB" should be ~"42 MB"
	# dnuma-test s 0 2 128
	# numactl --hardware
	available: 4 nodes (0-3)
	node 0 cpus: 0
	node 0 size: 43 MB
	node 0 free: 11 MB
	node 1 cpus:
	node 1 size: 5 MB
	node 1 free: 34 MB
	node 2 cpus:
	node 2 size: 42 MB
	node 2 free: 28 MB
	node 3 cpus:
	node 3 size: 0 MB
	node 3 free: 0 MB

= Why/when is this useful? =

This is useful for virtual machines (VMs) running on NUMA systems, both [a]
if/when the hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] in general for
migration of VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems whose firmware currently moves the backing memory around,
and has the ability to notify Linux of new NUMA info.

= How are you managing to do this? =

Reconfiguration of page->node mappings is done at the page allocator
level by both pulling free pages out of the free lists (when a new memory
layout is committed) & redirecting pages on free to their new node.

Because we can't change page_node(A) while A is allocated [1], an rbtree
holding the mapping from pfn ranges to node ids ('struct memlayout')
is introduced to track the pfn->node mapping for
yet-to-be-transplanted pages. A lookup in this rbtree occurs on any
page allocator path that decides which zone to free a page to.

To avoid paying for an rbtree lookup on every page free, the
rbtree is only consulted when the page is marked with a new pageflag
(LookupNode).

[1]: quite a few users of page_node() depend on it not changing; some
accumulate per-node stats using it. We'd also have to change it via atomic
operations to avoid disturbing the pageflags which share the same unsigned
long.
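
To make the free-path behaviour concrete, here is a rough sketch (illustrative
only: free_one_page_dnuma() is a made-up wrapper; in the actual patches the
check is folded into __free_pages_ok(), free_hot_cold_page() and the pcp drain
path, using the helpers introduced by the memlayout/dnuma patch):

	static void free_one_page_dnuma(struct page *page, unsigned int order)
	{
		/* Consults the memlayout rbtree only if PG_lookup_node is
		 * set; returns NUMA_NO_NODE when no transplant is needed. */
		int dest_nid = dnuma_page_needs_move(page);

		if (dest_nid != NUMA_NO_NODE) {
			struct zone *dest = nid_zone(dest_nid,
						     page_zonenum(page));

			/* grow the destination zone/pgdat span & set the new
			 * node in page->flags before handing the page over */
			dnuma_prior_free_to_new_zone(page, order, dest,
						     dest_nid);
			return_pages_to_zone(page, order, dest);
			dnuma_post_free_to_new_zone(page, order);
			return;
		}

		/* otherwise, the normal free path */
		__free_pages_ok(page, order);
	}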

= Code & testing =

A debugfs interface allows the NUMA memory layout to be changed. You don't
need exotic hardware to test this; in fact, I've done all my testing so far
in plain old qemu-i386 & qemu-x86_64.

A script which stripes the memory between nodes or pushes all memory to a
(potentially new) node is available here:

	https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test

The patches are also available via:

	https://github.com/jmesmon/linux.git dnuma/v36

	2325de5^..e0f8f35

= Current Limitations =

For the reconfiguration to be effective (and to keep the allocator from making
poorer choices), the cpu->node mappings also need to be updated. This patchset
does _not_ handle that. Also missing is a way to update the topology (node
distances), which is slightly less fatal.

These patches only work on SPARSEMEM and the node id _must_ fit in the pageflags
(can't be pushed out to the section). This generally means that 32-bit
platforms are out (unless you hack MAX_PHYS{ADDR,MEM}_BITS).

This code does the reconfiguration without hotplugging memory at all (1
errant page doesn't keep us from fixing the rest of them). But it still
depends on MEMORY_HOTPLUG for functions that online nodes & adjust
zone/pgdat size.

Things that need doing or would be nice to have but aren't bugs:

 - While the interface is meant to be driven via a hypervisor/firmware, that
   portion is not yet included.
 - notifier for kernel users of memory that need/want their allocations on a
   particular node (NODE_DATA(), for instance).
 - notifier for userspace.
 - a way to allocate things from the appropriate node prior to the page
   allocator being fully updated (could just be "allocate it wrong now &
   reallocate later").
 - Make memlayout faster (potentially via per-node allocation, different data
   structure, and/or more/smarter caching).
 - Propagation of updated layout knowledge into kmem_caches (SL*B).

--

Since v2: http://comments.gmane.org/gmane.linux.kernel.mm/98371

 - update sysfs node->memory region mappings when a reconfiguration occurs
 - fixup locking when updating node_spanned_pages.
 - rework memlayout api for use by sysfs refresh code
 - remove holes from memlayouts to make iterating over them less of a chore.

Since v1: http://comments.gmane.org/gmane.linux.kernel.mm/95541

 - Update watermarks.
 - Update zone percpu pageset ->batch & ->high only when needed.
 - Don't lazily adjust {pgdat,zone}->{present_pages,managed_pages}, set them all at once.
 - Don't attempt to use more than nr_node_ids nodes.

--


Cody P Schafer (31):
  rbtree: add postorder iteration functions.
  rbtree: add rbtree_postorder_for_each_entry_safe() helper.
  mm/memory_hotplug: factor out zone+pgdat growth.
  memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h
  mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones &
    pgdats
  mm: add nid_zone() helper
  mm: Add Dynamic NUMA Kconfig.
  page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.
  page_alloc: in move_freepages(), skip pages instead of VM_BUG on node
    differences.
  page_alloc: when dynamic numa is enabled, don't check that all pages
    in a block belong to the same zone
  page-flags dnuma: reserve a pageflag for determining if a page needs a
    node lookup.
  memory_hotplug: factor out locks in mem_online_node()
  mm: add memlayout & dnuma to track pfn->nid & transplant pages between
    nodes
  mm: memlayout+dnuma: add debugfs interface
  drivers/base/memory.c: alphabetize headers.
  drivers/base/node,memory: rename function to match interface
  drivers/base/node: rename unregister_mem_blk_under_nodes() to be more
    accurate
  drivers/base/node: add unregister_mem_block_under_nodes()
  mm: memory,memlayout: add refresh_memory_blocks() for Dynamic NUMA.
  page_alloc: use dnuma to transplant newly freed pages in
    __free_pages_ok()
  page_alloc: use dnuma to transplant newly freed pages in
    free_hot_cold_page()
  page_alloc: transplant pages that are being flushed from the per-cpu
    lists
  x86: memlayout: add an arch-specific initial memlayout setter.
  init/main: call memlayout_global_init() in start_kernel().
  dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug
  x86/mm/numa: when dnuma is enabled, use memlayout to handle memory
    hotplug's physaddr_to_nid.
  mm/memory_hotplug: VM_BUG if nid is too large.
  mm/page_alloc: in page_outside_zone_boundaries(), avoid premature
    decisions.
  mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more
    useful
  mm/page_alloc: use manage_pages instead of present pages when
    calculating default_zonelist_order()
  mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids
    above the minimum by a percentage.

 Documentation/kernel-parameters.txt |   6 +
 arch/x86/mm/numa.c                  |  32 ++-
 drivers/base/memory.c               |  55 ++++-
 drivers/base/node.c                 |  70 ++++--
 include/linux/dnuma.h               |  97 ++++++++
 include/linux/memlayout.h           | 134 +++++++++++
 include/linux/memory.h              |   5 +
 include/linux/memory_hotplug.h      |   4 +
 include/linux/mm.h                  |   7 +-
 include/linux/node.h                |  20 +-
 include/linux/page-flags.h          |  19 ++
 include/linux/rbtree.h              |  12 +
 init/main.c                         |   2 +
 lib/rbtree.c                        |  40 ++++
 mm/Kconfig                          |  54 +++++
 mm/Makefile                         |   2 +
 mm/dnuma.c                          | 432 ++++++++++++++++++++++++++++++++++++
 mm/internal.h                       |  13 +-
 mm/memlayout-debugfs.c              | 339 ++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h              |  39 ++++
 mm/memlayout.c                      | 354 +++++++++++++++++++++++++++++
 mm/memory_hotplug.c                 |  54 +++--
 mm/page_alloc.c                     | 154 +++++++++++--
 23 files changed, 1866 insertions(+), 78 deletions(-)
 create mode 100644 include/linux/dnuma.h
 create mode 100644 include/linux/memlayout.h
 create mode 100644 mm/dnuma.c
 create mode 100644 mm/memlayout-debugfs.c
 create mode 100644 mm/memlayout-debugfs.h
 create mode 100644 mm/memlayout.c

-- 
1.8.2.2



* [RFC PATCH v3 01/31] rbtree: add postorder iteration functions.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 02/31] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Add postorder iteration functions for rbtree. These are useful for
safely freeing an entire rbtree without modifying the tree at all.
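
For example (illustrative only: 'struct thing' and its rb_node member 'node'
are made up), an entire tree can be freed with:

	struct rb_node *n = rb_first_postorder(&root);

	while (n) {
		struct thing *t = rb_entry(n, struct thing, node);

		/* fetch the next node before freeing the current one */
		n = rb_next_postorder(n);
		kfree(t);
	}

Since a parent is always visited after both of its children, no node is
touched after it has been freed and the tree is never rebalanced during the
walk. The next patch wraps this pattern in a
rbtree_postorder_for_each_entry_safe() helper.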

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h |  4 ++++
 lib/rbtree.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 0022c1b..2879e96 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
 extern struct rb_node *rb_first(const struct rb_root *);
 extern struct rb_node *rb_last(const struct rb_root *);
 
+/* Postorder iteration - always visit the parent after its children */
+extern struct rb_node *rb_first_postorder(const struct rb_root *);
+extern struct rb_node *rb_next_postorder(const struct rb_node *);
+
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
 extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
 			    struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index c0e31fe..65f4eff 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new,
 	*new = *victim;
 }
 EXPORT_SYMBOL(rb_replace_node);
+
+static struct rb_node *rb_left_deepest_node(const struct rb_node *node)
+{
+	for (;;) {
+		if (node->rb_left)
+			node = node->rb_left;
+		else if (node->rb_right)
+			node = node->rb_right;
+		else
+			return (struct rb_node *)node;
+	}
+}
+
+struct rb_node *rb_next_postorder(const struct rb_node *node)
+{
+	const struct rb_node *parent;
+	if (!node)
+		return NULL;
+	parent = rb_parent(node);
+
+	/* If we're sitting on node, we've already seen our children */
+	if (parent && node == parent->rb_left && parent->rb_right) {
+		/* If we are the parent's left node, go to the parent's right
+		 * node then all the way down to the left */
+		return rb_left_deepest_node(parent->rb_right);
+	} else
+		/* Otherwise we are the parent's right node, and the parent
+		 * should be next */
+		return (struct rb_node *)parent;
+}
+EXPORT_SYMBOL(rb_next_postorder);
+
+struct rb_node *rb_first_postorder(const struct rb_root *root)
+{
+	if (!root->rb_node)
+		return NULL;
+
+	return rb_left_deepest_node(root->rb_node);
+}
+EXPORT_SYMBOL(rb_first_postorder);
-- 
1.8.2.2



* [RFC PATCH v3 02/31] rbtree: add rbtree_postorder_for_each_entry_safe() helper.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 01/31] rbtree: add postorder iteration functions Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 03/31] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 2879e96..1b239ca 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -85,4 +85,12 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
 	*rb_link = node;
 }
 
+#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \
+	for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\
+	      n = rb_entry(rb_next_postorder(&pos->field), \
+		      typeof(*pos), field); \
+	     &pos->field; \
+	     pos = n, \
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field))
+
 #endif	/* _LINUX_RBTREE_H */
-- 
1.8.2.2



* [RFC PATCH v3 03/31] mm/memory_hotplug: factor out zone+pgdat growth.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 01/31] rbtree: add postorder iteration functions Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 02/31] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 04/31] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Create a new function, grow_pgdat_and_zone(), which handles locking +
growth of a zone & the pgdat it is associated with.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            | 17 +++++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3e622c6..501e9f0 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -78,6 +78,9 @@ static inline void zone_seqlock_init(struct zone *zone)
 {
 	seqlock_init(&zone->span_seqlock);
 }
+extern void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+				unsigned long end_pfn);
+
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a221fac..fafeaae 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -390,13 +390,22 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 					pgdat->node_start_pfn;
 }
 
+void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	unsigned long flags;
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	grow_zone_span(zone, start_pfn, end_pfn);
+	grow_pgdat_span(zone->zone_pgdat, start_pfn, end_pfn);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nr_pages = PAGES_PER_SECTION;
 	int nid = pgdat->node_id;
 	int zone_type;
-	unsigned long flags;
 	int ret;
 
 	zone_type = zone - pgdat->node_zones;
@@ -404,11 +413,7 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 	if (ret)
 		return ret;
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
-	grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
-			phys_start_pfn + nr_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	grow_pgdat_and_zone(zone, phys_start_pfn, phys_start_pfn + nr_pages);
 	memmap_init_zone(nr_pages, nid, zone_type,
 			 phys_start_pfn, MEMMAP_HOTPLUG);
 	return 0;
-- 
1.8.2.2



* [RFC PATCH v3 04/31] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (2 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 03/31] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 05/31] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Export ensure_zone_is_initialized() so that it can be used to initialize
new zones within the dynamic numa code.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h       | 8 ++++++++
 mm/memory_hotplug.c | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..b11e574 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,14 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * in mm/memory_hotplug.c
+ */
+extern int ensure_zone_is_initialized(struct zone *zone,
+			unsigned long start_pfn, unsigned long num_pages);
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fafeaae..f4cb01a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -284,7 +284,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
 
 /* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
  * alloc_bootmem_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
+int __ref ensure_zone_is_initialized(struct zone *zone,
 			unsigned long start_pfn, unsigned long num_pages)
 {
 	if (!zone_is_initialized(zone))
-- 
1.8.2.2



* [RFC PATCH v3 05/31] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (3 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 04/31] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 06/31] mm: add nid_zone() helper Cody P Schafer
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Use the *_is_empty() helpers to be more clear about what we're actually
checking for.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f4cb01a..a65235f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -242,7 +242,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writelock(zone);
 
 	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	if (!zone->spanned_pages || start_pfn < zone->zone_start_pfn)
+	if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
 		zone->zone_start_pfn = start_pfn;
 
 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
@@ -383,7 +383,7 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 	unsigned long old_pgdat_end_pfn =
 		pgdat->node_start_pfn + pgdat->node_spanned_pages;
 
-	if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
+	if (pgdat_is_empty(pgdat) || start_pfn < pgdat->node_start_pfn)
 		pgdat->node_start_pfn = start_pfn;
 
 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
-- 
1.8.2.2



* [RFC PATCH v3 06/31] mm: add nid_zone() helper
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (4 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 05/31] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 07/31] mm: Add Dynamic NUMA Kconfig Cody P Schafer
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Add nid_zone(), which returns the zone corresponding to a given nid & zonenum.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/mm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a7f19e..2004713 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -711,9 +711,14 @@ static inline void page_nid_reset_last(struct page *page)
 }
 #endif
 
+static inline struct zone *nid_zone(int nid, enum zone_type zonenum)
+{
+	return &NODE_DATA(nid)->node_zones[zonenum];
+}
+
 static inline struct zone *page_zone(const struct page *page)
 {
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+	return nid_zone(page_to_nid(page), page_zonenum(page));
 }
 
 #ifdef SECTION_IN_PAGE_FLAGS
-- 
1.8.2.2



* [RFC PATCH v3 07/31] mm: Add Dynamic NUMA Kconfig.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (5 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 06/31] mm: add nid_zone() helper Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 08/31] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

We need to add some functionality for use by Dynamic NUMA to pieces of
mm/, so provide the Kconfig prior to adding actual Dynamic NUMA
functionality. For details on Dynamic NUMA, see the later patch (which
adds baseline functionality):

 "mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes"
---
 mm/Kconfig | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index e742d06..bfbe300 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,6 +169,30 @@ config MOVABLE_NODE
 config HAVE_BOOTMEM_INFO_NODE
 	def_bool n
 
+config DYNAMIC_NUMA
+	bool "Dynamic NUMA: Allow NUMA layout to change after boot time"
+	depends on NUMA
+	depends on !DISCONTIGMEM
+	depends on MEMORY_HOTPLUG # locking + mem_online_node().
+	help
+	 Dynamic NUMA (DNUMA) allows the movement of pages between NUMA nodes at
+	 run time.
+
+	 Typically, this is used on systems running under a hypervisor which
+	 may move the running VM based on the hypervisor's needs. On such a
+	 system, this config option enables Linux to update its knowledge of
+	 the memory layout.
+
+	 If the feature is enabled but not used, a very small amount of
+	 overhead (an additional pageflag check) is added to all page frees.
+
+	 This is only useful if you enable some of the additional options to
+	 allow modifications of the NUMA memory layout (either through hypervisor events
+	 or a userspace interface).
+
+	 Choose Y if you are running Linux under a hypervisor that uses
+	 this feature; otherwise, or if unsure, choose N.
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
-- 
1.8.2.2



* [RFC PATCH v3 08/31] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (6 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 07/31] mm: Add Dynamic NUMA Kconfig Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 09/31] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Add return_pages_to_zone(), a minimized version of __free_pages_ok()
which handles adding pages that have been removed from another zone
into a new zone.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h   |  5 ++++-
 mm/page_alloc.c | 17 +++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index b11e574..a70c77b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -104,6 +104,10 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone);
+#endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
@@ -114,7 +118,6 @@ extern int ensure_zone_is_initialized(struct zone *zone,
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-
 /*
  * in mm/compaction.c
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 98cbdf6..739b405 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -443,6 +443,12 @@ static inline void set_page_order(struct page *page, int order)
 	__SetPageBuddy(page);
 }
 
+static inline void set_free_page_order(struct page *page, int order)
+{
+	set_page_private(page, order);
+	VM_BUG_ON(!PageBuddy(page));
+}
+
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
@@ -739,6 +745,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	local_irq_restore(flags);
 }
 
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone)
+{
+	unsigned long flags;
+	local_irq_save(flags);
+	free_one_page(zone, page, order, get_freepage_migratetype(page));
+	local_irq_restore(flags);
+}
+#endif
+
 /*
  * Read access to zone->managed_pages is safe because it's unsigned long,
  * but we still need to serialize writers. Currently all callers of
-- 
1.8.2.2



* [RFC PATCH v3 09/31] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (7 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 08/31] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 10/31] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

With dynamic numa, pages are going to be gradually moved from one node to
another, causing the page ranges that move_freepages() examines to
contain pages that actually belong to another node.

When dynamic numa is enabled, we skip these pages instead of VM_BUGing
out on them.

This additionally moves the VM_BUG_ON() (which detects a change in node)
so that it follows the pfn_valid_within() check.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 739b405..657f773 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -958,6 +958,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	int pages_moved = 0;
+	int zone_nid = zone_to_nid(zone);
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -971,14 +972,24 @@ int move_freepages(struct zone *zone,
 #endif
 
 	for (page = start_page; page <= end_page;) {
-		/* Make sure we are not inadvertently changing nodes */
-		VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
-
 		if (!pfn_valid_within(page_to_pfn(page))) {
 			page++;
 			continue;
 		}
 
+		if (page_to_nid(page) != zone_nid) {
+#ifndef CONFIG_DYNAMIC_NUMA
+			/*
+			 * In the normal case (without Dynamic NUMA), all pages
+			 * in a pageblock should belong to the same zone (and
+			 * as a result all have the same nid).
+			 */
+			VM_BUG_ON(page_to_nid(page) != zone_nid);
+#endif
+			page++;
+			continue;
+		}
+
 		if (!PageBuddy(page)) {
 			page++;
 			continue;
-- 
1.8.2.2



* [RFC PATCH v3 10/31] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (8 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 09/31] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 11/31] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

When dynamic numa is enabled, the last or first page in a pageblock may
have been transplanted to a new zone (or may not yet be transplanted to
a new zone).

Disable a BUG_ON() which checks that the start_page and end_page are in
the same zone; if they are not in the proper zone, they will simply be
skipped.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 657f773..9de55a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -960,13 +960,16 @@ int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 	int zone_nid = zone_to_nid(zone);
 
-#ifndef CONFIG_HOLES_IN_ZONE
+#if !defined(CONFIG_HOLES_IN_ZONE) && !defined(CONFIG_DYNAMIC_NUMA)
 	/*
-	 * page_zone is not safe to call in this context when
-	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
-	 * anyway as we check zone boundaries in move_freepages_block().
-	 * Remove at a later date when no bug reports exist related to
-	 * grouping pages by mobility
+	 * With CONFIG_HOLES_IN_ZONE set, this check is unsafe as start_page or
+	 * end_page may not be "valid".
+	 * With CONFIG_DYNAMIC_NUMA set, this condition is a valid occurrence &
+	 * not a bug.
+	 *
+	 * This bug check is probably redundant anyway as we check zone
+	 * boundaries in move_freepages_block().  Remove at a later date when
+	 * no bug reports exist related to grouping pages by mobility
 	 */
 	BUG_ON(page_zone(start_page) != page_zone(end_page));
 #endif
-- 
1.8.2.2



* [RFC PATCH v3 11/31] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (9 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 10/31] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 12/31] memory_hotplug: factor out locks in mem_online_node() Cody P Schafer
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Add a pageflag called "lookup_node"/ PG_lookup_node / Page*LookupNode().

Used by dynamic numa to indicate when a page has a new node assignment
waiting for it.

FIXME: This also exempts PG_lookup_node from PAGE_FLAGS_CHECK_AT_PREP
due to the asynchronous usage of PG_lookup_node, which needs to be
avoided.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/page-flags.h | 19 +++++++++++++++++++
 mm/page_alloc.c            |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..09dd94e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,9 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	PG_lookup_node,		/* extra lookup required to find real node */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -275,6 +278,17 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+/* Setting is unconditional and simply leads to an extra lookup.
+ * Clearing must be conditional so we don't miss any memlayout changes.
+ */
+#ifdef CONFIG_DYNAMIC_NUMA
+PAGEFLAG(LookupNode, lookup_node)
+TESTCLEARFLAG(LookupNode, lookup_node)
+#else
+PAGEFLAG_FALSE(LookupNode)
+TESTCLEARFLAG_FALSE(LookupNode)
+#endif
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
@@ -509,7 +523,12 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
  * Pages being prepped should not have any flags set.  It they are set,
  * there has been a kernel bug or struct page corruption.
  */
+#ifndef CONFIG_DYNAMIC_NUMA
 #define PAGE_FLAGS_CHECK_AT_PREP	((1 << NR_PAGEFLAGS) - 1)
+#else
+#define PAGE_FLAGS_CHECK_AT_PREP	(((1 << NR_PAGEFLAGS) - 1) \
+						& ~(1 << PG_lookup_node))
+#endif
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1 << PG_private | 1 << PG_private_2)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9de55a2..ea4fda8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6238,6 +6238,9 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	{1UL << PG_lookup_node,		"lookup_node"   },
+#endif
 };
 
 static void dump_page_flags(unsigned long flags)
-- 
1.8.2.2



* [RFC PATCH v3 12/31] memory_hotplug: factor out locks in mem_online_node()
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (10 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 11/31] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 13/31] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

In dynamic numa, when onlining nodes, lock_memory_hotplug() is already
held when mem_online_node()'s functionality is needed.

Factor out the locking and create a new function __mem_online_node() to
allow reuse.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 501e9f0..1ad85c6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -248,6 +248,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
 static inline void try_offline_node(int nid) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+extern int __mem_online_node(int nid);
 extern int mem_online_node(int nid);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a65235f..8e6658d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1066,26 +1066,29 @@ static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
 	return;
 }
 
-
-/*
- * called by cpu_up() to online a node without onlined memory.
- */
-int mem_online_node(int nid)
+int __mem_online_node(int nid)
 {
-	pg_data_t	*pgdat;
-	int	ret;
+	pg_data_t *pgdat;
+	int ret;
 
-	lock_memory_hotplug();
 	pgdat = hotadd_new_pgdat(nid, 0);
-	if (!pgdat) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!pgdat)
+		return -ENOMEM;
+
 	node_set_online(nid);
 	ret = register_one_node(nid);
 	BUG_ON(ret);
+	return ret;
+}
 
-out:
+/*
+ * called by cpu_up() to online a node without onlined memory.
+ */
+int mem_online_node(int nid)
+{
+	int ret;
+	lock_memory_hotplug();
+	ret = __mem_online_node(nid);
 	unlock_memory_hotplug();
 	return ret;
 }
-- 
1.8.2.2



* [RFC PATCH v3 13/31] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (11 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 12/31] memory_hotplug: factor out locks in mem_online_node() Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 14/31] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

On some systems, the hypervisor can (and will) relocate physical
addresses as seen in a VM between real NUMA nodes. For example, IBM
Power systems running particular revisions of PHYP (IBM's proprietary
hypervisor) do this.

This change set introduces the infrastructure for tracking & dynamically
changing "memory layouts" (or "memlayouts"): the mapping between page
ranges & the actual backing NUMA node.

A memlayout is stored as an rbtree which maps pfns (really, ranges of
pfns) to a node. This mapping (combined with the LookupNode pageflag) is
used to "transplant" (move pages between nodes) pages when they are
freed back to the page allocator.

Additionally, when a new memlayout is committed, the currently free pages
that now sit on the 'wrong' zone's freelist are immediately transplanted.

Hooks that tie this into the page allocator to actually perform the
"transplant on free" are in later patches.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |  97 +++++++++++
 include/linux/memlayout.h | 127 ++++++++++++++
 mm/Makefile               |   1 +
 mm/dnuma.c                | 430 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout.c            | 322 ++++++++++++++++++++++++++++++++++
 5 files changed, 977 insertions(+)
 create mode 100644 include/linux/dnuma.h
 create mode 100644 include/linux/memlayout.h
 create mode 100644 mm/dnuma.c
 create mode 100644 mm/memlayout.c

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
new file mode 100644
index 0000000..029a984
--- /dev/null
+++ b/include/linux/dnuma.h
@@ -0,0 +1,97 @@
+#ifndef LINUX_DNUMA_H_
+#define LINUX_DNUMA_H_
+
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/memlayout.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+
+#ifdef CONFIG_DYNAMIC_NUMA
+/* Must be called _before_ setting a new_ml to the pfn_to_node_map */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml);
+
+/* Must be called _after_ setting a new_ml to the pfn_to_node_map */
+void dnuma_move_free_pages(struct memlayout *new_ml);
+void dnuma_mark_page_range(struct memlayout *new_ml);
+
+static inline bool dnuma_is_active(void)
+{
+	struct memlayout *ml;
+	bool ret;
+
+	rcu_read_lock();
+	ml = rcu_dereference(pfn_to_node_map);
+	ret = ml && (ml->type != ML_INITIAL);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool dnuma_has_memlayout(void)
+{
+	return !!rcu_access_pointer(pfn_to_node_map);
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+	int new_nid, old_nid;
+
+	if (!TestClearPageLookupNode(page))
+		return NUMA_NO_NODE;
+
+	/* FIXME: this does rcu_lock, deref, unlock */
+	if (WARN_ON(!dnuma_is_active()))
+		return NUMA_NO_NODE;
+
+	/* FIXME: and so does this (rcu lock, deref, and unlock) */
+	new_nid = memlayout_pfn_to_nid(page_to_pfn(page));
+	old_nid = page_to_nid(page);
+
+	if (new_nid == NUMA_NO_NODE) {
+		pr_alert("dnuma: pfn %05lx has moved from node %d to a non-memlayout range.\n",
+				page_to_pfn(page), old_nid);
+		return NUMA_NO_NODE;
+	}
+
+	if (new_nid == old_nid)
+		return NUMA_NO_NODE;
+
+	if (WARN_ON(!zone_is_initialized(
+			nid_zone(new_nid, page_zonenum(page)))))
+		return NUMA_NO_NODE;
+
+	return new_nid;
+}
+
+void dnuma_post_free_to_new_zone(struct page *page, int order);
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+				  struct zone *dest_zone,
+				  int dest_nid);
+
+#else /* !defined CONFIG_DYNAMIC_NUMA */
+
+static inline bool dnuma_is_active(void)
+{
+	return false;
+}
+
+static inline void dnuma_prior_free_to_new_zone(struct page *page, int order,
+						struct zone *dest_zone,
+						int dest_nid)
+{
+	BUG();
+}
+
+static inline void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+	BUG();
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+	return NUMA_NO_NODE;
+}
+#endif /* !defined CONFIG_DYNAMIC_NUMA */
+
+#endif /* defined LINUX_DNUMA_H_ */
diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
new file mode 100644
index 0000000..adab685
--- /dev/null
+++ b/include/linux/memlayout.h
@@ -0,0 +1,127 @@
+#ifndef LINUX_MEMLAYOUT_H_
+#define LINUX_MEMLAYOUT_H_
+
+#include <linux/memblock.h> /* __init_memblock */
+#include <linux/mm.h>       /* NODE_DATA, page_zonenum */
+#include <linux/mmzone.h>   /* pfn_to_nid */
+#include <linux/rbtree.h>
+#include <linux/types.h>    /* size_t */
+
+#ifdef CONFIG_DYNAMIC_NUMA
+# ifdef NODE_NOT_IN_PAGE_FLAGS
+#  error "CONFIG_DYNAMIC_NUMA requires the node id to be stored in page flags. Try freeing up some flags by decreasing the maximum number of NUMA nodes, or switch to sparsemem-vmemmap"
+# endif
+
+enum memlayout_type {
+	ML_INITIAL,
+	ML_USER_DEBUG,
+	ML_NUM_TYPES
+};
+
+struct rangemap_entry {
+	struct rb_node node;
+	unsigned long pfn_start;
+	/* @pfn_end: inclusive, not stored as a count to make the lookup
+	 *           faster
+	 */
+	unsigned long pfn_end;
+	int nid;
+};
+
+#define RME_FMT "{%05lx-%05lx}:%d"
+#define RME_EXP(rme) rme->pfn_start, rme->pfn_end, rme->nid
+
+struct memlayout {
+	/*
+	 * - contains rangemap_entrys.
+	 * - assumes no 'ranges' overlap.
+	 */
+	struct rb_root root;
+	enum memlayout_type type;
+
+	/*
+	 * When a memlayout is committed, 'cache' is accessed (the field is read
+	 * from & written to) by multiple tasks without additional locking
+	 * (other than the rcu locking for accessing the memlayout).
+	 *
+	 * Do not assume that it will not change. Use ACCESS_ONCE() to avoid
+	 * potential races.
+	 */
+	struct rangemap_entry *cache;
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+	unsigned seq;
+	struct dentry *d;
+#endif
+};
+
+extern __rcu struct memlayout *pfn_to_node_map;
+
+/* FIXME: overflow potential in completion check */
+#define ml_for_each_pfn_in_range(rme, pfn)	\
+	for (pfn = rme->pfn_start;		\
+	     pfn <= rme->pfn_end || pfn < rme->pfn_start; \
+	     pfn++)
+
+static inline bool rme_bounds_pfn(struct rangemap_entry *rme, unsigned long pfn)
+{
+	return rme->pfn_start <= pfn && pfn <= rme->pfn_end;
+}
+
+static inline struct rangemap_entry *rme_next(struct rangemap_entry *rme)
+{
+	struct rb_node *node = rb_next(&rme->node);
+	if (!node)
+		return NULL;
+	return rb_entry(node, typeof(*rme), node);
+}
+
+static inline struct rangemap_entry *rme_first(struct memlayout *ml)
+{
+	struct rb_node *node = rb_first(&ml->root);
+	if (!node)
+		return NULL;
+	return rb_entry(node, struct rangemap_entry, node);
+}
+
+#define ml_for_each_range(ml, rme) \
+	for (rme = rme_first(ml);	\
+	     &rme->node;		\
+	     rme = rme_next(rme))
+
+struct memlayout *memlayout_create(enum memlayout_type);
+void memlayout_destroy(struct memlayout *ml);
+
+int memlayout_new_range(struct memlayout *ml,
+		unsigned long pfn_start, unsigned long pfn_end, int nid);
+int memlayout_pfn_to_nid(unsigned long pfn);
+struct rangemap_entry *memlayout_pfn_to_rme_higher(struct memlayout *ml, unsigned long pfn);
+
+/*
+ * Put ranges added by memlayout_new_range() into use by
+ * memlayout_pfn_to_nid() and retire the old memlayout.
+ *
+ * No modifications to a memlayout should be made after it is committed.
+ */
+void memlayout_commit(struct memlayout *ml);
+
+/*
+ * Sets up an initial memlayout in early boot.
+ * A weak default which uses memblock is provided.
+ */
+void memlayout_global_init(void);
+
+#else /* !defined(CONFIG_DYNAMIC_NUMA) */
+
+/* memlayout_new_range() & memlayout_commit() are purposefully omitted */
+
+static inline void memlayout_global_init(void)
+{}
+
+static inline int memlayout_pfn_to_nid(unsigned long pfn)
+{
+	return NUMA_NO_NODE;
+}
+#endif /* !defined(CONFIG_DYNAMIC_NUMA) */
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index 72c5acb..c538e1e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
+obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
new file mode 100644
index 0000000..2b6e13e
--- /dev/null
+++ b/mm/dnuma.c
@@ -0,0 +1,430 @@
+#define pr_fmt(fmt) "dnuma: " fmt
+
+#include <linux/atomic.h>
+#include <linux/bootmem.h>
+#include <linux/dnuma.h>
+#include <linux/memory.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "internal.h"
+
+/* - must be called under lock_memory_hotplug() */
+/* TODO: avoid iterating over all PFNs. */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml)
+{
+	struct rangemap_entry *rme;
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn;
+		int nid = rme->nid;
+
+		if (!node_online(nid)) {
+			/* Consult hotadd_new_pgdat() */
+			__mem_online_node(nid);
+
+			/* XXX: we aren't really onlining memory, but some code
+			 * uses memory online notifications to tell if new
+			 * nodes have been created.
+			 *
+			 * Also note that the notifiers expect to be able to do
+			 * allocations, ie we must allow for might_sleep() */
+			{
+				int ret;
+
+				/* memory_notify() expects:
+				 *	- to add pages at the same time
+				 *	- to add zones at the same time
+				 * We can do neither of these things.
+				 *
+				 * XXX: - slab uses .status_change_nid
+				 *      - slub uses .status_change_nid_normal
+				 * FIXME: for slub, we may not be placing any
+				 *        "normal" memory in it, can we check
+				 *        for this?
+				 */
+				struct memory_notify arg = {
+					.status_change_nid = nid,
+					.status_change_nid_normal = nid,
+				};
+
+				ret = memory_notify(MEM_GOING_ONLINE, &arg);
+				ret = notifier_to_errno(ret);
+				if (WARN_ON(ret)) {
+					/* XXX: other stuff will bug out if we
+					 * keep going, need to actually cancel
+					 * memlayout changes
+					 */
+					memory_notify(MEM_CANCEL_ONLINE, &arg);
+				}
+			}
+		}
+
+		/* Determine the zones required */
+		for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+			struct zone *zone;
+			if (!pfn_valid(pfn))
+				continue;
+
+			zone = nid_zone(nid, page_zonenum(pfn_to_page(pfn)));
+			/* XXX: we (dnuma paths) can handle this (there will
+			 * just be quite a few WARNS in the logs), but if we
+			 * are indicating error above, should we bail out here
+			 * as well? */
+			WARN_ON(ensure_zone_is_initialized(zone, 0, 0));
+		}
+	}
+}
+
+/*
+ * Cannot be folded into dnuma_move_free_pages() because unmarked pages could
+ * be freed back into the zone while dnuma_move_free_pages() is in the middle
+ * of iterating over it.
+ */
+void dnuma_mark_page_range(struct memlayout *new_ml)
+{
+	struct rangemap_entry *rme;
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn;
+		for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+			if (!pfn_valid(pfn))
+				continue;
+			/* FIXME: should we be skipping compound / buddied
+			 *        pages? */
+			/* FIXME: if PageReserved(), can we just poke the nid
+			 *        directly? Should we? */
+			SetPageLookupNode(pfn_to_page(pfn));
+		}
+	}
+}
+
+#if 0
+static void node_states_set_node(int node, struct memory_notify *arg)
+{
+	if (arg->status_change_nid_normal >= 0)
+		node_set_state(node, N_NORMAL_MEMORY);
+
+	if (arg->status_change_nid_high >= 0)
+		node_set_state(node, N_HIGH_MEMORY);
+
+	node_set_state(node, N_MEMORY);
+}
+#endif
+
+void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+}
+
+static void dnuma_prior_return_to_new_zone(struct page *page, int order,
+					   struct zone *dest_zone,
+					   int dest_nid)
+{
+	int i;
+	unsigned long pfn = page_to_pfn(page);
+
+	grow_pgdat_and_zone(dest_zone, pfn, pfn + (1UL << order));
+
+	for (i = 0; i < 1UL << order; i++)
+		set_page_node(&page[i], dest_nid);
+}
+
+static void clear_lookup_node(struct page *page, int order)
+{
+	int i;
+	for (i = 0; i < 1UL << order; i++)
+		ClearPageLookupNode(&page[i]);
+}
+
+/* Does not assume it is called with any locking (but can be called with zone
+ * locks held, if needed) */
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+				  struct zone *dest_zone,
+				  int dest_nid)
+{
+	dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+}
+
+/* must be called with zone->lock held and memlayout's update_lock held */
+static void remove_free_pages_from_zone(struct zone *zone, struct page *page,
+					int order)
+{
+	/* zone free stats */
+	zone->free_area[order].nr_free--;
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	list_del(&page->lru);
+	__ClearPageBuddy(page);
+
+	/* Allowed because we hold the memlayout update_lock. */
+	clear_lookup_node(page, order);
+}
+
+/*
+ * __ref is to allow (__meminit) zone_pcp_update(), which we will have because
+ * DYNAMIC_NUMA depends on MEMORY_HOTPLUG (and MEMORY_HOTPLUG makes __meminit a
+ * nop).
+ */
+static void __ref add_free_page_to_node(int dest_nid, struct page *page,
+					int order)
+{
+	bool need_zonelists_rebuild = false;
+	struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+	VM_BUG_ON(!zone_is_initialized(dest_zone));
+
+	if (zone_is_empty(dest_zone))
+		need_zonelists_rebuild = true;
+
+	/* Add page to new zone */
+	dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+	return_pages_to_zone(page, order, dest_zone);
+	dnuma_post_free_to_new_zone(page, order);
+
+	/* XXX: fixme, there are other states that need fixing up */
+	if (!node_state(dest_nid, N_MEMORY))
+		node_set_state(dest_nid, N_MEMORY);
+
+	if (need_zonelists_rebuild) {
+		/* XXX: also does stop_machine() */
+		zone_pcp_reset(dest_zone);
+		/* XXX: why is this locking actually needed? */
+		mutex_lock(&zonelists_mutex);
+#if 0
+		/* assumes that zone is unused */
+		setup_zone_pageset(dest_zone);
+		build_all_zonelists(NULL, NULL);
+#else
+		build_all_zonelists(NULL, dest_zone);
+#endif
+		mutex_unlock(&zonelists_mutex);
+	}
+}
+
+static struct rangemap_entry *add_split_pages_to_zones(
+		struct rangemap_entry *first_rme,
+		struct page *page, int order)
+{
+	int i;
+	struct rangemap_entry *rme = first_rme;
+	/*
+	 * We avoid doing any hard work to try to split the pages optimally
+	 * here because the page allocator splits them into 0-order pages
+	 * anyway.
+	 *
+	 * XXX: All of the checks for NULL rmes and the nid conditional are to
+	 * work around memlayouts potentially not covering all valid memory.
+	 */
+	for (i = 0; i < (1 << order); i++) {
+		unsigned long pfn = page_to_pfn(page);
+		int nid;
+		while (rme && pfn > rme->pfn_end)
+			rme = rme_next(rme);
+
+		if (rme && pfn >= rme->pfn_start)
+			nid = rme->nid;
+		else
+			nid = page_to_nid(page + i);
+
+		add_free_page_to_node(nid, page + i, 0);
+	}
+
+	return rme;
+}
+
+#define _page_count_idx(managed, nid, zone_num) \
+	(managed + 2 * (zone_num + MAX_NR_ZONES * (nid)))
+#define page_count_idx(nid, zone_num) _page_count_idx(0, nid, zone_num)
+
+/*
+ * Because we hold lock_memory_hotplug(), we assume that no one else will be
+ * changing present_pages and managed_pages.
+ *
+ * Note that while we iterate over all pages and could collect the info to
+ * adjust all the various spanned_pages and start_pfn fields here, because
+ * movement of pages from their old node to the new one occurs gradually,
+ * doing so would cause some allocated pages that still belong to a node/zone
+ * to be missed during an iteration over the span.
+ */
+static void update_page_counts(struct memlayout *new_ml)
+{
+	/* Perform a combined iteration of pgdat+zones and memlayout.
+	 * - memlayouts are ordered, their lookup from pfn is "slow", and they
+	 *   are contiguous.
+	 * - pgdat+zones are unordered, have O(1) lookups, and don't have holes
+	 *   over valid pfns.
+	 */
+	int nid;
+	struct rangemap_entry *rme;
+	unsigned long pfn = 0;
+	unsigned long *counts = kzalloc(2 * nr_node_ids * MAX_NR_ZONES *
+						sizeof(*counts),
+					GFP_KERNEL);
+	if (WARN_ON(!counts))
+		return;
+	rme = rme_first(new_ml);
+
+	/* TODO: use knowledge about what size blocks of pages can be !valid to
+	 * greatly speed this computation. */
+	for (pfn = 0; pfn < max_pfn; pfn++) {
+		int nid;
+		struct page *page;
+		size_t idx;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (pfn > rme->pfn_end)
+			rme = rme_next(rme);
+
+		if (WARN_ON(!rme))
+			continue;
+
+		nid = rme->nid;
+
+		idx = page_count_idx(nid, page_zonenum(page));
+		/* XXX: what happens if pages become
+		   reserved/unreserved during this
+		   process? */
+		if (!PageReserved(page))
+			counts[idx]++; /* managed_pages */
+		counts[idx + 1]++;     /* present_pages */
+	}
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long flags;
+		unsigned long nid_present = 0;
+		int zone_num;
+		pg_data_t *node = NODE_DATA(nid);
+		if (!node)
+			continue;
+		for (zone_num = 0; zone_num < node->nr_zones;
+				zone_num++) {
+			struct zone *zone = &node->node_zones[zone_num];
+			size_t idx = page_count_idx(nid, zone_num);
+			pr_debug("nid %d zone %d mp=%lu pp=%lu -> mp=%lu pp=%lu\n",
+					nid, zone_num,
+					zone->managed_pages,
+					zone->present_pages,
+					counts[idx], counts[idx+1]);
+			zone->managed_pages = counts[idx];
+			zone->present_pages = counts[idx + 1];
+			nid_present += zone->present_pages;
+
+			/*
+			 * recalculate pcp ->batch & ->high using
+			 * zone->managed_pages
+			 */
+			zone_pcp_update(zone);
+		}
+
+		pr_debug(" node %d zone * present_pages %lu to %lu\n",
+				node->node_id, node->node_present_pages,
+				nid_present);
+		pgdat_resize_lock(node, &flags);
+		node->node_present_pages = nid_present;
+		pgdat_resize_unlock(node, &flags);
+	}
+
+	kfree(counts);
+}
+
+void __ref dnuma_move_free_pages(struct memlayout *new_ml)
+{
+	struct rangemap_entry *rme;
+
+	update_page_counts(new_ml);
+	init_per_zone_wmark_min();
+
+	/* FIXME: how does this removal of pages from a zone interact with
+	 * migrate types? ISOLATION? */
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn = rme->pfn_start;
+		int range_nid;
+		struct page *page;
+new_rme:
+		range_nid = rme->nid;
+
+		for (; pfn <= rme->pfn_end; pfn++) {
+			struct zone *zone;
+			int page_nid, order;
+			unsigned long flags, last_pfn, first_pfn;
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+#if 0
+			/* XXX: can we ensure this is safe? Pages marked
+			 * reserved could be freed into the page allocator if
+			 * they mark memory areas that were allocated via
+			 * earlier allocators. */
+			if (PageReserved(page)) {
+				set_page_node(page, range_nid);
+				continue;
+			}
+#endif
+
+			/* Currently allocated, will be fixed up when freed. */
+			if (!PageBuddy(page))
+				continue;
+
+			page_nid = page_to_nid(page);
+			if (page_nid == range_nid)
+				continue;
+
+			zone = page_zone(page);
+			spin_lock_irqsave(&zone->lock, flags);
+
+			/* Someone allocated it since we last checked. It will
+			 * be fixed up when it is freed */
+			if (!PageBuddy(page))
+				goto skip_unlock;
+
+			/* It has already been transplanted "somewhere";
+			 * that somewhere should be the proper zone. */
+			if (page_zone(page) != zone) {
+				VM_BUG_ON(zone != nid_zone(range_nid,
+							page_zonenum(page)));
+				goto skip_unlock;
+			}
+
+			order = page_order(page);
+			first_pfn = pfn & ~((1 << order) - 1);
+			last_pfn  = pfn |  ((1 << order) - 1);
+			if (WARN(pfn != first_pfn,
+					"pfn %05lx is not first_pfn %05lx\n",
+					pfn, first_pfn)) {
+				pfn = last_pfn;
+				goto skip_unlock;
+			}
+
+			if (last_pfn > rme->pfn_end) {
+				/*
+				 * this higher order page doesn't fit into the
+				 * current range even though it starts there.
+				 */
+				pr_warn("order-%02d page (pfn %05lx-%05lx) extends beyond end of rme "RME_FMT"\n",
+						order, first_pfn, last_pfn,
+						RME_EXP(rme));
+
+				remove_free_pages_from_zone(zone, page, order);
+				spin_unlock_irqrestore(&zone->lock, flags);
+
+				rme = add_split_pages_to_zones(rme, page,
+						order);
+				pfn = last_pfn + 1;
+				goto new_rme;
+			}
+
+			remove_free_pages_from_zone(zone, page, order);
+			spin_unlock_irqrestore(&zone->lock, flags);
+
+			add_free_page_to_node(range_nid, page, order);
+			pfn = last_pfn;
+			continue;
+skip_unlock:
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+	}
+}
diff --git a/mm/memlayout.c b/mm/memlayout.c
new file mode 100644
index 0000000..132dbff
--- /dev/null
+++ b/mm/memlayout.c
@@ -0,0 +1,322 @@
+/*
+ * memlayout - provides a mapping of PFN ranges to nodes, with the requirements
+ * that looking up a node from a PFN is fast and that changes to the mapping
+ * occur relatively infrequently.
+ *
+ */
+#define pr_fmt(fmt) "memlayout: " fmt
+
+#include <linux/dnuma.h>
+#include <linux/export.h>
+#include <linux/memblock.h>
+#include <linux/printk.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+
+/* protected by memlayout_lock */
+__rcu struct memlayout *pfn_to_node_map;
+DEFINE_MUTEX(memlayout_lock);
+
+static void free_rme_tree(struct rb_root *root)
+{
+	struct rangemap_entry *pos, *n;
+	rbtree_postorder_for_each_entry_safe(pos, n, root, node) {
+		kfree(pos);
+	}
+}
+
+static void ml_destroy_mem(struct memlayout *ml)
+{
+	if (!ml)
+		return;
+	free_rme_tree(&ml->root);
+	kfree(ml);
+}
+
+static int find_insertion_point(struct memlayout *ml, unsigned long pfn_start,
+		unsigned long pfn_end, int nid, struct rb_node ***o_new,
+		struct rb_node **o_parent)
+{
+	struct rb_node **new = &ml->root.rb_node, *parent = NULL;
+	struct rangemap_entry *rme;
+	pr_debug("adding range: {%lX-%lX}:%d\n", pfn_start, pfn_end, nid);
+	while (*new) {
+		rme = rb_entry(*new, typeof(*rme), node);
+
+		parent = *new;
+		if (pfn_end < rme->pfn_start && pfn_start < rme->pfn_end)
+			new = &((*new)->rb_left);
+		else if (pfn_start > rme->pfn_end && pfn_end > rme->pfn_end)
+			new = &((*new)->rb_right);
+		else {
+			/* an embedded region, need to use an interval or
+			 * sequence tree. */
+			pr_warn("tried to embed {%lX,%lX}:%d inside {%lX-%lX}:%d\n",
+				 pfn_start, pfn_end, nid,
+				 rme->pfn_start, rme->pfn_end, rme->nid);
+			return 1;
+		}
+	}
+
+	*o_new = new;
+	*o_parent = parent;
+	return 0;
+}
+
+int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,
+		unsigned long pfn_end, int nid)
+{
+	struct rb_node **new, *parent;
+	struct rangemap_entry *rme;
+
+	if (WARN_ON(nid < 0))
+		return -EINVAL;
+	if (WARN_ON(nid >= nr_node_ids))
+		return -EINVAL;
+
+	if (find_insertion_point(ml, pfn_start, pfn_end, nid, &new, &parent))
+		return 1;
+
+	rme = kmalloc(sizeof(*rme), GFP_KERNEL);
+	if (!rme)
+		return -ENOMEM;
+
+	rme->pfn_start = pfn_start;
+	rme->pfn_end = pfn_end;
+	rme->nid = nid;
+
+	rb_link_node(&rme->node, parent, new);
+	rb_insert_color(&rme->node, &ml->root);
+	return 0;
+}
+
+/*
+ * If @ml is the pfn_to_node_map, it must have been dereferenced and
+ * rcu_read_lock() must be held when called and while the returned
+ * rangemap_entry is used. Alternatively, the update-side memlayout_lock can be
+ * held and rcu_dereference_protected() used for operations that need to block.
+ *
+ * Returns the RME that contains the given PFN, or, if no RME contains the
+ * given PFN, the next RME (the one covering the nearest higher pfns), or
+ * NULL if there is no next RME.
+ *
+ * This is designed for use in iterating over a subset of the rme's, starting
+ * at @pfn passed to this function.
+ */
+struct rangemap_entry *memlayout_pfn_to_rme_higher(struct memlayout *ml, unsigned long pfn)
+{
+	struct rb_node *node, *prev_node = NULL;
+	struct rangemap_entry *rme;
+	if (!ml || (ml->type == ML_INITIAL))
+		return NULL;
+
+	rme = ACCESS_ONCE(ml->cache);
+	smp_read_barrier_depends();
+
+	if (rme && rme_bounds_pfn(rme, pfn))
+		return rme;
+
+	node = ml->root.rb_node;
+	while (node) {
+		struct rangemap_entry *rme = rb_entry(node, typeof(*rme), node);
+		bool greater_than_start = rme->pfn_start <= pfn;
+		bool less_than_end = pfn <= rme->pfn_end;
+
+		if (greater_than_start && !less_than_end) {
+			prev_node = node;
+			node = node->rb_right;
+		} else if (less_than_end && !greater_than_start) {
+			prev_node = node;
+			node = node->rb_left;
+		} else {
+			/* can only occur if a range ends before it starts */
+			if (WARN_ON(!greater_than_start && !less_than_end))
+				return NULL;
+
+			/* greater_than_start && less_than_end. */
+			ACCESS_ONCE(ml->cache) = rme;
+			return rme;
+		}
+	}
+	if (prev_node) {
+		struct rangemap_entry *rme = rb_entry(prev_node, typeof(*rme), node);
+		if (pfn < rme->pfn_start)
+			return rme;
+		else
+			return rme_next(rme);
+	}
+	return NULL;
+}
+
+int memlayout_pfn_to_nid(unsigned long pfn)
+{
+	struct rangemap_entry *rme;
+	int nid;
+	rcu_read_lock();
+	rme = memlayout_pfn_to_rme_higher(rcu_dereference(pfn_to_node_map), pfn);
+
+	/*
+	 * by using a modified version of memlayout_pfn_to_rme_higher(), the
+	 * rme_bounds_pfn() check could be skipped. Unfortunately, it would also
+	 * result in a large amount of copy-pasted code (or a nasty inline func)
+	 */
+	if (!rme || !rme_bounds_pfn(rme, pfn))
+		nid = NUMA_NO_NODE;
+	else
+		nid = rme->nid;
+	rcu_read_unlock();
+	return nid;
+}
+
+/*
+ * given a new memory layout that is not yet in use by the system,
+ * modify it so that
+ * - all pfns are included
+ *   - handled by extending the first range to the beginning of memory and
+ *     extending all other ranges until they abut the following range (or in the
+ *     case of the last node, the end of memory)
+ *
+ * 1) we could have it exclude pfn ranges that are !pfn_valid() if we hook
+ * into the code which changes pfn validity.
+ *  - Would this be a significant performance/code quality boost?
+ *
+ * 2) even further, we could munge the memlayout to handle cases where the
+ * number of physical numa nodes exceeds nr_node_ids, and generally clean up
+ * the node numbering (avoid nid gaps, renumber nids to reduce the need for
+ * moving pages between nodes). These changes would require cooperation between
+ * this and code which manages the mapping of CPUs to nodes.
+ */
+static void memlayout_expand(struct memlayout *ml)
+{
+	struct rb_node *r = rb_first(&ml->root);
+	struct rangemap_entry *rme = rb_entry(r, typeof(*rme), node), *prev;
+	if (rme->pfn_start != 0) {
+		pr_info("expanding rme "RME_FMT" to start of memory\n",
+				RME_EXP(rme));
+		rme->pfn_start = 0;
+	}
+
+	for (r = rb_next(r); r; r = rb_next(r)) {
+		prev = rme;
+		rme = rb_entry(r, typeof(*rme), node);
+
+		if (prev->pfn_end + 1 < rme->pfn_start) {
+			pr_info("expanding rme "RME_FMT" to end of "RME_FMT"\n",
+					RME_EXP(prev), RME_EXP(rme));
+			prev->pfn_end = rme->pfn_start - 1;
+		}
+	}
+
+	if (rme->pfn_end < max_pfn) {
+		pr_info("expanding rme "RME_FMT" to max_pfn=%05lx\n",
+				RME_EXP(rme), max_pfn);
+		rme->pfn_end = max_pfn;
+	}
+}
+
+void memlayout_destroy(struct memlayout *ml)
+{
+	ml_destroy_mem(ml);
+}
+
+struct memlayout *memlayout_create(enum memlayout_type type)
+{
+	struct memlayout *ml;
+
+	if (WARN_ON(type < 0 || type >= ML_NUM_TYPES))
+		return NULL;
+
+	ml = kmalloc(sizeof(*ml), GFP_KERNEL);
+	if (!ml)
+		return NULL;
+
+	ml->root = RB_ROOT;
+	ml->type = type;
+	ml->cache = NULL;
+
+	return ml;
+}
+
+void memlayout_commit(struct memlayout *ml)
+{
+	struct memlayout *old_ml;
+	memlayout_expand(ml);
+
+	if (ml->type == ML_INITIAL) {
+		if (WARN(dnuma_has_memlayout(),
+				"memlayout marked first is not first, ignoring.\n")) {
+			memlayout_destroy(ml);
+			ml_backlog_feed(ml);
+			return;
+		}
+
+		mutex_lock(&memlayout_lock);
+		rcu_assign_pointer(pfn_to_node_map, ml);
+		mutex_unlock(&memlayout_lock);
+		return;
+	}
+
+	lock_memory_hotplug();
+	dnuma_online_required_nodes_and_zones(ml);
+	/* This unlock is only allowed if nothing will offline nodes (or
+	 * zones). */
+	unlock_memory_hotplug();
+
+	mutex_lock(&memlayout_lock);
+	old_ml = rcu_dereference_protected(pfn_to_node_map,
+			mutex_is_locked(&memlayout_lock));
+
+	rcu_assign_pointer(pfn_to_node_map, ml);
+
+	synchronize_rcu();
+	memlayout_destroy(old_ml);
+
+	/* Must be called only after the new value for pfn_to_node_map has
+	 * propagated to all tasks, otherwise some pages may look up the old
+	 * pfn_to_node_map on free & not transplant themselves to their new-new
+	 * node. */
+	dnuma_mark_page_range(ml);
+
+	/* Do this after the free path is set up so that pages are freed into
+	 * their "new" zones; after this completes, no free pages remain in the
+	 * wrong zone (except for those in the pcp lists). */
+	dnuma_move_free_pages(ml);
+
+	/* All new _non pcp_ page allocations now match the memlayout */
+	drain_all_pages();
+	/* All new page allocations now match the memlayout */
+
+	mutex_unlock(&memlayout_lock);
+}
+
+/*
+ * The default memlayout global initializer, using memblock to determine
+ * affinities
+ *
+ * requires: slab_is_available() && memblock is not (yet) freed.
+ * sleeps: definitely: memlayout_commit() -> synchronize_rcu()
+ *	   potentially: kmalloc()
+ */
+__weak __init
+void memlayout_global_init(void)
+{
+	int i, nid, errs = 0;
+	unsigned long start, end;
+	struct memlayout *ml = memlayout_create(ML_INITIAL);
+	if (WARN_ON(!ml))
+		return;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+		int r = memlayout_new_range(ml, start, end - 1, nid);
+		if (r) {
+			pr_err("failed to add range [%05lx, %05lx] in node %d to mapping\n",
+					start, end, nid);
+			errs++;
+		} else
+			pr_devel("added range [%05lx, %05lx] in node %d\n",
+					start, end, nid);
+	}
+
+	memlayout_commit(ml);
+}
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 14/31] mm: memlayout+dnuma: add debugfs interface
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (12 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 13/31] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 15/31] drivers/base/memory.c: alphabetize headers Cody P Schafer
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Add a debugfs interface to dnuma/memlayout. It keeps track of a
variable backlog of memory layouts, provides some statistics on dnuma
moved pages & cache performance, and allows the setting of a new global
memlayout.
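
As an illustration of the write interface (the debugfs mount point and the
pfn range below are examples only, not part of this patch), assigning pfns
0x0-0x7fff to node 1 and committing the new layout could look like:

	echo 0x0    > /sys/kernel/debug/memlayout/start
	echo 0x7fff > /sys/kernel/debug/memlayout/end
	echo 1      > /sys/kernel/debug/memlayout/node    # adds the range to node 1
	echo 1      > /sys/kernel/debug/memlayout/commit  # apply the new layout
	cat /sys/kernel/debug/memlayout/moved-pages       # pages transplanted so far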

TODO: split out statistics, backlog, & write interfaces from each other.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |   2 +-
 include/linux/memlayout.h |   7 +
 mm/Kconfig                |  30 ++++
 mm/Makefile               |   1 +
 mm/dnuma.c                |   4 +-
 mm/memlayout-debugfs.c    | 339 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h    |  39 ++++++
 mm/memlayout.c            |  23 +++-
 8 files changed, 438 insertions(+), 7 deletions(-)
 create mode 100644 mm/memlayout-debugfs.c
 create mode 100644 mm/memlayout-debugfs.h

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
index 029a984..7a33131 100644
--- a/include/linux/dnuma.h
+++ b/include/linux/dnuma.h
@@ -64,7 +64,7 @@ static inline int dnuma_page_needs_move(struct page *page)
 	return new_nid;
 }
 
-void dnuma_post_free_to_new_zone(struct page *page, int order);
+void dnuma_post_free_to_new_zone(int order);
 void dnuma_prior_free_to_new_zone(struct page *page, int order,
 				  struct zone *dest_zone,
 				  int dest_nid);
diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
index adab685..c09ecdb 100644
--- a/include/linux/memlayout.h
+++ b/include/linux/memlayout.h
@@ -56,6 +56,7 @@ struct memlayout {
 };
 
 extern __rcu struct memlayout *pfn_to_node_map;
+extern struct mutex memlayout_lock; /* update-side lock */
 
 /* FIXME: overflow potential in completion check */
 #define ml_for_each_pfn_in_range(rme, pfn)	\
@@ -90,7 +91,13 @@ static inline struct rangemap_entry *rme_first(struct memlayout *ml)
 	     rme = rme_next(rme))
 
 struct memlayout *memlayout_create(enum memlayout_type);
+
+/*
+ * In most cases, these should only be used by the memlayout debugfs code (or
+ * internally within memlayout)
+ */
 void memlayout_destroy(struct memlayout *ml);
+void memlayout_destroy_mem(struct memlayout *ml);
 
 int memlayout_new_range(struct memlayout *ml,
 		unsigned long pfn_start, unsigned long pfn_end, int nid);
diff --git a/mm/Kconfig b/mm/Kconfig
index bfbe300..3ddf6e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,6 +193,36 @@ config DYNAMIC_NUMA
 	 Choose Y if you are running Linux under a hypervisor that uses
 	 this feature, otherwise choose N if unsure.
 
+config DNUMA_DEBUGFS
+	bool "Export DNUMA & memlayout internals via debugfs"
+	depends on DYNAMIC_NUMA
+	help
+	 Export some dynamic numa info via debugfs in <debugfs>/memlayout.
+
+	 Enables tracking and export of statistics as well as export of the
+	 current memory layout.
+
+	 If you are not debugging Dynamic NUMA or memlayout, choose N.
+
+config DNUMA_BACKLOG
+	int "Number of old memlayouts to keep (0 = None, -1 = unlimited)"
+	depends on DNUMA_DEBUGFS
+	help
+	 Allows access to old memory layouts & statistics in debugfs.
+
+	 Each memlayout will consume some memory, and when set to -1
+	 (unlimited), this can result in unbounded kernel memory use.
+
+config DNUMA_DEBUGFS_WRITE
+	bool "Change NUMA layout via debugfs"
+	depends on DNUMA_DEBUGFS
+	help
+	 Enable the use of <debugfs>/memlayout/{start,end,node,commit}
+
+	 Write a PFN to 'start' & 'end', then a node id to 'node'.
+	 Repeat this until you are satisfied with your memory layout, then
+	 write '1' to 'commit'.
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
diff --git a/mm/Makefile b/mm/Makefile
index c538e1e..7ce2b26 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
+obj-$(CONFIG_DNUMA_DEBUGFS) += memlayout-debugfs.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
index 2b6e13e..7ee77a0 100644
--- a/mm/dnuma.c
+++ b/mm/dnuma.c
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 
 #include "internal.h"
+#include "memlayout-debugfs.h"
 
 /* - must be called under lock_memory_hotplug() */
 /* TODO: avoid iterating over all PFNs. */
@@ -113,8 +114,9 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 }
 #endif
 
-void dnuma_post_free_to_new_zone(struct page *page, int order)
+void dnuma_post_free_to_new_zone(int order)
 {
+	ml_stat_count_moved_pages(order);
 }
 
 static void dnuma_prior_return_to_new_zone(struct page *page, int order,
diff --git a/mm/memlayout-debugfs.c b/mm/memlayout-debugfs.c
new file mode 100644
index 0000000..a4fc2cb
--- /dev/null
+++ b/mm/memlayout-debugfs.c
@@ -0,0 +1,339 @@
+#include <linux/debugfs.h>
+
+#include <linux/slab.h> /* kmalloc */
+#include <linux/module.h> /* THIS_MODULE, needed for DEFINE_SIMPLE_ATTR */
+
+#include "memlayout-debugfs.h"
+
+#if CONFIG_DNUMA_BACKLOG > 0
+/* Fixed size backlog */
+#include <linux/kfifo.h>
+#include <linux/log2.h> /* roundup_pow_of_two */
+DEFINE_KFIFO(ml_backlog, struct memlayout *,
+		roundup_pow_of_two(CONFIG_DNUMA_BACKLOG));
+void ml_backlog_feed(struct memlayout *ml)
+{
+	if (kfifo_is_full(&ml_backlog)) {
+		struct memlayout *old_ml;
+		BUG_ON(!kfifo_get(&ml_backlog, &old_ml));
+		memlayout_destroy(old_ml);
+	}
+
+	kfifo_put(&ml_backlog, (const struct memlayout **)&ml);
+}
+#elif CONFIG_DNUMA_BACKLOG < 0
+/* Unlimited backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+	/* we never use the rme_tree, so we destroy the non-debugfs portions to
+	 * save memory */
+	memlayout_destroy_mem(ml);
+}
+#else /* CONFIG_DNUMA_BACKLOG == 0 */
+/* No backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+	memlayout_destroy(ml);
+}
+#endif
+
+static atomic64_t dnuma_moved_page_ct;
+void ml_stat_count_moved_pages(int order)
+{
+	atomic64_add(1 << order, &dnuma_moved_page_ct);
+}
+
+static atomic_t ml_seq = ATOMIC_INIT(0);
+static struct dentry *root_dentry, *current_dentry;
+#define ML_LAYOUT_NAME_SZ \
+	((size_t)(DIV_ROUND_UP(sizeof(unsigned) * 8, 3) \
+				+ 1 + strlen("layout.")))
+#define ML_REGION_NAME_SZ ((size_t)(2 * BITS_PER_LONG / 4 + 2))
+
+static void ml_layout_name(struct memlayout *ml, char *name)
+{
+	sprintf(name, "layout.%u", ml->seq);
+}
+
+static int dfs_range_get(void *data, u64 *val)
+{
+	*val = (uintptr_t)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(range_fops, dfs_range_get, NULL, "%lld\n");
+
+static void _ml_dbgfs_create_range(struct dentry *base,
+		struct rangemap_entry *rme, char *name)
+{
+	struct dentry *rd;
+	sprintf(name, "%05lx-%05lx", rme->pfn_start, rme->pfn_end);
+	rd = debugfs_create_file(name, 0400, base,
+				(void *)(uintptr_t)rme->nid, &range_fops);
+	if (!rd)
+		pr_devel("debugfs: failed to create "RME_FMT"\n",
+				RME_EXP(rme));
+	else
+		pr_devel("debugfs: created "RME_FMT"\n", RME_EXP(rme));
+}
+
+/* Must be called with memlayout_lock held */
+static void _ml_dbgfs_set_current(struct memlayout *ml, char *name)
+{
+	ml_layout_name(ml, name);
+	debugfs_remove(current_dentry);
+	current_dentry = debugfs_create_symlink("current", root_dentry, name);
+}
+
+static void ml_dbgfs_create_layout_assume_root(struct memlayout *ml)
+{
+	char name[ML_LAYOUT_NAME_SZ];
+	ml_layout_name(ml, name);
+	WARN_ON(!root_dentry);
+	ml->d = debugfs_create_dir(name, root_dentry);
+	WARN_ON(!ml->d);
+}
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+
+#define DEFINE_DEBUGFS_GET(___type)					\
+	static int debugfs_## ___type ## _get(void *data, u64 *val)	\
+	{								\
+		*val = *(___type *)data;				\
+		return 0;						\
+	}
+
+DEFINE_DEBUGFS_GET(u32);
+DEFINE_DEBUGFS_GET(u8);
+
+#define DEFINE_WATCHED_ATTR(___type, ___var)			\
+	static int ___var ## _watch_set(void *data, u64 val)	\
+	{							\
+		___type old_val = *(___type *)data;		\
+		int ret = ___var ## _watch(old_val, val);	\
+		if (!ret)					\
+			*(___type *)data = val;			\
+		return ret;					\
+	}							\
+	DEFINE_SIMPLE_ATTRIBUTE(___var ## _fops,		\
+			debugfs_ ## ___type ## _get,		\
+			___var ## _watch_set, "%llu\n");
+
+#define DEFINE_ACTION_ATTR(___name)
+
+static u64 dnuma_user_start;
+static u64 dnuma_user_end;
+static u32 dnuma_user_node; /* XXX: I don't care about this var, remove? */
+static u8  dnuma_user_commit, dnuma_user_clear; /* same here */
+static struct memlayout *user_ml;
+static DEFINE_MUTEX(dnuma_user_lock);
+static int dnuma_user_node_watch(u32 old_val, u32 new_val)
+{
+	int ret = 0;
+	mutex_lock(&dnuma_user_lock);
+	if (!user_ml)
+		user_ml = memlayout_create(ML_USER_DEBUG);
+
+	if (WARN_ON(!user_ml)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (new_val >= nr_node_ids) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (dnuma_user_start > dnuma_user_end) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = memlayout_new_range(user_ml, dnuma_user_start, dnuma_user_end,
+				  new_val);
+
+	if (!ret) {
+		dnuma_user_start = 0;
+		dnuma_user_end = 0;
+	}
+out:
+	mutex_unlock(&dnuma_user_lock);
+	return ret;
+}
+
+static int dnuma_user_commit_watch(u8 old_val, u8 new_val)
+{
+	mutex_lock(&dnuma_user_lock);
+	if (user_ml)
+		memlayout_commit(user_ml);
+	user_ml = NULL;
+	mutex_unlock(&dnuma_user_lock);
+	return 0;
+}
+
+static int dnuma_user_clear_watch(u8 old_val, u8 new_val)
+{
+	mutex_lock(&dnuma_user_lock);
+	if (user_ml)
+		memlayout_destroy(user_ml);
+	user_ml = NULL;
+	mutex_unlock(&dnuma_user_lock);
+	return 0;
+}
+
+DEFINE_WATCHED_ATTR(u32, dnuma_user_node);
+DEFINE_WATCHED_ATTR(u8, dnuma_user_commit);
+DEFINE_WATCHED_ATTR(u8, dnuma_user_clear);
+# endif /* defined(CONFIG_DNUMA_DEBUGFS_WRITE) */
+
+/* Create the entire current memlayout.
+ * Only used for the layout which exists prior to fs initialization.
+ */
+static void ml_dbgfs_create_initial_layout(void)
+{
+	struct rangemap_entry *rme;
+	char name[max(ML_REGION_NAME_SZ, ML_LAYOUT_NAME_SZ)];
+	struct memlayout *old_ml, *new_ml;
+
+	new_ml = kmalloc(sizeof(*new_ml), GFP_KERNEL);
+	if (WARN(!new_ml, "memlayout allocation failed\n"))
+		return;
+
+	mutex_lock(&memlayout_lock);
+
+	old_ml = rcu_dereference_protected(pfn_to_node_map,
+			mutex_is_locked(&memlayout_lock));
+	if (WARN_ON(!old_ml))
+		goto e_out;
+	*new_ml = *old_ml;
+
+	if (WARN_ON(new_ml->d))
+		goto e_out;
+
+	/* this assumption holds as ml_dbgfs_create_initial_layout() (this
+	 * function) is only called by ml_dbgfs_create_root() */
+	ml_dbgfs_create_layout_assume_root(new_ml);
+	if (!new_ml->d)
+		goto e_out;
+
+	ml_for_each_range(new_ml, rme) {
+		_ml_dbgfs_create_range(new_ml->d, rme, name);
+	}
+
+	_ml_dbgfs_set_current(new_ml, name);
+	rcu_assign_pointer(pfn_to_node_map, new_ml);
+	mutex_unlock(&memlayout_lock);
+
+	synchronize_rcu();
+	kfree(old_ml);
+	return;
+e_out:
+	mutex_unlock(&memlayout_lock);
+	kfree(new_ml);
+}
+
+static atomic64_t ml_cache_hits;
+static atomic64_t ml_cache_misses;
+
+void ml_stat_cache_miss(void)
+{
+	atomic64_inc(&ml_cache_misses);
+}
+
+void ml_stat_cache_hit(void)
+{
+	atomic64_inc(&ml_cache_hits);
+}
+
+/* returns 0 if root_dentry has been created */
+static int ml_dbgfs_create_root(void)
+{
+	if (root_dentry)
+		return 0;
+
+	if (!debugfs_initialized()) {
+		pr_devel("debugfs not registered or disabled.\n");
+		return -EINVAL;
+	}
+
+	root_dentry = debugfs_create_dir("memlayout", NULL);
+	if (!root_dentry) {
+		pr_devel("root dir creation failed\n");
+		return -EINVAL;
+	}
+
+	/* TODO: place in a different dir? (to keep memlayout & dnuma separate)
+	 */
+	/* FIXME: use debugfs_create_atomic64() [does not yet exist]. */
+	debugfs_create_u64("moved-pages", 0400, root_dentry,
+			   (uint64_t *)&dnuma_moved_page_ct.counter);
+	debugfs_create_u64("pfn-lookup-cache-misses", 0400, root_dentry,
+			   (uint64_t *)&ml_cache_misses.counter);
+	debugfs_create_u64("pfn-lookup-cache-hits", 0400, root_dentry,
+			   (uint64_t *)&ml_cache_hits.counter);
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+	/* Set node last: on write, it adds the range. */
+	debugfs_create_x64("start", 0600, root_dentry, &dnuma_user_start);
+	debugfs_create_x64("end",   0600, root_dentry, &dnuma_user_end);
+	debugfs_create_file("node",  0200, root_dentry,
+			&dnuma_user_node, &dnuma_user_node_fops);
+	debugfs_create_file("commit",  0200, root_dentry,
+			&dnuma_user_commit, &dnuma_user_commit_fops);
+	debugfs_create_file("clear",  0200, root_dentry,
+			&dnuma_user_clear, &dnuma_user_clear_fops);
+# endif
+
+	/* uses root_dentry */
+	ml_dbgfs_create_initial_layout();
+
+	return 0;
+}
+
+static void ml_dbgfs_create_layout(struct memlayout *ml)
+{
+	if (ml_dbgfs_create_root()) {
+		ml->d = NULL;
+		return;
+	}
+	ml_dbgfs_create_layout_assume_root(ml);
+}
+
+static int ml_dbgfs_init_root(void)
+{
+	ml_dbgfs_create_root();
+	return 0;
+}
+
+void ml_dbgfs_init(struct memlayout *ml)
+{
+	ml->seq = atomic_inc_return(&ml_seq) - 1;
+	ml_dbgfs_create_layout(ml);
+}
+
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme)
+{
+	char name[ML_REGION_NAME_SZ];
+	if (ml->d)
+		_ml_dbgfs_create_range(ml->d, rme, name);
+}
+
+void ml_dbgfs_set_current(struct memlayout *ml)
+{
+	char name[ML_LAYOUT_NAME_SZ];
+	_ml_dbgfs_set_current(ml, name);
+}
+
+void ml_destroy_dbgfs(struct memlayout *ml)
+{
+	if (ml && ml->d)
+		debugfs_remove_recursive(ml->d);
+}
+
+static void __exit ml_dbgfs_exit(void)
+{
+	debugfs_remove_recursive(root_dentry);
+	root_dentry = NULL;
+}
+
+module_init(ml_dbgfs_init_root);
+module_exit(ml_dbgfs_exit);
diff --git a/mm/memlayout-debugfs.h b/mm/memlayout-debugfs.h
new file mode 100644
index 0000000..12dc1eb
--- /dev/null
+++ b/mm/memlayout-debugfs.h
@@ -0,0 +1,39 @@
+#ifndef LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+#define LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+
+#include <linux/memlayout.h>
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+void ml_stat_count_moved_pages(int order);
+void ml_stat_cache_hit(void);
+void ml_stat_cache_miss(void);
+void ml_dbgfs_init(struct memlayout *ml);
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme);
+void ml_destroy_dbgfs(struct memlayout *ml);
+void ml_dbgfs_set_current(struct memlayout *ml);
+void ml_backlog_feed(struct memlayout *ml);
+#else /* !defined(CONFIG_DNUMA_DEBUGFS) */
+static inline void ml_stat_count_moved_pages(int order)
+{}
+static inline void ml_stat_cache_hit(void)
+{}
+static inline void ml_stat_cache_miss(void)
+{}
+
+static inline void ml_dbgfs_init(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_create_range(struct memlayout *ml,
+		struct rangemap_entry *rme)
+{}
+static inline void ml_destroy_dbgfs(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_set_current(struct memlayout *ml)
+{}
+
+static inline void ml_backlog_feed(struct memlayout *ml)
+{
+	memlayout_destroy(ml);
+}
+#endif
+
+#endif
diff --git a/mm/memlayout.c b/mm/memlayout.c
index 132dbff..0a1a602 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -14,6 +14,8 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 
+#include "memlayout-debugfs.h"
+
 /* protected by memlayout_lock */
 __rcu struct memlayout *pfn_to_node_map;
 DEFINE_MUTEX(memlayout_lock);
@@ -26,7 +28,7 @@ static void free_rme_tree(struct rb_root *root)
 	}
 }
 
-static void ml_destroy_mem(struct memlayout *ml)
+void memlayout_destroy_mem(struct memlayout *ml)
 {
 	if (!ml)
 		return;
@@ -88,6 +90,8 @@ int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,
 
 	rb_link_node(&rme->node, parent, new);
 	rb_insert_color(&rme->node, &ml->root);
+
+	ml_dbgfs_create_range(ml, rme);
 	return 0;
 }
 
@@ -114,8 +118,12 @@ struct rangemap_entry *memlayout_pfn_to_rme_higher(struct memlayout *ml, unsigne
 	rme = ACCESS_ONCE(ml->cache);
 	smp_read_barrier_depends();
 
-	if (rme && rme_bounds_pfn(rme, pfn))
+	if (rme && rme_bounds_pfn(rme, pfn)) {
+		ml_stat_cache_hit();
 		return rme;
+	}
+
+	ml_stat_cache_miss();
 
 	node = ml->root.rb_node;
 	while (node) {
@@ -217,7 +225,8 @@ static void memlayout_expand(struct memlayout *ml)
 
 void memlayout_destroy(struct memlayout *ml)
 {
-	ml_destroy_mem(ml);
+	ml_destroy_dbgfs(ml);
+	memlayout_destroy_mem(ml);
 }
 
 struct memlayout *memlayout_create(enum memlayout_type type)
@@ -235,6 +244,7 @@ struct memlayout *memlayout_create(enum memlayout_type type)
 	ml->type = type;
 	ml->cache = NULL;
 
+	ml_dbgfs_init(ml);
 	return ml;
 }
 
@@ -246,12 +256,12 @@ void memlayout_commit(struct memlayout *ml)
 	if (ml->type == ML_INITIAL) {
 		if (WARN(dnuma_has_memlayout(),
 				"memlayout marked first is not first, ignoring.\n")) {
-			memlayout_destroy(ml);
 			ml_backlog_feed(ml);
 			return;
 		}
 
 		mutex_lock(&memlayout_lock);
+		ml_dbgfs_set_current(ml);
 		rcu_assign_pointer(pfn_to_node_map, ml);
 		mutex_unlock(&memlayout_lock);
 		return;
@@ -264,13 +274,16 @@ void memlayout_commit(struct memlayout *ml)
 	unlock_memory_hotplug();
 
 	mutex_lock(&memlayout_lock);
+
+	ml_dbgfs_set_current(ml);
+
 	old_ml = rcu_dereference_protected(pfn_to_node_map,
 			mutex_is_locked(&memlayout_lock));
 
 	rcu_assign_pointer(pfn_to_node_map, ml);
 
 	synchronize_rcu();
-	memlayout_destroy(old_ml);
+	ml_backlog_feed(old_ml);
 
 	/* Must be called only after the new value for pfn_to_node_map has
 	 * propagated to all tasks, otherwise some pages may look up the old
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 15/31] drivers/base/memory.c: alphabetize headers.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (13 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 14/31] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 16/31] drivers/base/node,memory: rename function to match interface Cody P Schafer
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

---
 drivers/base/memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 14f8a69..5247698 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -10,20 +10,20 @@
  * SPARSEMEM should be contained here, or in mm/memory_hotplug.c.
  */
 
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/topology.h>
+#include <linux/atomic.h>
 #include <linux/capability.h>
 #include <linux/device.h>
-#include <linux/memory.h>
+#include <linux/init.h>
 #include <linux/kobject.h>
+#include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/mutex.h>
-#include <linux/stat.h>
 #include <linux/slab.h>
+#include <linux/stat.h>
+#include <linux/topology.h>
 
-#include <linux/atomic.h>
 #include <asm/uaccess.h>
 
 static DEFINE_MUTEX(mem_sysfs_mutex);
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 16/31] drivers/base/node,memory: rename function to match interface
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (14 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 15/31] drivers/base/memory.c: alphabetize headers Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 17/31] drivers/base/node: rename unregister_mem_blk_under_nodes() to be more accurate Cody P Schafer
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Rename register_mem_sect_under_node() to register_mem_block_under_node() and
rename unregister_mem_sect_under_nodes() to unregister_mem_block_under_nodes()
to reflect that both of these functions are given memory_blocks instead of
mem_sections.
---
 drivers/base/memory.c |  4 ++--
 drivers/base/node.c   | 10 +++++-----
 include/linux/node.h  |  8 ++++----
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 5247698..b6e3f26 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -619,7 +619,7 @@ static int add_memory_section(int nid, struct mem_section *section,
 	if (!ret) {
 		if (context == HOTPLUG &&
 		    mem->section_count == sections_per_block)
-			ret = register_mem_sect_under_node(mem, nid);
+			ret = register_mem_block_under_node(mem, nid);
 	}
 
 	mutex_unlock(&mem_sysfs_mutex);
@@ -653,7 +653,7 @@ static int remove_memory_block(unsigned long node_id,
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
-	unregister_mem_sect_under_nodes(mem, __section_nr(section));
+	unregister_mem_block_under_nodes(mem, __section_nr(section));
 
 	mem->section_count--;
 	if (mem->section_count == 0) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7616a77c..ad45b59 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -388,8 +388,8 @@ static int get_nid_for_pfn(unsigned long pfn)
 	return pfn_to_nid(pfn);
 }
 
-/* register memory section under specified node if it spans that node */
-int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
+/* register memory block under specified node if it spans that node */
+int register_mem_block_under_node(struct memory_block *mem_blk, int nid)
 {
 	int ret;
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -424,8 +424,8 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 	return 0;
 }
 
-/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+/* unregister memory block under all nodes that it spans */
+int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
 				    unsigned long phys_index)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
@@ -485,7 +485,7 @@ static int link_mem_sections(int nid)
 
 		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
 
-		ret = register_mem_sect_under_node(mem_blk, nid);
+		ret = register_mem_block_under_node(mem_blk, nid);
 		if (!err)
 			err = ret;
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 2115ad5..e20a203 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -36,9 +36,9 @@ extern int register_one_node(int nid);
 extern void unregister_one_node(int nid);
 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
-extern int register_mem_sect_under_node(struct memory_block *mem_blk,
+extern int register_mem_block_under_node(struct memory_block *mem_blk,
 						int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+extern int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
@@ -62,12 +62,12 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 {
 	return 0;
 }
-static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
+static inline int register_mem_block_under_node(struct memory_block *mem_blk,
 							int nid)
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+static inline int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
 						  unsigned long phys_index)
 {
 	return 0;
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 17/31] drivers/base/node: rename unregister_mem_blk_under_nodes() to be more accurate
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (15 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 16/31] drivers/base/node,memory: rename function to match interface Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 18/31] drivers/base/node: add unregister_mem_block_under_nodes() Cody P Schafer
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

unregister_mem_block_under_nodes() only unregisters a single section in
the mem block under all nodes, not the entire mem block. Rename it to
unregister_mem_block_section_under_nodes(). Also rename the phys_index
param to indicate that it is a section number.
---
 drivers/base/memory.c |  2 +-
 drivers/base/node.c   | 11 +++++++----
 include/linux/node.h  | 10 ++++++----
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b6e3f26..90e387c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -653,7 +653,7 @@ static int remove_memory_block(unsigned long node_id,
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
-	unregister_mem_block_under_nodes(mem, __section_nr(section));
+	unregister_mem_block_section_under_nodes(mem, __section_nr(section));
 
 	mem->section_count--;
 	if (mem->section_count == 0) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ad45b59..d3f981e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -424,9 +424,12 @@ int register_mem_block_under_node(struct memory_block *mem_blk, int nid)
 	return 0;
 }
 
-/* unregister memory block under all nodes that it spans */
-int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
-				    unsigned long phys_index)
+/*
+ * unregister memory block under all nodes that a particular section it
+ * contains spans
+ */
+int unregister_mem_block_section_under_nodes(struct memory_block *mem_blk,
+				    unsigned long sec_num)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -439,7 +442,7 @@ int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
+	sect_start_pfn = section_nr_to_pfn(sec_num);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
diff --git a/include/linux/node.h b/include/linux/node.h
index e20a203..f438c45 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -38,8 +38,9 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_block_under_node(struct memory_block *mem_blk,
 						int nid);
-extern int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
-					   unsigned long phys_index);
+extern int unregister_mem_block_section_under_nodes(
+					struct memory_block *mem_blk,
+					unsigned long sec_nr);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -67,8 +68,9 @@ static inline int register_mem_block_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_mem_block_under_nodes(struct memory_block *mem_blk,
-						  unsigned long phys_index)
+static inline int unregister_mem_block_section_under_nodes(
+						struct memory_block *mem_blk,
+						unsigned long sec_nr)
 {
 	return 0;
 }
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 18/31] drivers/base/node: add unregister_mem_block_under_nodes()
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (16 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 17/31] drivers/base/node: rename unregister_mem_blk_under_nodes() to be more accurate Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 19/31] mm: memory,memlayout: add refresh_memory_blocks() for Dynamic NUMA Cody P Schafer
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Provides similar functionality to
unregister_mem_block_section_under_nodes() (which was previously named
identically to the newly added function), but operates on all memory
sections included in the memory block, not just the specified one.
---
 drivers/base/node.c  | 53 +++++++++++++++++++++++++++++++++++++++-------------
 include/linux/node.h |  6 ++++++
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index d3f981e..2861ef6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -424,6 +424,24 @@ int register_mem_block_under_node(struct memory_block *mem_blk, int nid)
 	return 0;
 }
 
+static void unregister_mem_block_pfn_under_nodes(struct memory_block *mem_blk,
+		unsigned long pfn, nodemask_t *unlinked_nodes)
+{
+	int nid;
+
+	nid = get_nid_for_pfn(pfn);
+	if (nid < 0)
+		return;
+	if (!node_online(nid))
+		return;
+	if (node_test_and_set(nid, *unlinked_nodes))
+		return;
+	sysfs_remove_link(&node_devices[nid]->dev.kobj,
+			kobject_name(&mem_blk->dev.kobj));
+	sysfs_remove_link(&mem_blk->dev.kobj,
+			kobject_name(&node_devices[nid]->dev.kobj));
+}
+
 /*
  * unregister memory block under all nodes that a particular section it
  * contains spans
@@ -444,20 +462,29 @@ int unregister_mem_block_section_under_nodes(struct memory_block *mem_blk,
 
 	sect_start_pfn = section_nr_to_pfn(sec_num);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
-	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++)
+		unregister_mem_block_pfn_under_nodes(mem_blk, pfn,
+				unlinked_nodes);
+	NODEMASK_FREE(unlinked_nodes);
+	return 0;
+}
 
-		nid = get_nid_for_pfn(pfn);
-		if (nid < 0)
-			continue;
-		if (!node_online(nid))
-			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
-			continue;
-		sysfs_remove_link(&node_devices[nid]->dev.kobj,
-			 kobject_name(&mem_blk->dev.kobj));
-		sysfs_remove_link(&mem_blk->dev.kobj,
-			 kobject_name(&node_devices[nid]->dev.kobj));
+int unregister_mem_block_under_nodes(struct memory_block *mem_blk)
+{
+	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
+	unsigned long pfn, sect_start_pfn, sect_end_pfn, sec_num;
+
+	if (!unlinked_nodes)
+		return -ENOMEM;
+	nodes_clear(*unlinked_nodes);
+
+	for (sec_num = mem_blk->start_section_nr;
+			sec_num <= mem_blk->end_section_nr; sec_num++) {
+		sect_start_pfn = section_nr_to_pfn(sec_num);
+		sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+		for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++)
+			unregister_mem_block_pfn_under_nodes(mem_blk, pfn,
+					unlinked_nodes);
 	}
 	NODEMASK_FREE(unlinked_nodes);
 	return 0;
diff --git a/include/linux/node.h b/include/linux/node.h
index f438c45..04b464e 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -41,6 +41,7 @@ extern int register_mem_block_under_node(struct memory_block *mem_blk,
 extern int unregister_mem_block_section_under_nodes(
 					struct memory_block *mem_blk,
 					unsigned long sec_nr);
+extern int unregister_mem_block_under_nodes(struct memory_block *mem_blk);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -75,6 +76,11 @@ static inline int unregister_mem_block_section_under_nodes(
 	return 0;
 }
 
+static inline int unregister_mem_block_under_nodes(struct memory_block *mem_blk)
+{
+	return 0;
+}
+
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 						node_registration_func_t unreg)
 {
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 19/31] mm: memory,memlayout: add refresh_memory_blocks() for Dynamic NUMA.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (17 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 18/31] drivers/base/node: add unregister_mem_block_under_nodes() Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 20/31] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Properly update the sysfs info when memory blocks move between nodes
due to a Dynamic NUMA reconfiguration.
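
As an example of the intended effect (the block and node numbers here are
illustrative only), after a commit that moves a memory block to another
node the usual sysfs links should follow the new layout:

	ls /sys/devices/system/memory/memory8/   # its nodeN symlink now names the new node
	ls /sys/devices/system/node/node1/       # gains a memory8 link; the old node's link is removed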
---
 drivers/base/memory.c  | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/memory.h |  5 +++++
 mm/memlayout.c         |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 90e387c..db1b034 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -15,6 +15,7 @@
 #include <linux/device.h>
 #include <linux/init.h>
 #include <linux/kobject.h>
+#include <linux/memlayout.h>
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
@@ -700,6 +701,44 @@ bool is_memblock_offlined(struct memory_block *mem)
 	return mem->state == MEM_OFFLINE;
 }
 
+#if defined(CONFIG_DYNAMIC_NUMA)
+int refresh_memory_blocks(struct memlayout *ml)
+{
+	struct subsys_dev_iter iter;
+	struct device *dev;
+	/* XXX: 4th arg is (struct device_type *), can we spec one? */
+	mutex_lock(&mem_sysfs_mutex);
+	subsys_dev_iter_init(&iter, &memory_subsys, NULL, NULL);
+
+	while ((dev = subsys_dev_iter_next(&iter))) {
+		struct memory_block *mem_blk = container_of(dev, struct memory_block, dev);
+		unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+		unsigned long end_pfn   = section_nr_to_pfn(mem_blk->end_section_nr + 1);
+		struct rangemap_entry *rme = memlayout_pfn_to_rme_higher(ml, start_pfn);
+		unsigned long pfn = start_pfn;
+
+		if (!rme || !rme_bounds_pfn(rme, pfn)) {
+			pr_warn("memory block %s {sec %lx-%lx}, {pfn %05lx-%05lx} is not bounded by the memlayout %pK\n",
+					dev_name(dev),
+					mem_blk->start_section_nr, mem_blk->end_section_nr,
+					start_pfn, end_pfn, ml);
+			continue;
+		}
+
+		unregister_mem_block_under_nodes(mem_blk);
+
+		for (; pfn < end_pfn && rme; rme = rme_next(rme)) {
+			register_mem_block_under_node(mem_blk, rme->nid);
+			pfn = rme->pfn_end + 1;
+		}
+	}
+
+	subsys_dev_iter_exit(&iter);
+	mutex_unlock(&mem_sysfs_mutex);
+	return 0;
+}
+#endif
+
 /*
  * Initialize the sysfs support for memory devices...
  */
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 85c31a8..8f1dc43 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -143,6 +143,11 @@ enum mem_add_context { BOOT, HOTPLUG };
 #define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
 #endif
 
+#ifdef CONFIG_DYNAMIC_NUMA
+struct memlayout;
+extern int refresh_memory_blocks(struct memlayout *ml);
+#endif
+
 /*
  * 'struct memory_accessor' is a generic interface to provide
  * in-kernel access to persistent memory such as i2c or SPI EEPROMs
diff --git a/mm/memlayout.c b/mm/memlayout.c
index 0a1a602..8b9ba9a 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -9,6 +9,7 @@
 #include <linux/dnuma.h>
 #include <linux/export.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/printk.h>
 #include <linux/rbtree.h>
 #include <linux/rcupdate.h>
@@ -300,6 +301,8 @@ void memlayout_commit(struct memlayout *ml)
 	drain_all_pages();
 	/* All new page allocations now match the memlayout */
 
+	refresh_memory_blocks(ml);
+
 	mutex_unlock(&memlayout_lock);
 }
 
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 20/31] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok()
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (18 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 19/31] mm: memory,memlayout: add refresh_memory_blocks() for Dynamic NUMA Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 21/31] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

__free_pages_ok() handles higher-order (order != 0) pages. The transplant
hook is added here because this is where the struct zone to free to is
decided.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea4fda8..f33f1bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/dnuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -733,6 +734,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	int dest_nid = dnuma_page_needs_move(page);
+	struct zone *zone;
+
+	if (dest_nid != NUMA_NO_NODE)
+		zone = nid_zone(dest_nid, page_zonenum(page));
+	else
+		zone = page_zone(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
@@ -741,7 +749,11 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	__count_vm_events(PGFREE, 1 << order);
 	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	if (dest_nid != NUMA_NO_NODE)
+		dnuma_prior_free_to_new_zone(page, order, zone, dest_nid);
+	free_one_page(zone, page, order, migratetype);
+	if (dest_nid != NUMA_NO_NODE)
+		dnuma_post_free_to_new_zone(order);
 	local_irq_restore(flags);
 }
 
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 21/31] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page()
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (19 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 20/31] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 22/31] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

free_hot_cold_page() is used for order == 0 pages, and is where the
page's zone is decided.

In the normal case, these pages are freed to the per-cpu lists. When a
page needs transplanting (ie: the actual node it belongs to has changed,
and it needs to be moved to another zone), the pcp lists are skipped &
the page is freed via free_one_page().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f33f1bf..38a2161 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1358,6 +1358,7 @@ void mark_free_pages(struct zone *zone)
  */
 void free_hot_cold_page(struct page *page, int cold)
 {
+	int dest_nid;
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
@@ -1371,6 +1372,15 @@ void free_hot_cold_page(struct page *page, int cold)
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
 
+	dest_nid = dnuma_page_needs_move(page);
+	if (dest_nid != NUMA_NO_NODE) {
+		struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+		dnuma_prior_free_to_new_zone(page, 0, dest_zone, dest_nid);
+		free_one_page(dest_zone, page, 0, migratetype);
+		dnuma_post_free_to_new_zone(0);
+		goto out;
+	}
+
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
 	 * Free ISOLATE pages back to the allocator because they are being
-- 
1.8.2.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v3 22/31] page_alloc: transplant pages that are being flushed from the per-cpu lists
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (20 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 21/31] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 23/31] x86: memlayout: add an arch specific initial memlayout setter Cody P Schafer
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

In free_pcppages_bulk(), check if a page needs to be moved to a new
node/zone & then perform the transplant (in a slightly deferred manner).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38a2161..879ab9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,13 +644,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
+	struct page *pos, *page;
+	LIST_HEAD(need_move);
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
-		struct page *page;
 		struct list_head *list;
 
 		/*
@@ -673,11 +674,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 		do {
 			int mt;	/* migratetype of the to-be-freed page */
+			int dest_nid;
 
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
+
+			dest_nid = dnuma_page_needs_move(page);
+			if (dest_nid != NUMA_NO_NODE) {
+				dnuma_prior_free_to_new_zone(page, 0,
+						nid_zone(dest_nid,
+							page_zonenum(page)),
+						dest_nid);
+				list_add(&page->lru, &need_move);
+				continue;
+			}
+
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
@@ -689,6 +702,27 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--to_free && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+
+	list_for_each_entry_safe(page, pos, &need_move, lru) {
+		struct zone *dest_zone = page_zone(page);
+		int mt;
+
+		spin_lock(&dest_zone->lock);
+
+		VM_BUG_ON(dest_zone != page_zone(page));
+		pr_devel("freeing pcp page %pK with changed node\n", page);
+		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		__free_one_page(page, dest_zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+
+		/* XXX: fold into "post_free_to_new_zone()" ? */
+		if (is_migrate_cma(mt))
+			__mod_zone_page_state(dest_zone, NR_FREE_CMA_PAGES, 1);
+		dnuma_post_free_to_new_zone(0);
+
+		spin_unlock(&dest_zone->lock);
+	}
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order,
-- 
1.8.2.2


* [RFC PATCH v3 23/31] x86: memlayout: add an arch-specific initial memlayout setter.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (21 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 22/31] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 24/31] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

On x86, we already have numa_meminfo to track the NUMA layout, which is
precisely the data memlayout needs, so use it to create the initial
memlayout.
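
For reference, the blocks walked below look roughly like this (taken
from the existing x86 NUMA code, not added by this patch); start and end
are physical addresses, hence the PFN_DOWN() conversions:

	struct numa_memblk {
		u64			start;	/* physical address */
		u64			end;	/* physical address */
		int			nid;
	};

	struct numa_meminfo {
		int			nr_blks;
		struct numa_memblk	blk[NR_NODE_MEMBLKS];
	};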

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..75819ef 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/dnuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -32,6 +33,33 @@ __initdata
 #endif
 ;
 
+#ifdef CONFIG_DYNAMIC_NUMA
+void __init memlayout_global_init(void)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	int i;
+	struct numa_memblk *blk;
+	struct memlayout *ml = memlayout_create(ML_INITIAL);
+	if (WARN_ON(!ml))
+		return;
+
+	pr_devel("x86/memlayout: adding ranges from numa_meminfo\n");
+	for (i = 0; i < mi->nr_blks; i++) {
+		blk = mi->blk + i;
+		pr_devel("  adding range {%LX[%LX]-%LX[%LX]}:%d\n",
+			 PFN_DOWN(blk->start), blk->start,
+			 PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1),
+			 blk->end - 1, blk->nid);
+		memlayout_new_range(ml, PFN_DOWN(blk->start),
+				PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1),
+				blk->nid);
+	}
+	pr_devel("  done adding ranges from numa_meminfo\n");
+
+	memlayout_commit(ml);
+}
+#endif
+
 static int numa_distance_cnt;
 static u8 *numa_distance;
 
-- 
1.8.2.2


* [RFC PATCH v3 24/31] init/main: call memlayout_global_init() in start_kernel().
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (22 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 23/31] x86: memlayout: add an arch-specific initial memlayout setter Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 25/31] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

memlayout_global_init() initializes the first memlayout, which is
assumed to match the initial page-flag nid settings.

This is done in start_kernel() as the initdata used to populate the
memlayout is purged from memory early in the boot process (XXX: When?).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 init/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/init/main.c b/init/main.c
index ceed17a..8ed5209 100644
--- a/init/main.c
+++ b/init/main.c
@@ -74,6 +74,7 @@
 #include <linux/ptrace.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
+#include <linux/memlayout.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -613,6 +614,7 @@ asmlinkage void __init start_kernel(void)
 	security_init();
 	dbg_late_init();
 	vfs_caches_init(totalram_pages);
+	memlayout_global_init();
 	signals_init();
 	/* rootfs populating might need page-writeback */
 	page_writeback_init();
-- 
1.8.2.2


* [RFC PATCH v3 25/31] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (23 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 24/31] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 26/31] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memlayout.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memlayout.c b/mm/memlayout.c
index 8b9ba9a..3e89482 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -336,3 +336,19 @@ void memlayout_global_init(void)
 
 	memlayout_commit(ml);
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Provides a default memory_add_physaddr_to_nid() for memory hotplug, unless
+ * overridden by the arch.
+ */
+__weak
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = memlayout_pfn_to_nid(PFN_DOWN(start));
+	if (nid == NUMA_NO_NODE)
+		return 0;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
-- 
1.8.2.2


* [RFC PATCH v3 26/31] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (24 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 25/31] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:00 ` [RFC PATCH v3 27/31] mm/memory_hotplug: VM_BUG if nid is too large Cody P Schafer
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

When a memlayout is tracked (i.e., CONFIG_DYNAMIC_NUMA is enabled), rather
than iterating over numa_meminfo, the lookup can be done via memlayout.
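
For contrast, the numa_meminfo scan that is compiled out here looks
roughly like the following (paraphrased from the existing
arch/x86/mm/numa.c, not added by this patch); with CONFIG_DYNAMIC_NUMA,
the __weak memlayout-based fallback from the previous patch provides the
symbol instead:

	int memory_add_physaddr_to_nid(u64 start)
	{
		struct numa_meminfo *mi = &numa_meminfo;
		int nid = mi->blk[0].nid;
		int i;

		/* linear scan of the boot-time block table */
		for (i = 0; i < mi->nr_blks; i++)
			if (mi->blk[i].start <= start &&
			    mi->blk[i].end > start)
				nid = mi->blk[i].nid;
		return nid;
	}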

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 75819ef..f1609c0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -28,7 +28,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
 static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
+#if !defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DYNAMIC_NUMA)
 __initdata
 #endif
 ;
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && !defined(CONFIG_DYNAMIC_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
 	struct numa_meminfo *mi = &numa_meminfo;
-- 
1.8.2.2


* [RFC PATCH v3 27/31] mm/memory_hotplug: VM_BUG if nid is too large.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (25 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 26/31] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
@ 2013-05-03  0:00 ` Cody P Schafer
  2013-05-03  0:01 ` [RFC PATCH v3 28/31] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions Cody P Schafer
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:00 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8e6658d..320d914 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1071,6 +1071,8 @@ int __mem_online_node(int nid)
 	pg_data_t *pgdat;
 	int ret;
 
+	VM_BUG_ON(nid >= nr_node_ids);
+
 	pgdat = hotadd_new_pgdat(nid, 0);
 	if (!pgdat)
 		return -ENOMEM;
-- 
1.8.2.2


* [RFC PATCH v3 28/31] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (26 preceding siblings ...)
  2013-05-03  0:00 ` [RFC PATCH v3 27/31] mm/memory_hotplug: VM_BUG if nid is too large Cody P Schafer
@ 2013-05-03  0:01 ` Cody P Schafer
  2013-05-03  0:01 ` [RFC PATCH v3 29/31] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful Cody P Schafer
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:01 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

With some code that expands the zone boundaries, VM_BUG_ON(bad_range()) was being triggered.

Previously, page_outside_zone_boundaries() decided that once it detected a
page outside the boundaries, it was certainly outside, even if the seqlock
indicated the data was invalid and needed to be reread. This methodology
_almost_ works because zones are only ever grown. However, because the
zone span is stored as a start and a length, some expansions momentarily
appear as shifts to the left (when zone_start_pfn is assigned prior to
spanned_pages).

If we want to remove the seqlock around zone_start_pfn and
zone->spanned_pages, always writing spanned_pages first, issuing a memory
barrier, and then writing the new zone_start_pfn _may_ work. The concern
there is that we could be seen as shrinking the span when zone_start_pfn
is written (the entire span would shift to the left). As there will be no
pages in the excess span that actually belong to the zone being
manipulated, I don't expect there to be issues.
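
A minimal sketch of that alternative (hypothetical, not part of this
patch; the helper name is made up):

	/* writer: grow the zone span without the span seqlock */
	static void zone_grow_span(struct zone *zone,
				   unsigned long new_start_pfn,
				   unsigned long new_spanned_pages)
	{
		/* publish the (larger) length first ... */
		zone->spanned_pages = new_spanned_pages;
		/* ... and make it visible before the start moves left */
		smp_wmb();
		zone->zone_start_pfn = new_start_pfn;
	}

Readers would pair with smp_rmb(), loading zone_start_pfn before
spanned_pages, so a reader that sees the new start is guaranteed to also
see the new length and never sees the new start combined with the old,
shorter length.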

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 879ab9d..3695ca5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -239,12 +239,13 @@ bool oom_killer_disabled __read_mostly;
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
-	int ret = 0;
+	int ret;
 	unsigned seq;
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long sp, start_pfn;
 
 	do {
+		ret = 0;
 		seq = zone_span_seqbegin(zone);
 		start_pfn = zone->zone_start_pfn;
 		sp = zone->spanned_pages;
-- 
1.8.2.2


* [RFC PATCH v3 29/31] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (27 preceding siblings ...)
  2013-05-03  0:01 ` [RFC PATCH v3 28/31] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions Cody P Schafer
@ 2013-05-03  0:01 ` Cody P Schafer
  2013-05-03  0:01 ` [RFC PATCH v3 30/31] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order() Cody P Schafer
  2013-05-03  0:01 ` [RFC PATCH v3 31/31] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage Cody P Schafer
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:01 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3695ca5..4fe35b24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -254,8 +254,11 @@ static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 	} while (zone_span_seqretry(zone, seq));
 
 	if (ret)
-		pr_err("page %lu outside zone [ %lu - %lu ]\n",
-			pfn, start_pfn, start_pfn + sp);
+		pr_err("page with pfn %05lx outside zone %s with pfn range {%05lx-%05lx} in node %d with pfn range {%05lx-%05lx}\n",
+			pfn, zone->name, start_pfn, start_pfn + sp,
+			zone->zone_pgdat->node_id,
+			zone->zone_pgdat->node_start_pfn,
+			pgdat_end_pfn(zone->zone_pgdat));
 
 	return ret;
 }
-- 
1.8.2.2


* [RFC PATCH v3 30/31] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order()
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (28 preceding siblings ...)
  2013-05-03  0:01 ` [RFC PATCH v3 29/31] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful Cody P Schafer
@ 2013-05-03  0:01 ` Cody P Schafer
  2013-05-03  0:01 ` [RFC PATCH v3 31/31] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage Cody P Schafer
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:01 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4fe35b24..cc7b332 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3514,8 +3514,8 @@ static int default_zonelist_order(void)
 			z = &NODE_DATA(nid)->node_zones[zone_type];
 			if (populated_zone(z)) {
 				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->present_pages;
-				total_size += z->present_pages;
+					low_kmem_size += z->managed_pages;
+				total_size += z->managed_pages;
 			} else if (zone_type == ZONE_NORMAL) {
 				/*
 				 * If any node has only lowmem, then node order
@@ -3545,8 +3545,8 @@ static int default_zonelist_order(void)
 			z = &NODE_DATA(nid)->node_zones[zone_type];
 			if (populated_zone(z)) {
 				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->present_pages;
-				total_size += z->present_pages;
+					low_kmem_size += z->managed_pages;
+				total_size += z->managed_pages;
 			}
 		}
 		if (low_kmem_size &&
-- 
1.8.2.2


* [RFC PATCH v3 31/31] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage.
  2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
                   ` (29 preceding siblings ...)
  2013-05-03  0:01 ` [RFC PATCH v3 30/31] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order() Cody P Schafer
@ 2013-05-03  0:01 ` Cody P Schafer
  30 siblings, 0 replies; 32+ messages in thread
From: Cody P Schafer @ 2013-05-03  0:01 UTC (permalink / raw)
  To: Linux MM; +Cc: LKML, Cody P Schafer, Simon Jeons

For Dynamic NUMA, the hypervisor we're running under will sometimes want
to split a single NUMA node into multiple NUMA nodes. If the number of
NUMA nodes is limited to the number available when the system booted (as
it is on x86), we may not be able to fully adopt the new memory layout
provided by the hypervisor.

This option allows reserving some extra node IDs as a percentage of the
boot-time node IDs. While not perfect (ideally nr_node_ids would be fully
dynamic), this allows decent functionality without invasive changes to
the SL{U,A}B allocators.
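
For example, on a machine that boots with 4 populated nodes,
extra_nr_node_ids=50 reserves DIV_ROUND_UP(4 * 50, 100) = 2 extra IDs:
nr_node_ids becomes 6 and nodes 4 and 5 are added to node_possible_map
(always capped at MAX_NUMNODES).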

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 mm/page_alloc.c                     | 24 ++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9653cf2..c606371 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2082,6 +2082,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use hotplug cpu feature to put more cpu back to online.
 			just like you compile the kernel NR_CPUS=n
 
+	extra_nr_node_ids= [NUMA] Increase the maximum number of NUMA nodes
+			above the number detected at boot by the specified
+			percentage (rounded up). For example:
+			extra_nr_node_ids=100 would double the number of
+			node_ids available (up to a max of MAX_NUMNODES).
+
 	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
 
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cc7b332..1fd2f2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4837,6 +4837,17 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
+
+static unsigned nr_node_ids_mod_percent;
+static int __init setup_extra_nr_node_ids(char *arg)
+{
+	int r = kstrtouint(arg, 10, &nr_node_ids_mod_percent);
+	if (r)
+		pr_err("invalid param value extra_nr_node_ids=\"%s\"\n", arg);
+	return 0;
+}
+early_param("extra_nr_node_ids", setup_extra_nr_node_ids);
+
 /*
  * Figure out the number of possible node ids.
  */
@@ -4848,6 +4859,19 @@ void __init setup_nr_node_ids(void)
 	for_each_node_mask(node, node_possible_map)
 		highest = node;
 	nr_node_ids = highest + 1;
+
+	/*
+	 * expand nr_node_ids and node_possible_map so more can be onlined
+	 * later
+	 */
+	nr_node_ids +=
+		DIV_ROUND_UP(nr_node_ids * nr_node_ids_mod_percent, 100);
+
+	if (nr_node_ids > MAX_NUMNODES)
+		nr_node_ids = MAX_NUMNODES;
+
+	for (node = highest + 1; node < nr_node_ids; node++)
+		node_set(node, node_possible_map);
 }
 #endif
 
-- 
1.8.2.2


Thread overview: 32+ messages
2013-05-03  0:00 [RFC PATCH v3 00/31] Dynamic NUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 01/31] rbtree: add postorder iteration functions Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 02/31] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 03/31] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 04/31] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 05/31] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 06/31] mm: add nid_zone() helper Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 07/31] mm: Add Dynamic NUMA Kconfig Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 08/31] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 09/31] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 10/31] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 11/31] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 12/31] memory_hotplug: factor out locks in mem_online_cpu() Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 13/31] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 14/31] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 15/31] drivers/base/memory.c: alphabetize headers Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 16/31] drivers/base/node,memory: rename function to match interface Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 17/31] drivers/base/node: rename unregister_mem_blk_under_nodes() to be more accurate Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 18/31] drivers/base/node: add unregister_mem_block_under_nodes() Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 19/31] mm: memory,memlayout: add refresh_memory_blocks() for Dynamic NUMA Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 20/31] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 21/31] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 22/31] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 23/31] x86: memlayout: add an arch-specific initial memlayout setter Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 24/31] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 25/31] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 26/31] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
2013-05-03  0:00 ` [RFC PATCH v3 27/31] mm/memory_hotplug: VM_BUG if nid is too large Cody P Schafer
2013-05-03  0:01 ` [RFC PATCH v3 28/31] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions Cody P Schafer
2013-05-03  0:01 ` [RFC PATCH v3 29/31] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful Cody P Schafer
2013-05-03  0:01 ` [RFC PATCH v3 30/31] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order() Cody P Schafer
2013-05-03  0:01 ` [RFC PATCH v3 31/31] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage Cody P Schafer
