linuxppc-dev.lists.ozlabs.org archive mirror
* [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
@ 2015-03-05 18:05 Nishanth Aravamudan
  2015-03-05 21:16 ` David Rientjes
  2015-03-05 22:13 ` [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Tejun Heo
  0 siblings, 2 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 18:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, David Rientjes,
	Tejun Heo, linuxppc-dev

Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

One could alternatively use nodes_and(node_possible_map,
node_possible_map, node_online_map), but I think the cost of ANDing the
two maps will always be higher than clearing the map and setting a few
bits in practice.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
While looking at this, I noticed that nr_node_ids is actually a
misnomer, it seems. It's not the number, but the maximum_node_id, as
with sparse NUMA nodes, you might only have two NUMA nodes possible, but
to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
changed?

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * zero out the possible nodes after we parse the device-tree,
+	 * so that we lower the maximum NUMA node ID to what is actually
+	 * present.
+	 */
+	nodes_clear(node_possible_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
+		node_set(nid, node_possible_map);
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 		setup_node_data(nid, start_pfn, end_pfn);
 		sparse_memory_present_with_active_regions(nid);
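
For a rough sense of scale, here is a minimal userspace sketch (not kernel
code) of how the size of the possible map drives a for_each_node-style
allocation. It assumes, as the kmalloc-2048 slab name suggests, one
2048-byte object per possible node per cgroup. toy_nodemask,
PER_NODE_OBJ_SIZE, popcount_nodes() and per_node_alloc_bytes() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NODES_SHIFT        8                    /* value quoted above for power */
#define MAX_NUMNODES       (1 << NODES_SHIFT)   /* 256 */
#define PER_NODE_OBJ_SIZE  2048                 /* assumption: one kmalloc-2048 object per node */

/* Toy stand-in for nodemask_t: one byte per node, nonzero means "set". */
typedef struct { unsigned char bits[MAX_NUMNODES]; } toy_nodemask;

/* Count the nodes set in the mask. */
static size_t popcount_nodes(const toy_nodemask *m)
{
	size_t n = 0;

	assert(m != NULL);
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		n += m->bits[nid] ? 1 : 0;
	return n;
}

/* Bytes consumed if every cgroup allocates one object per possible node. */
static size_t per_node_alloc_bytes(const toy_nodemask *possible, size_t ncgroups)
{
	return ncgroups * popcount_nodes(possible) * PER_NODE_OBJ_SIZE;
}
```

Under that assumption, 400 cgroups with all 256 nodes possible come to
400 * 256 * 2048 = 209,715,200 bytes (200 MiB), in line with the figure in
the changelog. With only two possible nodes the same loop would consume
1,638,400 bytes.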


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 18:05 [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Nishanth Aravamudan
@ 2015-03-05 21:16 ` David Rientjes
  2015-03-05 21:48   ` Michael Ellerman
  2015-03-05 23:15   ` Nishanth Aravamudan
  2015-03-05 22:13 ` [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Tejun Heo
  1 sibling, 2 replies; 18+ messages in thread
From: David Rientjes @ 2015-03-05 21:16 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, Tejun Heo,
	linuxppc-dev

On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..24de29b3651b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,9 +958,17 @@ void __init initmem_init(void)
>  
>  	memblock_dump_all();
>  
> +	/*
> +	 * zero out the possible nodes after we parse the device-tree,
> +	 * so that we lower the maximum NUMA node ID to what is actually
> +	 * present.
> +	 */
> +	nodes_clear(node_possible_map);
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;
>  
> +		node_set(nid, node_possible_map);
>  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  		setup_node_data(nid, start_pfn, end_pfn);
>  		sparse_memory_present_with_active_regions(nid);

This seems a bit strange, node_possible_map is supposed to be a superset 
of node_online_map and this loop is iterating over node_online_map to set 
nodes in node_possible_map.


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:16 ` David Rientjes
@ 2015-03-05 21:48   ` Michael Ellerman
  2015-03-05 21:58     ` David Rientjes
  2015-03-05 23:17     ` Nishanth Aravamudan
  2015-03-05 23:15   ` Nishanth Aravamudan
  1 sibling, 2 replies; 18+ messages in thread
From: Michael Ellerman @ 2015-03-05 21:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, Raghavendra K T, Paul Mackerras,
	Anton Blanchard, Tejun Heo, linuxppc-dev

On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..24de29b3651b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> >  
> >  	memblock_dump_all();
> >  
> > +	/*
> > +	 * zero out the possible nodes after we parse the device-tree,
> > +	 * so that we lower the maximum NUMA node ID to what is actually
> > +	 * present.
> > +	 */
> > +	nodes_clear(node_possible_map);
> > +
> >  	for_each_online_node(nid) {
> >  		unsigned long start_pfn, end_pfn;
> >  
> > +		node_set(nid, node_possible_map);
> >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >  		setup_node_data(nid, start_pfn, end_pfn);
> >  		sparse_memory_present_with_active_regions(nid);
> 
> This seems a bit strange, node_possible_map is supposed to be a superset 
> of node_online_map and this loop is iterating over node_online_map to set 
> nodes in node_possible_map.
 
Yeah. Though at this point in boot I don't think it matters that the two maps
are out-of-sync temporarily.

But it would be simpler to just set the possible map to be the online map. That
would also maintain the invariant that the possible map is always a superset of
the online map.

Or did I miss a detail there (sleep-deprived parent mode)?

cheers


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:48   ` Michael Ellerman
@ 2015-03-05 21:58     ` David Rientjes
  2015-03-05 22:08       ` Tejun Heo
  2015-03-05 23:20       ` Nishanth Aravamudan
  2015-03-05 23:17     ` Nishanth Aravamudan
  1 sibling, 2 replies; 18+ messages in thread
From: David Rientjes @ 2015-03-05 21:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nishanth Aravamudan, Raghavendra K T, Paul Mackerras,
	Anton Blanchard, Tejun Heo, linuxppc-dev

On Fri, 6 Mar 2015, Michael Ellerman wrote:

> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >  
> > >  	memblock_dump_all();
> > >  
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >  
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> > 
> > This seems a bit strange, node_possible_map is supposed to be a superset 
> > of node_online_map and this loop is iterating over node_online_map to set 
> > nodes in node_possible_map.
>  
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
> 
> But it would be simpler to just set the possible map to be the online map. That
> would also maintain the invariant that the possible map is always a superset of
> the online map.
> 
> Or did I miss a detail there (sleep-deprived parent mode)?
> 

I think reset_numa_cpu_lookup_table() which iterates over the possible 
map, and thus only a subset of nodes now, may be concerning.

I'm not sure why this is being proposed as a powerpc patch and not a patch 
for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
for all possible nodes?  We should only be allocating for online nodes in 
N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
transition from memoryless -> memory.  The extra bonus is that 
alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
TODO in that function can be removed.
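
The scheme described above can be sketched in userspace as follows. This
is only an illustration of the idea (allocate per-node data for nodes that
have memory now, and allocate lazily when a node gains memory), not the
memcg implementation. Every name here (toy_memcg, toy_per_node_info,
toy_node_memory_online() and so on) is hypothetical, and calloc() stands
in for a node-local kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NUMNODES 256

/* Hypothetical per-node payload; the real structure is not shown in this thread. */
struct toy_per_node_info { long stats[4]; };

struct toy_memcg {
	struct toy_per_node_info *nodeinfo[MAX_NUMNODES];
};

/* Allocate the per-node info for nid if it does not exist yet. 0 on success. */
static int toy_alloc_node_info(struct toy_memcg *memcg, int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	if (memcg->nodeinfo[nid])
		return 0;
	memcg->nodeinfo[nid] = calloc(1, sizeof(*memcg->nodeinfo[nid]));
	return memcg->nodeinfo[nid] ? 0 : -1;
}

/* At creation time, allocate only for nodes that currently have memory. */
static int toy_memcg_init(struct toy_memcg *memcg, const unsigned char *has_memory)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		memcg->nodeinfo[nid] = NULL;
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (has_memory[nid] && toy_alloc_node_info(memcg, nid))
			return -1;
	return 0;
}

/* Hypothetical hotplug callback: nid went from memoryless to having memory. */
static int toy_node_memory_online(struct toy_memcg *memcg, int nid)
{
	return toy_alloc_node_info(memcg, nid);
}

/* Number of nodes that currently have per-node info allocated. */
static int toy_count_allocated(const struct toy_memcg *memcg)
{
	int n = 0;

	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		n += memcg->nodeinfo[nid] ? 1 : 0;
	return n;
}

/* Release all per-node info. */
static void toy_memcg_free(struct toy_memcg *memcg)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++) {
		free(memcg->nodeinfo[nid]);
		memcg->nodeinfo[nid] = NULL;
	}
}
```

In this sketch the callback is idempotent, so a repeated notification for
the same node does not allocate twice.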


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:58     ` David Rientjes
@ 2015-03-05 22:08       ` Tejun Heo
  2015-03-05 22:18         ` Tejun Heo
  2015-03-05 23:21         ` Nishanth Aravamudan
  2015-03-05 23:20       ` Nishanth Aravamudan
  1 sibling, 2 replies; 18+ messages in thread
From: Tejun Heo @ 2015-03-05 22:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, Raghavendra K T, Paul Mackerras,
	Anton Blanchard, linuxppc-dev

Hello,

On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> I'm not sure why this is being proposed as a powerpc patch and not a patch 
> for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
> for all possible nodes?  We should only be allocating for online nodes in 
> N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
> callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
> transition from memoryless -> memory.  The extra bonus is that 
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
> TODO in that function can be removed.

For cpus, the general direction is allocating for all possible cpus.
For iterations, we alternate between using all possibles and onlines
depending on the use case; however, the general idea is that the
possibles and onlines aren't gonna be very different.  NR_CPUS and
MAX_NUMNODES gotta accommodate the worst possible case the kernel may
run on but the possible masks should be set to the actually possible
subset during boot so that the kernel doesn't end up allocating for and
iterating over things which can't ever exist.

It can be argued that we should always stick to the online masks for
allocation and iteration; however, that usually requires more
complexity and the only cases where this mattered have been when the
boot code got it wrong and failed to set the possible masks correctly,
which also seems to be the case here.  I don't see any reason to
deviate here.

Thanks.

-- 
tejun


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 18:05 [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Nishanth Aravamudan
  2015-03-05 21:16 ` David Rientjes
@ 2015-03-05 22:13 ` Tejun Heo
  2015-03-05 23:27   ` Nishanth Aravamudan
  1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2015-03-05 22:13 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Raghavendra K T, David Rientjes, Paul Mackerras, Anton Blanchard,
	linuxppc-dev

On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> While looking at this, I noticed that nr_node_ids is actually a
> misnomer, it seems. It's not the number, but the maximum_node_id, as
> with sparse NUMA nodes, you might only have two NUMA nodes possible, but
> to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
> changed?

It's the same for nr_cpu_ids.  It's counting the number of valid IDs
during that boot instance.  In the above case, whether the nodes are
sparse or not, there exist 17 node ids - 0 to 16.  Maybe numa_max_id
would have been a better name (but would that equal the highest number or
+1?) but nr_node_ids != nr_nodes so I don't think it's a misnomer
either.  Doesn't really matter at this point.  Maybe add comments on
top of both?

Thanks.

-- 
tejun
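
To make the distinction concrete, here is a small userspace sketch (not
kernel code) that derives both quantities from a sparse mask. It assumes
nr_node_ids is the highest possible node ID plus one, as described above.
toy_nodemask, toy_num_nodes() and toy_nr_node_ids() are hypothetical names
for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 256

/* Toy stand-in for nodemask_t: one byte per node, nonzero means "set". */
typedef struct { unsigned char bits[MAX_NUMNODES]; } toy_nodemask;

/* How many nodes are set in the mask (the "number of nodes"). */
static int toy_num_nodes(const toy_nodemask *m)
{
	int n = 0;

	assert(m != NULL);
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		n += m->bits[nid] ? 1 : 0;
	return n;
}

/* How many valid node IDs exist: highest set ID + 1 (0 for an empty mask). */
static int toy_nr_node_ids(const toy_nodemask *m)
{
	int highest = -1;

	assert(m != NULL);
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (m->bits[nid])
			highest = nid;
	return highest + 1;
}
```

For a mask containing only nodes 0 and 16, toy_num_nodes() gives 2 while
toy_nr_node_ids() gives 17, which is the sparse case from the question.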


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 22:08       ` Tejun Heo
@ 2015-03-05 22:18         ` Tejun Heo
  2015-03-05 23:21         ` Nishanth Aravamudan
  1 sibling, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2015-03-05 22:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, Raghavendra K T, Paul Mackerras,
	Anton Blanchard, linuxppc-dev

On Thu, Mar 05, 2015 at 05:08:04PM -0500, Tejun Heo wrote:
> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here.  I don't see any reason to
> deviate here.

Hmm... but yeah, as you wrote, keeping the allocation local could be a
reason but let's please not do this just to reduce memory consumption.
If memory locality of the field affects performance noticeably, sure.

Thanks.

-- 
tejun


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:16 ` David Rientjes
  2015-03-05 21:48   ` Michael Ellerman
@ 2015-03-05 23:15   ` Nishanth Aravamudan
  2015-03-05 23:29     ` David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 23:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, linuxppc-dev, Raghavendra K T, Paul Mackerras,
	Anton Blanchard

Hi David,

On 05.03.2015 [13:16:35 -0800], David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..24de29b3651b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> >  
> >  	memblock_dump_all();
> >  
> > +	/*
> > +	 * zero out the possible nodes after we parse the device-tree,
> > +	 * so that we lower the maximum NUMA node ID to what is actually
> > +	 * present.
> > +	 */
> > +	nodes_clear(node_possible_map);
> > +
> >  	for_each_online_node(nid) {
> >  		unsigned long start_pfn, end_pfn;
> >  
> > +		node_set(nid, node_possible_map);
> >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >  		setup_node_data(nid, start_pfn, end_pfn);
> >  		sparse_memory_present_with_active_regions(nid);
> 
> This seems a bit strange, node_possible_map is supposed to be a superset 
> of node_online_map and this loop is iterating over node_online_map to set 
> nodes in node_possible_map.

So if we compare to x86:

arch/x86/mm/numa.c::numa_init():

        nodes_clear(numa_nodes_parsed);
        nodes_clear(node_possible_map);
        nodes_clear(node_online_map);
	...
	numa_register_memblks(...);

arch/x86/mm/numa.c::numa_register_memblks():

	node_possible_map = numa_nodes_parsed;

Basically, it looks like x86 NUMA init clears out possible map and
online map, probably for a similar reason to what I gave in the
changelog that by default, the possible map seems to be based off
MAX_NUMNODES, rather than nr_node_ids or anything dynamic.

My patch was an attempt to emulate the same thing on powerpc. You are
right that there is a window in which the node_possible_map and
node_online_map are out of sync with my patch. It seems like it
shouldn't matter given how early in boot we are, but perhaps the
following would have been clearer:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
        memblock_dump_all();
 
+       /*
+        * Reduce the possible NUMA nodes to the online NUMA nodes,
+        * since we do not support node hotplug. This ensures that  we
+        * lower the maximum NUMA node ID to what is actually present.
+        */
+       nodes_and(node_possible_map, node_possible_map, node_online_map);
+
        for_each_online_node(nid) {
                unsigned long start_pfn, end_pfn;
 


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:48   ` Michael Ellerman
  2015-03-05 21:58     ` David Rientjes
@ 2015-03-05 23:17     ` Nishanth Aravamudan
  1 sibling, 0 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 23:17 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, David Rientjes,
	Tejun Heo, linuxppc-dev

On 06.03.2015 [08:48:52 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> > On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> > 
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >  
> > >  	memblock_dump_all();
> > >  
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >  
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> > 
> > This seems a bit strange, node_possible_map is supposed to be a superset 
> > of node_online_map and this loop is iterating over node_online_map to set 
> > nodes in node_possible_map.
>  
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
> 
> But it would be simpler to just set the possible map to be the online
> map. That would also maintain the invariant that the possible map is
> always a superset of the online map.

Yes, we could do that (see my reply to David just now). I didn't
consider just setting the map directly; that would be clearer. I didn't
want to post my nodes_and() version, because the cost of nodes_and()
seemed higher than that of nodes_clear() plus node_set() for each
online node.

-Nish


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 21:58     ` David Rientjes
  2015-03-05 22:08       ` Tejun Heo
@ 2015-03-05 23:20       ` Nishanth Aravamudan
  1 sibling, 0 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 23:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, Tejun Heo,
	linuxppc-dev

On 05.03.2015 [13:58:27 -0800], David Rientjes wrote:
> On Fri, 6 Mar 2015, Michael Ellerman wrote:
> 
> > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > > index 0257a7d659ef..24de29b3651b 100644
> > > > --- a/arch/powerpc/mm/numa.c
> > > > +++ b/arch/powerpc/mm/numa.c
> > > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > > >  
> > > >  	memblock_dump_all();
> > > >  
> > > > +	/*
> > > > +	 * zero out the possible nodes after we parse the device-tree,
> > > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > > +	 * present.
> > > > +	 */
> > > > +	nodes_clear(node_possible_map);
> > > > +
> > > >  	for_each_online_node(nid) {
> > > >  		unsigned long start_pfn, end_pfn;
> > > >  
> > > > +		node_set(nid, node_possible_map);
> > > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > > >  		setup_node_data(nid, start_pfn, end_pfn);
> > > >  		sparse_memory_present_with_active_regions(nid);
> > > 
> > > This seems a bit strange, node_possible_map is supposed to be a superset 
> > > of node_online_map and this loop is iterating over node_online_map to set 
> > > nodes in node_possible_map.
> >  
> > Yeah. Though at this point in boot I don't think it matters that the
> > two maps are out-of-sync temporarily.
> > 
> > But it would be simpler to just set the possible map to be the online
> > map. That would also maintain the invariant that the possible map is
> > always a superset of the online map.
> > 
> > Or did I miss a detail there (sleep-deprived parent mode)?
> > 
> 
> I think reset_numa_cpu_lookup_table() which iterates over the possible
> map, and thus only a subset of nodes now, may be concerning.


I think you are confusing the CPU possible map and the NUMA node possible
map. reset_numa_cpu_lookup_table() resets the cpu->node mapping, is only
called at boot time, and iterates over the CPU possible map, which is
unaltered by my patch.

> I'm not sure why this is being proposed as a powerpc patch and not a
> patch for mem_cgroup_css_alloc().

I think mem_cgroup_css_alloc() is just an example of a larger issue. I
should have made that clearer in my changelog. Even if we change
mem_cgroup_css_alloc(), I think we want to fix the node_possible_map on
powerpc to be accurate at run-time, just like x86 does.

> In other words, why do we have to allocate for all possible nodes?  We
> should only be allocating for online nodes in N_MEMORY with mem
> hotplug disabled initially and then have a mem hotplug callback
> implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> transition from memoryless -> memory.  The extra bonus is that
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and
> the TODO in that function can be removed.

This is a good idea, and seems like it can be a follow-on parallel patch
to the one I provided (which does need an updated changelog now).

Thanks,
Nish


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 22:08       ` Tejun Heo
  2015-03-05 22:18         ` Tejun Heo
@ 2015-03-05 23:21         ` Nishanth Aravamudan
  2015-03-05 23:24           ` Tejun Heo
  1 sibling, 1 reply; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 23:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linuxppc-dev, Raghavendra K T, Paul Mackerras, Anton Blanchard,
	David Rientjes

On 05.03.2015 [17:08:04 -0500], Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> > I'm not sure why this is being proposed as a powerpc patch and not a patch 
> > for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
> > for all possible nodes?  We should only be allocating for online nodes in 
> > N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
> > callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
> > transition from memoryless -> memory.  The extra bonus is that 
> > alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
> > TODO in that function can be removed.
> 
> For cpus, the general direction is allocating for all possible cpus.
> For iterations, we alternate between using all possibles and onlines
> depending on the use case; however, the general idea is that the
> possibles and onlines aren't gonna be very different.  NR_CPUS and
> MAX_NUMNODES gotta accommodate the worst possible case the kernel may
> run on but the possible masks should be set to the actually possible
> subset during boot so that the kernel doesn't end up allocating for and
> iterating over things which can't ever exist.

Makes sense to me.

> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here.  I don't see any reason to
> deviate here.

So, do you agree with the general direction of my change? :)

Thanks,
Nish


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 23:21         ` Nishanth Aravamudan
@ 2015-03-05 23:24           ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2015-03-05 23:24 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: linuxppc-dev, Raghavendra K T, Paul Mackerras, Anton Blanchard,
	David Rientjes

On Thu, Mar 05, 2015 at 03:21:35PM -0800, Nishanth Aravamudan wrote:
> So, do you agree with the general direction of my change? :)

Yeah, I mean it's an obvious bug fix.  I don't know when or how it
should be set on powerpc but if the machine can't do NUMA node
hotplug, its node online and possible masks must be equal.

Thanks.

-- 
tejun


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 22:13 ` [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Tejun Heo
@ 2015-03-05 23:27   ` Nishanth Aravamudan
  0 siblings, 0 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-05 23:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linuxppc-dev, Raghavendra K T, Paul Mackerras, Anton Blanchard,
	David Rientjes

On 05.03.2015 [17:13:08 -0500], Tejun Heo wrote:
> On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> > While looking at this, I noticed that nr_node_ids is actually a
> > misnomer, it seems. It's not the number, but the maximum_node_id, as
> > with sparse NUMA nodes, you might only have two NUMA nodes possible, but
> > to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
> > changed?
> 
> It's the same for nr_cpu_ids.  It's counting the number of valid IDs
> during that boot instance.  In the above case, whether the nodes are
> sparse or not, there exist 17 node ids - 0 to 16.  Maybe numa_max_id
> would have been a better name (but would that equal the highest number or
> +1?) but nr_node_ids != nr_nodes so I don't think it's a misnomer
> either.  Doesn't really matter at this point.  Maybe add comments on
> top of both?

Yes, I will consider that. To me, I guess it's more a matter of:

a) How does nr_node_ids relate to the number of possible NUMA node IDs
at runtime?

They are identical.

b) How does nr_node_ids relate to the number of NUMA node IDs in use?

There is no relation.

c) How does nr_node_ids relate to the maximum NUMA node ID in use?

It is one larger than that value.

However, for a), at least, we don't care about that on power, really. We
don't have node hotplug, so the "possible" is the "online" in practice,
for a given system.

Iteration seems to generally not be a problem (since we have sparse
iterators anyways) and we shouldn't be allocating for non-present nodes.

But we run into excessive allocations (I'm looking into a few others
Dipankar has found now) with array allocations based off nr_node_ids or
MAX_NUMNODES when the NUMA topology is sparse.

-Nish


* Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
  2015-03-05 23:15   ` Nishanth Aravamudan
@ 2015-03-05 23:29     ` David Rientjes
  2015-03-06  5:27       ` [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot Nishanth Aravamudan
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2015-03-05 23:29 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Tejun Heo, linuxppc-dev, Raghavendra K T, Paul Mackerras,
	Anton Blanchard

On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

> So if we compare to x86:
> 
> arch/x86/mm/numa.c::numa_init():
> 
>         nodes_clear(numa_nodes_parsed);
>         nodes_clear(node_possible_map);
>         nodes_clear(node_online_map);
> 	...
> 	numa_register_memblks(...);
> 
> arch/x86/mm/numa.c::numa_register_memblks():
> 
> 	node_possible_map = numa_nodes_parsed;
> 
> Basically, it looks like x86 NUMA init clears out possible map and
> online map, probably for a similar reason to what I gave in the
> changelog that by default, the possible map seems to be based off
> MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
> 
> My patch was an attempt to emulate the same thing on powerpc. You are
> right that there is a window in which the node_possible_map and
> node_online_map are out of sync with my patch. It seems like it
> shouldn't matter given how early in boot we are, but perhaps the
> following would have been clearer:
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..1a118b08fad2 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>  
>         memblock_dump_all();
>  
> +       /*
> +        * Reduce the possible NUMA nodes to the online NUMA nodes,
> +        * since we do not support node hotplug. This ensures that  we
> +        * lower the maximum NUMA node ID to what is actually present.
> +        */
> +       nodes_and(node_possible_map, node_possible_map, node_online_map);

If you don't support node hotplug, then a node should always be possible 
if it's online unless there are other tricks powerpc plays with 
node_possible_map.  Shouldn't this just be 
node_possible_map = node_online_map?

> +
>         for_each_online_node(nid) {
>                 unsigned long start_pfn, end_pfn;
>  
> 
> 
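
The three ways of narrowing the possible map discussed in this thread can
be compared with a small userspace sketch. The toy_* helpers below are
hypothetical stand-ins for the kernel's nodemask operations (nodes_clear,
node_set, nodes_and and plain struct assignment), written only to show
that the variants agree when every online node is already possible.

```c
#include <assert.h>
#include <string.h>

#define MAX_NUMNODES 256

/* Toy stand-in for nodemask_t: one byte per node, always 0 or 1. */
typedef struct { unsigned char bits[MAX_NUMNODES]; } toy_nodemask;

static void toy_nodes_clear(toy_nodemask *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

static void toy_node_set(int nid, toy_nodemask *m)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	m->bits[nid] = 1;
}

static void toy_nodes_and(toy_nodemask *dst, const toy_nodemask *a,
			  const toy_nodemask *b)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		dst->bits[nid] = a->bits[nid] && b->bits[nid];
}

static int toy_nodes_equal(const toy_nodemask *a, const toy_nodemask *b)
{
	return memcmp(a->bits, b->bits, sizeof(a->bits)) == 0;
}

/* First version of the patch: clear, then set each online node. */
static void narrow_clear_and_set(toy_nodemask *possible, const toy_nodemask *online)
{
	toy_nodes_clear(possible);
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (online->bits[nid])
			toy_node_set(nid, possible);
}

/* nodes_and() variant: possible = possible & online. */
static void narrow_and(toy_nodemask *possible, const toy_nodemask *online)
{
	toy_nodes_and(possible, possible, online);
}

/* Direct assignment: possible = online. */
static void narrow_assign(toy_nodemask *possible, const toy_nodemask *online)
{
	*possible = *online;
}
```

The AND form would differ from the other two only if some online node
were missing from the possible map to begin with. In that case it would
leave the possible map without that node, whereas assignment keeps the
possible map a superset of the online map by construction.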


* [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot
  2015-03-05 23:29     ` David Rientjes
@ 2015-03-06  5:27       ` Nishanth Aravamudan
  2015-03-06 11:29         ` Raghavendra K T
  2015-03-09 23:55         ` Michael Ellerman
  0 siblings, 2 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-06  5:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, linuxppc-dev, Raghavendra K T, Paul Mackerras,
	Anton Blanchard

On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> 
> > So if we compare to x86:
> > 
> > arch/x86/mm/numa.c::numa_init():
> > 
> >         nodes_clear(numa_nodes_parsed);
> >         nodes_clear(node_possible_map);
> >         nodes_clear(node_online_map);
> > 	...
> > 	numa_register_memblks(...);
> > 
> > arch/x86/mm/numa.c::numa_register_memblks():
> > 
> > 	node_possible_map = numa_nodes_parsed;
> > 
> > Basically, it looks like x86 NUMA init clears out possible map and
> > online map, probably for a similar reason to what I gave in the
> > changelog that by default, the possible map seems to be based off
> > MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
> > 
> > My patch was an attempt to emulate the same thing on powerpc. You are
> > right that there is a window in which the node_possible_map and
> > node_online_map are out of sync with my patch. It seems like it
> > shouldn't matter given how early in boot we are, but perhaps the
> > following would have been clearer:
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..1a118b08fad2 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,6 +958,13 @@ void __init initmem_init(void)
> >  
> >         memblock_dump_all();
> >  
> > +       /*
> > +        * Reduce the possible NUMA nodes to the online NUMA nodes,
> > +        * since we do not support node hotplug. This ensures that we
> > +        * lower the maximum NUMA node ID to what is actually present.
> > +        */
> > +       nodes_and(node_possible_map, node_possible_map, node_online_map);
> 
> If you don't support node hotplug, then a node should always be possible 
> if it's online unless there are other tricks powerpc plays with 
> node_possible_map.  Shouldn't this just be 
> node_possible_map = node_online_map?

Yeah, but I was too dumb to think of that before sending :)

Updated version follows...

-Nish


Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.
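
To illustrate the chain above, here is a small user-space sketch (not
kernel code). It assumes nr_node_ids is derived as the highest possible
node ID plus one, which is my reading of setup_nr_node_ids; all of the
sketch_* names are made up for this example, and the mask uses one byte
per node rather than a real bitmap.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NODES_SHIFT  8                          /* value quoted for power */
#define SKETCH_MAX_NUMNODES (1 << SKETCH_NODES_SHIFT)  /* 256 */

/* Toy stand-in for a node mask: one byte per node instead of one bit. */
typedef struct {
	unsigned char set[SKETCH_MAX_NUMNODES];
} sketch_mask_t;

static void sketch_fill_all(sketch_mask_t *m)
{
	memset(m->set, 1, sizeof(m->set));
}

static void sketch_clear(sketch_mask_t *m)
{
	memset(m->set, 0, sizeof(m->set));
}

static void sketch_node_set(sketch_mask_t *m, int nid)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NUMNODES);
	m->set[nid] = 1;
}

/* Assumed model of setup_nr_node_ids(): highest possible node ID plus one. */
static int sketch_nr_node_ids(const sketch_mask_t *possible)
{
	int nid, highest = -1;

	for (nid = 0; nid < SKETCH_MAX_NUMNODES; nid++)
		if (possible->set[nid])
			highest = nid;
	return highest + 1;
}

/* Number of iterations a for_each_node-style loop would make. */
static int sketch_for_each_node_count(const sketch_mask_t *possible)
{
	int nid, count = 0;

	for (nid = 0; nid < SKETCH_MAX_NUMNODES; nid++)
		if (possible->set[nid])
			count++;
	return count;
}
```

With every node marked possible, both helpers return 256. With only
nodes 0 and 1 set, both return 2. With a sparse mask such as nodes 0 and
16, the loop count is 2 while the nr_node_ids model is 17.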

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.
    
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

---
v1 -> v2:
  Rather than clear node_possible_map and set it nid-by-nid, just
  directly assign node_online_map to it, as suggested by Michael
  Ellerman and Tejun Heo.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..0c1716cd271f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	node_possible_map = node_online_map;
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 

* Re: [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot
  2015-03-06  5:27       ` [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot Nishanth Aravamudan
@ 2015-03-06 11:29         ` Raghavendra K T
  2015-03-09 23:55         ` Michael Ellerman
  1 sibling, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2015-03-06 11:29 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Tejun Heo, linuxppc-dev, Paul Mackerras, Anton Blanchard, David Rientjes

On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:
> On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
>> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>>
>>> So if we compare to x86:
>>>
>>> arch/x86/mm/numa.c::numa_init():
>>>
>>>          nodes_clear(numa_nodes_parsed);
>>>          nodes_clear(node_possible_map);
>>>          nodes_clear(node_online_map);
>>> 	...
>>> 	numa_register_memblks(...);
>>>
>>> arch/x86/mm/numa.c::numa_register_memblks():
>>>
>>> 	node_possible_map = numa_nodes_parsed;
>>>
>>> Basically, it looks like x86 NUMA init clears out possible map and
>>> online map, probably for a similar reason to what I gave in the
>>> changelog that by default, the possible map seems to be based off
>>> MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
>>>
>>> My patch was an attempt to emulate the same thing on powerpc. You are
>>> right that there is a window in which the node_possible_map and
>>> node_online_map are out of sync with my patch. It seems like it
>>> shouldn't matter given how early in boot we are, but perhaps the
>>> following would have been clearer:
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0257a7d659ef..1a118b08fad2 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>>>
>>>          memblock_dump_all();
>>>
>>> +       /*
>>> +        * Reduce the possible NUMA nodes to the online NUMA nodes,
>>> +        * since we do not support node hotplug. This ensures that we
>>> +        * lower the maximum NUMA node ID to what is actually present.
>>> +        */
>>> +       nodes_and(node_possible_map, node_possible_map, node_online_map);
>>
>> If you don't support node hotplug, then a node should always be possible
>> if it's online unless there are other tricks powerpc plays with
>> node_possible_map.  Shouldn't this just be
>> node_possible_map = node_online_map?
>
> Yeah, but I was too dumb to think of that before sending :)
>
> Updated version follows...
>
> -Nish
>
---8<---
>
> Raghu noticed an issue with excessive memory allocation on power with a
> simple cgroup test, specifically, in mem_cgroup_css_alloc ->
> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
> up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
> directories).
Should we also add that, after this patch, it is reduced to around 2MB?
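
As a rough sanity check on those numbers, here is a back-of-the-envelope
sketch. It assumes one kmalloc-2048 object is allocated per iterated
node for each cgroup directory; the function name is hypothetical and
the node counts below (other than 256) are only illustrative.

```c
/*
 * Assumption: one 2048-byte slab object per node per cgroup. The size
 * comes from the kmalloc-2048 slab named in the report.
 */
#define PER_NODE_OBJ_BYTES 2048UL

/* Estimated bytes used by `cgroups` directories iterating over `nodes`. */
static unsigned long memcg_per_node_bytes(unsigned long cgroups,
					  unsigned long nodes)
{
	return cgroups * nodes * PER_NODE_OBJ_BYTES;
}
```

Under that assumption, memcg_per_node_bytes(400, 256) is 209715200
bytes, i.e. 200MB, which lines up with the figure in the changelog. With
2 or 3 nodes the same formula gives about 1.6MB or 2.4MB, which is in
the neighbourhood of the 2MB seen after the patch.
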
>
> The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
> possible), which defines node_possible_map, which in turn defines the
> value of nr_node_ids in setup_nr_node_ids and the iteration of
> for_each_node.
>
> In practice, we never see a system with 256 NUMA nodes, and in fact, we
> do not support node hotplug on power in the first place, so the nodes
> that are online when we come up are the nodes that will be present for
> the lifetime of this kernel. So let's, at least, drop the NUMA possible
> map down to the online map at runtime. This is similar to what x86 does
> in its initialization routines.
>
> mem_cgroup_css_alloc should also be fixed to only iterate over
> memory-populated nodes and handle hotplug, but that is a separate
> change.
>
Maybe we could formally add
Reported-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> To: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> ---
> v1 -> v2:
>    Rather than clear node_possible_map and set it nid-by-nid, just
>    directly assign node_online_map to it, as suggested by Michael
>    Ellerman and Tejun Heo.
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>   	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */

  Hope we remember this change when we add hotplug :)

> +	node_possible_map = node_online_map;
> +
>   	for_each_online_node(nid) {
>   		unsigned long start_pfn, end_pfn;
>

* Re: [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot
  2015-03-06  5:27       ` [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot Nishanth Aravamudan
  2015-03-06 11:29         ` Raghavendra K T
@ 2015-03-09 23:55         ` Michael Ellerman
  2015-03-10 23:50           ` [PATCH v3] " Nishanth Aravamudan
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Ellerman @ 2015-03-09 23:55 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, David Rientjes,
	Tejun Heo, linuxppc-dev

On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>  
>  	memblock_dump_all();
>  
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */
> +	node_possible_map = node_online_map;

That looks nice, but is it generating what we want?

ie. is the content of node_online_map being *copied* into node_possible_map?

Or are we changing node_possible_map to point at node_online_map?
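
For reference, nodemask_t in include/linux/nodemask.h is a struct
wrapping a fixed-size bitmap, and C assignment between struct objects
copies the contents. A minimal user-space model of that, with made-up
model_* names and a hard-coded 256-node size, might look like:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NUMNODES  256
#define MODEL_BITS_PER_LONG (8 * sizeof(unsigned long))
#define MODEL_LONGS \
	((MODEL_MAX_NUMNODES + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Models nodemask_t: a struct wrapping a bitmap array, not a pointer. */
typedef struct {
	unsigned long bits[MODEL_LONGS];
} model_nodemask_t;

static void model_node_set(int nid, model_nodemask_t *m)
{
	assert(nid >= 0 && nid < MODEL_MAX_NUMNODES);
	m->bits[(size_t)nid / MODEL_BITS_PER_LONG] |=
		1UL << ((size_t)nid % MODEL_BITS_PER_LONG);
}

static int model_node_isset(int nid, const model_nodemask_t *m)
{
	assert(nid >= 0 && nid < MODEL_MAX_NUMNODES);
	return (m->bits[(size_t)nid / MODEL_BITS_PER_LONG] >>
		((size_t)nid % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Returns 1 if `possible = online` produced an independent copy. */
static int model_assignment_copies(void)
{
	model_nodemask_t online = { { 0 } };
	model_nodemask_t possible = { { 0 } };

	model_node_set(0, &online);
	model_node_set(1, &online);

	possible = online;		/* struct assignment copies every word */

	model_node_set(5, &online);	/* a later change to the source mask */

	/* The copy keeps node 1 and does not see the later node 5. */
	return model_node_isset(1, &possible) &&
	       !model_node_isset(5, &possible);
}
```

In this model the two masks stay independent after the assignment, so a
later change to the online mask would not show up in the possible mask.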

cheers

* [PATCH v3] powerpc/numa: set node_possible_map to only node_online_map during boot
  2015-03-09 23:55         ` Michael Ellerman
@ 2015-03-10 23:50           ` Nishanth Aravamudan
  0 siblings, 0 replies; 18+ messages in thread
From: Nishanth Aravamudan @ 2015-03-10 23:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Raghavendra K T, Paul Mackerras, Anton Blanchard, David Rientjes,
	Tejun Heo, linuxppc-dev

On 10.03.2015 [10:55:05 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..0c1716cd271f 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,6 +958,13 @@ void __init initmem_init(void)
> >  
> >  	memblock_dump_all();
> >  
> > +	/*
> > +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> > +	 * since we do not support node hotplug. This ensures that we
> > +	 * lower the maximum NUMA node ID to what is actually present.
> > +	 */
> > +	node_possible_map = node_online_map;
> 
> That looks nice, but is it generating what we want?
> 
> ie. is the content of node_online_map being *copied* into node_possible_map.
> 
> Or are we changing node_possible_map to point at node_online_map?

I think it ends up being the latter, which is probably fine in practice
(I think node_online_map is static on power after boot), but perhaps it
would be better to do:

nodes_and(node_possible_map, node_possible_map, node_online_map);

?

e.g.:


powerpc/numa: reset node_possible_map to only node_online_map

Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
    
---
v1 -> v2:
  Rather than clear node_possible_map and set it nid-by-nid, just
  directly assign node_online_map to it, as suggested by Michael
  Ellerman and Tejun Heo.

v2 -> v3:
  Rather than direct assignment (which is just repointing the pointer),
  modify node_possible_map in-place.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	nodes_and(node_possible_map, node_possible_map, node_online_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 

end of thread, other threads:[~2015-03-10 23:52 UTC | newest]

Thread overview: 18+ messages
2015-03-05 18:05 [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Nishanth Aravamudan
2015-03-05 21:16 ` David Rientjes
2015-03-05 21:48   ` Michael Ellerman
2015-03-05 21:58     ` David Rientjes
2015-03-05 22:08       ` Tejun Heo
2015-03-05 22:18         ` Tejun Heo
2015-03-05 23:21         ` Nishanth Aravamudan
2015-03-05 23:24           ` Tejun Heo
2015-03-05 23:20       ` Nishanth Aravamudan
2015-03-05 23:17     ` Nishanth Aravamudan
2015-03-05 23:15   ` Nishanth Aravamudan
2015-03-05 23:29     ` David Rientjes
2015-03-06  5:27       ` [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot Nishanth Aravamudan
2015-03-06 11:29         ` Raghavendra K T
2015-03-09 23:55         ` Michael Ellerman
2015-03-10 23:50           ` [PATCH v3] " Nishanth Aravamudan
2015-03-05 22:13 ` [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map Tejun Heo
2015-03-05 23:27   ` Nishanth Aravamudan
