linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
@ 2015-07-02 23:02 Nishanth Aravamudan
  2015-07-02 23:03 ` [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table Nishanth Aravamudan
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-02 23:02 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: David Rientjes, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
have an ordering issue during boot with early calls to cpu_to_node().
The value returned by those calls now depend on the per-cpu area being
setup, but that is not guaranteed to be the case during boot. Instead,
we need to add an early_cpu_to_node() which doesn't use the per-CPU area
and call that from certain spots that are known to invoke cpu_to_node()
before the per-CPU areas are not configured.

On an example 2-node NUMA system with the following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2029 MB
node 0 free: 1753 MB
node 1 cpus: 4 5 6 7
node 1 size: 2045 MB
node 1 free: 1945 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

we currently emit at boot:

[    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 

After this commit, we correctly emit:

[    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 5f1048e..f2c4c89 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
+extern int early_cpu_to_node(int);
+
 extern void __init dump_numa_cpu_topology(void);
 
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c69671c..23a2cf3 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
 
 static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
 {
-	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
-				    __pa(MAX_DMA_ADDRESS));
+	return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
+				    align, __pa(MAX_DMA_ADDRESS));
 }
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
@@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-	if (cpu_to_node(from) == cpu_to_node(to))
+	if (early_cpu_to_node(from) == early_cpu_to_node(to))
 		return LOCAL_DISTANCE;
 	else
 		return REMOTE_DISTANCE;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e80621..9ffabf4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
 		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
 }
 
+int early_cpu_to_node(int cpu)
+{
+	return numa_cpu_lookup_table[cpu];
+}
+
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
 static void unmap_cpu_from_node(unsigned long cpu)
 {


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table
  2015-07-02 23:02 [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot Nishanth Aravamudan
@ 2015-07-02 23:03 ` Nishanth Aravamudan
  2015-07-09  1:25   ` David Rientjes
  2015-07-08  4:00 ` [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot Michael Ellerman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-02 23:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: David Rientjes, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

A simple move to a wrapper function to numa_cpu_lookup_table, now that
power has the early_cpu_to_node() API.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..7bf333b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 		 * numa_node_id() works after this.
 		 */
 		if (cpu_present(cpu)) {
-			set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+			set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
 			set_cpu_numa_mem(cpu,
-				local_memory_node(numa_cpu_lookup_table[cpu]));
+				local_memory_node(early_cpu_to_node(cpu)));
 		}
 	}
 
@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
 #ifdef CONFIG_PPC64
 	paca[boot_cpuid].__current = current;
 #endif
-	set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+	set_numa_node(early_cpu_to_node(boot_cpuid));
 	current_set[boot_cpuid] = task_thread_info(current);
 }
 


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-02 23:02 [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot Nishanth Aravamudan
  2015-07-02 23:03 ` [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table Nishanth Aravamudan
@ 2015-07-08  4:00 ` Michael Ellerman
  2015-07-08 23:16   ` Nishanth Aravamudan
  2015-07-09  1:22 ` [RFC PATCH 1/2] " David Rientjes
  2015-07-15 20:35 ` Tejun Heo
  3 siblings, 1 reply; 15+ messages in thread
From: Michael Ellerman @ 2015-07-08  4:00 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev

On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> have an ordering issue during boot with early calls to cpu_to_node().

"now that .." implies we changed something and broke this. What commit was
it that changed the behaviour?

> The value returned by those calls now depend on the per-cpu area being
> setup, but that is not guaranteed to be the case during boot. Instead,
> we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> and call that from certain spots that are known to invoke cpu_to_node()
> before the per-CPU areas are not configured.
> 
> On an example 2-node NUMA system with the following topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 2029 MB
> node 0 free: 1753 MB
> node 1 cpus: 4 5 6 7
> node 1 size: 2045 MB
> node 1 free: 1945 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> we currently emit at boot:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> 
> After this commit, we correctly emit:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 


So it looks fairly sane, and I guess it's a bug fix.

But I'm a bit reluctant to put it in straight away without some time in next.

It looks like the symptom is that the per-cpu areas are all allocated on node
0, is that all that goes wrong?

cheers

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-08  4:00 ` [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot Michael Ellerman
@ 2015-07-08 23:16   ` Nishanth Aravamudan
  2015-07-09  1:24     ` David Rientjes
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-08 23:16 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev

On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > have an ordering issue during boot with early calls to cpu_to_node().
> 
> "now that .." implies we changed something and broke this. What commit was
> it that changed the behaviour?

Well, that's something I'm trying to still unearth. In the commits
before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

At least prior to 8c272261194d, this might have been due to the old
powerpc-specific cpu_to_node():

static inline int cpu_to_node(int cpu)
{
       int nid;

       nid = numa_cpu_lookup_table[cpu];

       /*
        * During early boot, the numa-cpu lookup table might not have
        been
        * setup for all CPUs yet. In such cases, default to node 0.
        */
       return (nid < 0) ? 0 : nid;
}

which might imply that no one cares or that simply no one noticed.

> > The value returned by those calls now depend on the per-cpu area being
> > setup, but that is not guaranteed to be the case during boot. Instead,
> > we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> > and call that from certain spots that are known to invoke cpu_to_node()
> > before the per-CPU areas are not configured.
> > 
> > On an example 2-node NUMA system with the following topology:
> > 
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 2029 MB
> > node 0 free: 1753 MB
> > node 1 cpus: 4 5 6 7
> > node 1 size: 2045 MB
> > node 1 free: 1945 MB
> > node distances:
> > node   0   1 
> >   0:  10  40 
> >   1:  40  10 
> > 
> > we currently emit at boot:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> > 
> > After this commit, we correctly emit:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
> 
> 
> So it looks fairly sane, and I guess it's a bug fix.
> 
> But I'm a bit reluctant to put it in straight away without some time in next.

I'm fine with that -- it could use some more extensive testing,
admittedly (I only have been able to verify the pcpu areas are being
correctly allocated on the right node so far).

I still need to test with hotplug and things like that. Hence the RFC.

> It looks like the symptom is that the per-cpu areas are all allocated on node
> 0, is that all that goes wrong?

Yes, that's the symptom. I cc'd a few folks to see if they could help
indicate the performance implications of such a setup -- sorry, I should
have been more explicit about that.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-02 23:02 [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot Nishanth Aravamudan
  2015-07-02 23:03 ` [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table Nishanth Aravamudan
  2015-07-08  4:00 ` [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot Michael Ellerman
@ 2015-07-09  1:22 ` David Rientjes
  2015-07-10 16:25   ` Nishanth Aravamudan
  2015-07-15 20:35 ` Tejun Heo
  3 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2015-07-09  1:22 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

> Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> have an ordering issue during boot with early calls to cpu_to_node().
> The value returned by those calls now depend on the per-cpu area being
> setup, but that is not guaranteed to be the case during boot. Instead,
> we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> and call that from certain spots that are known to invoke cpu_to_node()
> before the per-CPU areas are not configured.
> 
> On an example 2-node NUMA system with the following topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 2029 MB
> node 0 free: 1753 MB
> node 1 cpus: 4 5 6 7
> node 1 size: 2045 MB
> node 1 free: 1945 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> we currently emit at boot:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> 
> After this commit, we correctly emit:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
> 
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 5f1048e..f2c4c89 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>  extern int __node_distance(int, int);
>  #define node_distance(a, b) __node_distance(a, b)
>  
> +extern int early_cpu_to_node(int);
> +
>  extern void __init dump_numa_cpu_topology(void);
>  
>  extern int sysfs_add_device_to_node(struct device *dev, int nid);
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index c69671c..23a2cf3 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
>  {
> -	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
> -				    __pa(MAX_DMA_ADDRESS));
> +	return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
> +				    align, __pa(MAX_DMA_ADDRESS));
>  }
>  
>  static void __init pcpu_fc_free(void *ptr, size_t size)
> @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
>  
>  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
>  {
> -	if (cpu_to_node(from) == cpu_to_node(to))
> +	if (early_cpu_to_node(from) == early_cpu_to_node(to))
>  		return LOCAL_DISTANCE;
>  	else
>  		return REMOTE_DISTANCE;
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 5e80621..9ffabf4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
>  		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
>  }
>  
> +int early_cpu_to_node(int cpu)
> +{
> +	return numa_cpu_lookup_table[cpu];
> +}
> +
>  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
>  static void unmap_cpu_from_node(unsigned long cpu)
>  {
> 
> 

early_cpu_to_node() looks like it's begging to be __init since we 
shouldn't have a need to reference to numa_cpu_lookup_table after boot and 
that appears like it can be done if pcpu_cpu_distance() is made __init in 
this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
So I think this is fine, but those functions and things like 
reset_numa_cpu_lookup_table() should be in init.text.

After the percpu areas on initialized and cpu_to_node() is correct, it 
would be really nice to be able to make numa_cpu_lookup_table[] be 
__initdata since it shouldn't be necessary anymore.  That probably has cpu 
callbacks that need to be modified to no longer look at 
numa_cpu_lookup_table[] or pass the value in, but it would make it much 
cleaner.  Then nobody will have to worry about figuring out whether 
early_cpu_to_node() or cpu_to_node() is the right one to call.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-08 23:16   ` Nishanth Aravamudan
@ 2015-07-09  1:24     ` David Rientjes
  2015-07-10 16:15     ` Nishanth Aravamudan
  2015-07-15  0:22     ` Michael Ellerman
  2 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2015-07-09  1:24 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, linuxppc-dev

On Wed, 8 Jul 2015, Nishanth Aravamudan wrote:

> > It looks like the symptom is that the per-cpu areas are all allocated on node
> > 0, is that all that goes wrong?
> 
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I should
> have been more explicit about that.
> 

Yeah, not sure it's really a bugfix but rather a performance optimization 
since cpu_to_node() with CONFIG_USE_PERCPU_NUMA_NODE_ID is only about 
performance.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table
  2015-07-02 23:03 ` [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table Nishanth Aravamudan
@ 2015-07-09  1:25   ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2015-07-09  1:25 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

> A simple move to a wrapper function to numa_cpu_lookup_table, now that
> power has the early_cpu_to_node() API.
> 
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

When early_cpu_to_node() is __init:

	Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-08 23:16   ` Nishanth Aravamudan
  2015-07-09  1:24     ` David Rientjes
@ 2015-07-10 16:15     ` Nishanth Aravamudan
  2015-07-15 20:37       ` Tejun Heo
  2015-07-15  0:22     ` Michael Ellerman
  2 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-10 16:15 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, tj

On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > > have an ordering issue during boot with early calls to cpu_to_node().
> > 
> > "now that .." implies we changed something and broke this. What commit was
> > it that changed the behaviour?
> 
> Well, that's something I'm trying to still unearth. In the commits
> before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
> 
> pcpu-alloc: [0] 0 1 2 3 4 5 6 7

Ok, I did a bisection, and it seems like prior to commit
1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
asynchronous chunk population"), we emitted the above, e.g.:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

And after that commit, we emitted:

pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7

I'm not exactly sure why that changed, but I'm still
reading/understanding the commit. Tejun might be able to explain.

Tejun, for reference, I noticed on Power systems since the
above-mentioned commit, pcpu-alloc is not reflecting the topology of the
system correctly -- that is, the pcpu areas are all on node 0
unconditionally (based up on pcpu-alloc's output). Prior to that, there
was just one group, it seems like, which completely ignored the NUMA
topology.

Is this just an ordering thing that changed with the introduction of the
async code?

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-09  1:22 ` [RFC PATCH 1/2] " David Rientjes
@ 2015-07-10 16:25   ` Nishanth Aravamudan
  2015-07-14 21:31     ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-10 16:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

On 08.07.2015 [18:22:09 -0700], David Rientjes wrote:
> On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:
> 
> > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > have an ordering issue during boot with early calls to cpu_to_node().
> > The value returned by those calls now depend on the per-cpu area being
> > setup, but that is not guaranteed to be the case during boot. Instead,
> > we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> > and call that from certain spots that are known to invoke cpu_to_node()
> > before the per-CPU areas are not configured.
> > 
> > On an example 2-node NUMA system with the following topology:
> > 
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 2029 MB
> > node 0 free: 1753 MB
> > node 1 cpus: 4 5 6 7
> > node 1 size: 2045 MB
> > node 1 free: 1945 MB
> > node distances:
> > node   0   1 
> >   0:  10  40 
> >   1:  40  10 
> > 
> > we currently emit at boot:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> > 
> > After this commit, we correctly emit:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
> > 
> > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> > 
> > diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> > index 5f1048e..f2c4c89 100644
> > --- a/arch/powerpc/include/asm/topology.h
> > +++ b/arch/powerpc/include/asm/topology.h
> > @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
> >  extern int __node_distance(int, int);
> >  #define node_distance(a, b) __node_distance(a, b)
> >  
> > +extern int early_cpu_to_node(int);
> > +
> >  extern void __init dump_numa_cpu_topology(void);
> >  
> >  extern int sysfs_add_device_to_node(struct device *dev, int nid);
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index c69671c..23a2cf3 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
> >  
> >  static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
> >  {
> > -	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
> > -				    __pa(MAX_DMA_ADDRESS));
> > +	return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
> > +				    align, __pa(MAX_DMA_ADDRESS));
> >  }
> >  
> >  static void __init pcpu_fc_free(void *ptr, size_t size)
> > @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
> >  
> >  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
> >  {
> > -	if (cpu_to_node(from) == cpu_to_node(to))
> > +	if (early_cpu_to_node(from) == early_cpu_to_node(to))
> >  		return LOCAL_DISTANCE;
> >  	else
> >  		return REMOTE_DISTANCE;
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 5e80621..9ffabf4 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
> >  		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> >  }
> >  
> > +int early_cpu_to_node(int cpu)
> > +{
> > +	return numa_cpu_lookup_table[cpu];
> > +}
> > +
> >  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> >  static void unmap_cpu_from_node(unsigned long cpu)
> >  {
> > 
> > 
> 
> early_cpu_to_node() looks like it's begging to be __init since we 
> shouldn't have a need to reference to numa_cpu_lookup_table after boot and 
> that appears like it can be done if pcpu_cpu_distance() is made __init in 
> this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
> So I think this is fine, but those functions and things like 
> reset_numa_cpu_lookup_table() should be in init.text.

Yep, that makes total sense!

> After the percpu areas on initialized and cpu_to_node() is correct, it 
> would be really nice to be able to make numa_cpu_lookup_table[] be 
> __initdata since it shouldn't be necessary anymore.  That probably has cpu 
> callbacks that need to be modified to no longer look at 
> numa_cpu_lookup_table[] or pass the value in, but it would make it much 
> cleaner.  Then nobody will have to worry about figuring out whether 
> early_cpu_to_node() or cpu_to_node() is the right one to call.

When I worked on the original pcpu patches for power, I wanted to do
this, but got myself confused and never came back to it. Thank you for
suggesting it and I'll work on it soon.

-Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-10 16:25   ` Nishanth Aravamudan
@ 2015-07-14 21:31     ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2015-07-14 21:31 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Anton Blanchard, Peter Zijlstra, linuxppc-dev, linux-kernel

On Fri, 10 Jul 2015, Nishanth Aravamudan wrote:

> > After the percpu areas on initialized and cpu_to_node() is correct, it 
> > would be really nice to be able to make numa_cpu_lookup_table[] be 
> > __initdata since it shouldn't be necessary anymore.  That probably has cpu 
> > callbacks that need to be modified to no longer look at 
> > numa_cpu_lookup_table[] or pass the value in, but it would make it much 
> > cleaner.  Then nobody will have to worry about figuring out whether 
> > early_cpu_to_node() or cpu_to_node() is the right one to call.
> 
> When I worked on the original pcpu patches for power, I wanted to do
> this, but got myself confused and never came back to it. Thank you for
> suggesting it and I'll work on it soon.
> 

Great, thanks for taking it on!  I have powerpc machines so I can test 
this and try to help where possible.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-08 23:16   ` Nishanth Aravamudan
  2015-07-09  1:24     ` David Rientjes
  2015-07-10 16:15     ` Nishanth Aravamudan
@ 2015-07-15  0:22     ` Michael Ellerman
  2 siblings, 0 replies; 15+ messages in thread
From: Michael Ellerman @ 2015-07-15  0:22 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev

On Wed, 2015-07-08 at 16:16 -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > 
> > > we currently emit at boot:
> > > 
> > > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> > > 
> > > After this commit, we correctly emit:
> > > 
> > > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
> > 
> > 
> > So it looks fairly sane, and I guess it's a bug fix.
> > 
> > But I'm a bit reluctant to put it in straight away without some time in next.
> 
> I'm fine with that -- it could use some more extensive testing,
> admittedly (I only have been able to verify the pcpu areas are being
> correctly allocated on the right node so far).
> 
> I still need to test with hotplug and things like that. Hence the RFC.
> 
> > It looks like the symptom is that the per-cpu areas are all allocated on node
> > 0, is that all that goes wrong?
> 
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I should
> have been more explicit about that.

OK cool. I'm happy to put it in next if you send a non-RFC version.

cheers



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-02 23:02 [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot Nishanth Aravamudan
                   ` (2 preceding siblings ...)
  2015-07-09  1:22 ` [RFC PATCH 1/2] " David Rientjes
@ 2015-07-15 20:35 ` Tejun Heo
  2015-07-15 22:43   ` Nishanth Aravamudan
  3 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2015-07-15 20:35 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, David Rientjes, Benjamin Herrenschmidt,
	Paul Mackerras, Anton Blanchard, Peter Zijlstra, linuxppc-dev,
	linux-kernel

Hello,

On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
> we currently emit at boot:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> 
> After this commit, we correctly emit:
> 
> [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 

JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
allocation group numbers and they're not split according to nodes but
percpu allocation units.  In both cases, there are two units each
serving 0-3 and 4-7.  In the above case, because it wasn't being fed
the correct NUMA information, both got assigned to the same group.  In
the latter, they got assigned to different ones but even then if the
group numbers match NUMA node numbers, that's just a coincidence.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-10 16:15     ` Nishanth Aravamudan
@ 2015-07-15 20:37       ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2015-07-15 20:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, David Rientjes, linuxppc-dev

Hello,

On Fri, Jul 10, 2015 at 09:15:47AM -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> > On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > > > have an ordering issue during boot with early calls to cpu_to_node().
> > > 
> > > "now that .." implies we changed something and broke this. What commit was
> > > it that changed the behaviour?
> > 
> > Well, that's something I'm trying to still unearth. In the commits
> > before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> > "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
> > 
> > pcpu-alloc: [0] 0 1 2 3 4 5 6 7
> 
> Ok, I did a bisection, and it seems like prior to commit
> 1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
> asynchronous chunk population"), we emitted the above, e.g.:
> 
> pcpu-alloc: [0] 0 1 2 3 4 5 6 7
> 
> And after that commit, we emitted:
> 
> pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> 
> I'm not exactly sure why that changed, but I'm still
> reading/understanding the commit. Tejun might be able to explain.
> 
> Tejun, for reference, I noticed on Power systems since the
> above-mentioned commit, pcpu-alloc is not reflecting the topology of the
> system correctly -- that is, the pcpu areas are all on node 0
> unconditionally (based up on pcpu-alloc's output). Prior to that, there
> was just one group, it seems like, which completely ignored the NUMA
> topology.
> 
> Is this just an ordering thing that changed with the introduction of the
> async code?

It's just each unit growing and percpu allocator deciding to split
them into separate allocation units.  Before it was serving all cpus
in a single alloc unit as they looked like they belong to the same
NUMA node and small enough to fit into one alloc unit.  In the latter,
the async one added more reserve space, so the allocator is deciding
to split them into two alloc units while assigning them to the same
group as the NUMA info wasn't still there.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-15 20:35 ` Tejun Heo
@ 2015-07-15 22:43   ` Nishanth Aravamudan
  2015-07-15 22:47     ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-07-15 22:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michael Ellerman, David Rientjes, Benjamin Herrenschmidt,
	Paul Mackerras, Anton Blanchard, Peter Zijlstra, linuxppc-dev,
	linux-kernel

On 15.07.2015 [16:35:16 -0400], Tejun Heo wrote:
> Hello,
> 
> On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
> > we currently emit at boot:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
> > 
> > After this commit, we correctly emit:
> > 
> > [    0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
> 
> JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
> allocation group numbers and they're not split according to nodes but
> percpu allocation units.  In both cases, there are two units each
> serving 0-3 and 4-7.  In the above case, because it wasn't being fed
> the correct NUMA information, both got assigned to the same group.  In
> the latter, they got assigned to different ones but even then if the
> group numbers match NUMA node numbers, that's just a coincidence.

Ok, thank you for clarifying! From a correctness perspective, even if
the numbers don't match NUMA nodes, should we expect the grouping to be
split along NUMA topology?

-Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
  2015-07-15 22:43   ` Nishanth Aravamudan
@ 2015-07-15 22:47     ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2015-07-15 22:47 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, David Rientjes, Benjamin Herrenschmidt,
	Paul Mackerras, Anton Blanchard, Peter Zijlstra, linuxppc-dev,
	linux-kernel

Hello,

On Wed, Jul 15, 2015 at 03:43:51PM -0700, Nishanth Aravamudan wrote:
> Ok, thank you for clarifying! From a correctness perspective, even if
> the numbers don't match NUMA nodes, should we expect the grouping to be
> split along NUMA topology?

Yeap, the groups get formed according to the node distances.  Nodes
which are not at LOCAL_DISTANCE are always put in different groups.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-07-15 22:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-02 23:02 [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot Nishanth Aravamudan
2015-07-02 23:03 ` [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table Nishanth Aravamudan
2015-07-09  1:25   ` David Rientjes
2015-07-08  4:00 ` [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot Michael Ellerman
2015-07-08 23:16   ` Nishanth Aravamudan
2015-07-09  1:24     ` David Rientjes
2015-07-10 16:15     ` Nishanth Aravamudan
2015-07-15 20:37       ` Tejun Heo
2015-07-15  0:22     ` Michael Ellerman
2015-07-09  1:22 ` [RFC PATCH 1/2] " David Rientjes
2015-07-10 16:25   ` Nishanth Aravamudan
2015-07-14 21:31     ` David Rientjes
2015-07-15 20:35 ` Tejun Heo
2015-07-15 22:43   ` Nishanth Aravamudan
2015-07-15 22:47     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).