linux-kernel.vger.kernel.org archive mirror
* [FIX PATCH 0/2] Fix NUMA nodes fallback list ordering
@ 2021-08-30 12:16 Bharata B Rao
  2021-08-30 12:16 ` [FIX PATCH 1/2] mm/page_alloc: Print node fallback order Bharata B Rao
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
  0 siblings, 2 replies; 13+ messages in thread
From: Bharata B Rao @ 2021-08-30 12:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan, Bharata B Rao

For a NUMA system that has multiple nodes at the same distance from
other nodes, the fallback list generation prefers the same node order
for all of them instead of a round-robin order, thereby penalizing one
node over the others. This series fixes it.

A more detailed description of the problem and the fix can be found
in the individual patch descriptions.

Bharata B Rao (1):
  mm/page_alloc: Print node fallback order

Krupa Ramakrishnan (1):
  mm/page_alloc: Use accumulated load when building node fallback list

 mm/page_alloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [FIX PATCH 1/2] mm/page_alloc: Print node fallback order
  2021-08-30 12:16 [FIX PATCH 0/2] Fix NUMA nodes fallback list ordering Bharata B Rao
@ 2021-08-30 12:16 ` Bharata B Rao
  2021-08-30 12:26   ` Mel Gorman
                     ` (2 more replies)
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
  1 sibling, 3 replies; 13+ messages in thread
From: Bharata B Rao @ 2021-08-30 12:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan, Bharata B Rao

Print information message about the allocation fallback order
for each NUMA node during boot.

No functional changes here. This makes it easier to illustrate
the problem in the node fallback list generation, which the
next patch fixes.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/page_alloc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9cb36bb..22f7ad6ec11c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6277,6 +6277,10 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+	pr_info("Fallback order for Node %d: ", local_node);
+	for (node = 0; node < nr_nodes; node++)
+		pr_cont("%d ", node_order[node]);
+	pr_cont("\n");
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-30 12:16 [FIX PATCH 0/2] Fix NUMA nodes fallback list ordering Bharata B Rao
  2021-08-30 12:16 ` [FIX PATCH 1/2] mm/page_alloc: Print node fallback order Bharata B Rao
@ 2021-08-30 12:16 ` Bharata B Rao
  2021-08-30 12:29   ` Mel Gorman
                     ` (3 more replies)
  1 sibling, 4 replies; 13+ messages in thread
From: Bharata B Rao @ 2021-08-30 12:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan, Krupa Ramakrishnan,
	Bharata B Rao

From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>

In build_zonelists(), when the fallback list is built for the nodes,
the node load gets reinitialized during each iteration. This results
in nodes with the same distance occupying the same slot in different
node fallback lists rather than appearing in the intended round-robin
manner. As a result, one node gets picked for allocation more often
than other nodes at the same distance.

As an example, consider a 4 node system with the following distance
matrix.

Node 0  1  2  3
----------------
0    10 12 32 32
1    12 10 32 32
2    32 32 10 12
3    32 32 12 10

For this case, the node fallback list gets built like this:

Node	Fallback list
---------------------
0	0 1 2 3
1	1 0 3 2
2	2 3 0 1
3	3 2 0 1 <-- Unexpected fallback order

In the fallback list for nodes 2 and 3, the nodes 0 and 1
appear in the same order which results in more allocations
getting satisfied from node 0 compared to node 1.

The effect of this on remote memory bandwidth as seen by stream
benchmark is shown below:

Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

----------------------------------------
		BANDWIDTH (MB/s)
    TEST	Case 1		Case 2
----------------------------------------
    COPY	57479.6		110791.8
   SCALE	55372.9		105685.9
     ADD	50460.6		96734.2
  TRIADD	50397.6		97119.1
----------------------------------------

The bandwidth drop in Case 1 occurs because most of the allocations
get satisfied by node 0 as it appears first in the fallback order
for both nodes 2 and 3.

This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration. With this, nodes
with the same distance are rightly assigned in a round-robin manner.
In fact, this was how it originally worked until commit f0c0b2b808f2
("change zonelist order: zonelist order selection logic") dropped the
load accumulation and resorted to initializing the load during each
iteration. While zonelist ordering was removed by commit c9bff3eebc09
("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the
node load accumulation in build_zonelists() remained. So essentially
this patch reverts to the accumulated node load logic.

After this fix, the fallback order gets built like this:

Node Fallback list
------------------
0    0 1 2 3
1    1 0 3 2
2    2 3 0 1
3    3 2 1 0 <-- Note the change here

The bandwidth in Case 1 improves and matches Case 2 as shown below.

----------------------------------------
		BANDWIDTH (MB/s)
    TEST	Case 1		Case 2
----------------------------------------
    COPY	110438.9	110107.2
   SCALE	105930.5	105817.5
     ADD	97005.1		96159.8
  TRIADD	97441.5		96757.1
----------------------------------------

The correctness of the fallback list generation has also been verified
for the above node configuration when node 3 starts as a memory-less
node and comes online only during memory hotplug.
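The reinitialize-vs-accumulate difference can also be demonstrated with a
small user-space model of the zonelist build. The sketch below is
illustrative only: the scoring mirrors the shape of find_next_best_node()
(distance, small penalties for lower-numbered and CPU-bearing nodes, plus
the node_load term), but the constants and the assumption that every node
has CPUs are simplifications, not the kernel's exact code.

```python
# Toy model of build_zonelists()/find_next_best_node() for the
# 4-node distance matrix from the changelog. Illustrative only.
DIST = [[10, 12, 32, 32],
        [12, 10, 32, 32],
        [32, 32, 10, 12],
        [32, 32, 12, 10]]
NR_NODES = 4
PENALTY_FOR_NODE_WITH_CPUS = 1   # assume every node has CPUs

def find_next_best_node(node, used, node_load):
    """Pick the unused node with the lowest score, as the kernel does."""
    best, min_val = None, None
    for n in range(NR_NODES):
        if n in used:
            continue
        val = DIST[node][n]
        val += (n < node)                 # slight "prefer the next node" bias
        val += PENALTY_FOR_NODE_WITH_CPUS
        val *= NR_NODES * NR_NODES        # ~ MAX_NODE_LOAD * nr_online_nodes
        val += node_load[n]               # prefer less-loaded nodes
        if min_val is None or val < min_val:
            best, min_val = n, val
    return best

def build_all_zonelists(accumulate):
    """Build every node's fallback list; node_load persists across nodes."""
    node_load = [0] * NR_NODES
    orders = []
    for local in range(NR_NODES):
        used, order, prev, load = set(), [], local, NR_NODES
        while (node := find_next_best_node(local, used, node_load)) is not None:
            used.add(node)
            if DIST[local][node] != DIST[local][prev]:
                if accumulate:
                    node_load[node] += load   # the fix
                else:
                    node_load[node] = load    # the buggy reinitialization
            order.append(node)
            prev = node
            load -= 1
        orders.append(order)
    return orders

print(build_all_zonelists(accumulate=False))  # node 3 gets [3, 2, 0, 1]
print(build_all_zonelists(accumulate=True))   # node 3 gets [3, 2, 1, 0]
```

With the reinitializing assignment the model reproduces the unexpected
[3, 2, 0, 1] order for node 3, and with the accumulating one it produces
the round-robin [3, 2, 1, 0] order shown above.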

[bharata@amd.com: Added changelog, review, test validation]

Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection
logic")
Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22f7ad6ec11c..47f4d160971e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6268,7 +6268,7 @@ static void build_zonelists(pg_data_t *pgdat)
 		 */
 		if (node_distance(local_node, node) !=
 		    node_distance(local_node, prev_node))
-			node_load[node] = load;
+			node_load[node] += load;
 
 		node_order[nr_nodes++] = node;
 		prev_node = node;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 1/2] mm/page_alloc: Print node fallback order
  2021-08-30 12:16 ` [FIX PATCH 1/2] mm/page_alloc: Print node fallback order Bharata B Rao
@ 2021-08-30 12:26   ` Mel Gorman
  2021-09-03  4:15   ` Anshuman Khandual
  2021-09-03  4:31   ` Anshuman Khandual
  2 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2021-08-30 12:26 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, kamezawa.hiroyu, lee.schermerhorn,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan

On Mon, Aug 30, 2021 at 05:46:02PM +0530, Bharata B Rao wrote:
> Print information message about the allocation fallback order
> for each NUMA node during boot.
> 
> No functional changes here. This makes it easier to illustrate
> the problem in the node fallback list generation, which the
> next patch fixes.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>

While the message is just informational, I think it's possible that the
bug could have been found earlier if this message was present so...

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
@ 2021-08-30 12:29   ` Mel Gorman
  2021-08-31  9:58   ` Anshuman Khandual
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2021-08-30 12:29 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, kamezawa.hiroyu, lee.schermerhorn,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan

On Mon, Aug 30, 2021 at 05:46:03PM +0530, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> 
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with same distances occupying the same slot in different
> node fallback lists rather than appearing in the intended round-
> robin manner. This results in one node getting picked for allocation
> more compared to other nodes with the same distance.
> 
> As an example, consider a 4 node system with the following distance
> matrix.
> 
> Node 0  1  2  3
> ----------------
> 0    10 12 32 32
> 1    12 10 32 32
> 2    32 32 10 12
> 3    32 32 12 10
> 
> For this case, the node fallback list gets built like this:
> 
> Node	Fallback list
> ---------------------
> 0	0 1 2 3
> 1	1 0 3 2
> 2	2 3 0 1
> 3	3 2 0 1 <-- Unexpected fallback order
> 
> In the fallback list for nodes 2 and 3, the nodes 0 and 1
> appear in the same order which results in more allocations
> getting satisfied from node 0 compared to node 1.
> 
> The effect of this on remote memory bandwidth as seen by stream
> benchmark is shown below:
> 
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
> 	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
> 	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
> 
> ----------------------------------------
> 		BANDWIDTH (MB/s)
>     TEST	Case 1		Case 2
> ----------------------------------------
>     COPY	57479.6		110791.8
>    SCALE	55372.9		105685.9
>      ADD	50460.6		96734.2
>   TRIADD	50397.6		97119.1
> ----------------------------------------
> 
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0 as it appears first in the fallback order
> for both nodes 2 and 3.
> 
> This can be fixed by accumulating the node load in build_zonelists()
> rather than reinitializing it during each iteration. With this the
> nodes with the same distance rightly get assigned in the round robin
> manner. In fact this was how it was originally until the
> commit f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic") dropped the load accumulation and resorted to initializing
> the load during each iteration. While zonelist ordering was removed by
> commit c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"),
> the change to the node load accumulation in build_zonelists() remained.
> So essentially this patch reverts back to the accumulated node load
> logic.
> 
> After this fix, the fallback order gets built like this:
> 
> Node Fallback list
> ------------------
> 0    0 1 2 3
> 1    1 0 3 2
> 2    2 3 0 1
> 3    3 2 1 0 <-- Note the change here
> 
> The bandwidth in Case 1 improves and matches Case 2 as shown below.
> 
> ----------------------------------------
> 		BANDWIDTH (MB/s)
>     TEST	Case 1		Case 2
> ----------------------------------------
>     COPY	110438.9	110107.2
>    SCALE	105930.5	105817.5
>      ADD	97005.1		96159.8
>   TRIADD	97441.5		96757.1
> ----------------------------------------
> 
> The correctness of the fallback list generation has been verified
> for the above node configuration where the node 3 starts as
> memory-less node and comes up online only during memory hotplug.
> 
> [bharata@amd.com: Added changelog, review, test validation]
> 
> Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic")
> Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>

Minor note: the Fixes tag should be all on one line even if it goes
over the line length limit.

Otherwise, I can confirm that 2-socket Intel machines with SNC enabled
suffer a similar problem -- the fallback lists for equal-distance nodes
end up overloading one node. Aside from the problems you mention, the
overloaded node could trigger premature reclaim and inconsistent
behaviour depending on where the task executes, so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
  2021-08-30 12:29   ` Mel Gorman
@ 2021-08-31  9:58   ` Anshuman Khandual
  2021-08-31 15:26     ` Ramakrishnan, Krupa
  2021-09-03  4:20   ` Anshuman Khandual
  2021-09-03  4:43   ` Bharata B Rao
  3 siblings, 1 reply; 13+ messages in thread
From: Anshuman Khandual @ 2021-08-31  9:58 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan



On 8/30/21 5:46 PM, Bharata B Rao wrote:
> As an example, consider a 4 node system with the following distance
> matrix.
> 
> Node 0  1  2  3
> ----------------
> 0    10 12 32 32
> 1    12 10 32 32
> 2    32 32 10 12
> 3    32 32 12 10
> 
> For this case, the node fallback list gets built like this:
> 
> Node	Fallback list
> ---------------------
> 0	0 1 2 3
> 1	1 0 3 2
> 2	2 3 0 1
> 3	3 2 0 1 <-- Unexpected fallback order
> 
> In the fallback list for nodes 2 and 3, the nodes 0 and 1
> appear in the same order which results in more allocations
> getting satisfied from node 0 compared to node 1.
> 
> The effect of this on remote memory bandwidth as seen by stream
> benchmark is shown below:
> 
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
> 	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
> 	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
> 
> ----------------------------------------
> 		BANDWIDTH (MB/s)
>     TEST	Case 1		Case 2
> ----------------------------------------
>     COPY	57479.6		110791.8
>    SCALE	55372.9		105685.9
>      ADD	50460.6		96734.2
>   TRIADD	50397.6		97119.1
> ----------------------------------------
> 
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0 as it appears first in the fallback order
> for both nodes 2 and 3.

I am wondering what causes this performance drop. Would not the memory
access latency be similar between {2, 3} --->  { 0 } and {2, 3} --->  { 1 },
given that both nodes {0, 1} are at the same distance, i.e. 32, from {2, 3}
in the above distance matrix? Even if the preferred node order changes
from { 0 } to { 1 } for the accessing node { 3 }, it should not change
the latency as such.

Or is the performance drop instead caused by excessive allocation on
node { 0 } and the resulting page allocation latency?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-31  9:58   ` Anshuman Khandual
@ 2021-08-31 15:26     ` Ramakrishnan, Krupa
  2021-09-03  4:01       ` Anshuman Khandual
  0 siblings, 1 reply; 13+ messages in thread
From: Ramakrishnan, Krupa @ 2021-08-31 15:26 UTC (permalink / raw)
  To: Anshuman Khandual, Rao, Bharata Bhasker, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman, Srinivasan, Sadagopan


The bandwidth is limited by underutilization of the cross-socket links, not by latency. Hotspotting on one node will not engage all hardware resources, based on our routing protocol, which results in the lower bandwidth. Distributing allocations equally across nodes 0 and 1 yields the best results as it stresses the full system capabilities.

Thanks
Krupa Ramakrishnan

-----Original Message-----
From: Anshuman Khandual <anshuman.khandual@arm.com> 
Sent: 31 August, 2021 4:58
To: Rao, Bharata Bhasker <bharata@amd.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org; kamezawa.hiroyu@jp.fujitsu.com; lee.schermerhorn@hp.com; mgorman@suse.de; Ramakrishnan, Krupa <Krupa.Ramakrishnan@amd.com>; Srinivasan, Sadagopan <Sadagopan.Srinivasan@amd.com>
Subject: Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list


On 8/30/21 5:46 PM, Bharata B Rao wrote:
> As an example, consider a 4 node system with the following distance 
> matrix.
>
> Node 0  1  2  3
> ----------------
> 0    10 12 32 32
> 1    12 10 32 32
> 2    32 32 10 12
> 3    32 32 12 10
>
> For this case, the node fallback list gets built like this:
>
> Node  Fallback list
> ---------------------
> 0     0 1 2 3
> 1     1 0 3 2
> 2     2 3 0 1
> 3     3 2 0 1 <-- Unexpected fallback order
>
> In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in 
> the same order which results in more allocations getting satisfied 
> from node 0 compared to node 1.
>
> The effect of this on remote memory bandwidth as seen by stream 
> benchmark is shown below:
>
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
>       (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>) 
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
>       (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
>
> ----------------------------------------
>               BANDWIDTH (MB/s)
>     TEST      Case 1          Case 2
> ----------------------------------------
>     COPY      57479.6         110791.8
>    SCALE      55372.9         105685.9
>      ADD      50460.6         96734.2
>   TRIADD      50397.6         97119.1
> ----------------------------------------
>
> The bandwidth drop in Case 1 occurs because most of the allocations 
> get satisfied by node 0 as it appears first in the fallback order for 
> both nodes 2 and 3.

I am wondering what causes this performance drop here ? Would not the memory access latency be similar between {2, 3} --->  { 0 } and {2, 3} --->  { 1 }, given both these nodes {0, 1} have same distance from {2, 3} i.e 32 from the above distance matrix. Even if the preferred node order changes from { 0 } to { 1 } for the accessing node { 3 }, it should not change the latency as such.

Is the performance drop here, is caused by excessive allocation on node { 0 } resulting from page allocation latency instead.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-31 15:26     ` Ramakrishnan, Krupa
@ 2021-09-03  4:01       ` Anshuman Khandual
  0 siblings, 0 replies; 13+ messages in thread
From: Anshuman Khandual @ 2021-09-03  4:01 UTC (permalink / raw)
  To: Ramakrishnan, Krupa, Rao, Bharata Bhasker, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman, Srinivasan, Sadagopan



On 8/31/21 8:56 PM, Ramakrishnan, Krupa wrote:
> 
> The bandwidth is limited by underutilization of cross socket links and not the  latency. Hotspotting on  one node will not engage all  hardware resources based on our routing protocol which results in the lower bandwidth. Distributing equally across nodes 0 and 1 will yield the best results as it stresses the full system capabilities.

Makes sense. Nonetheless this patch clearly solves a problem. 

> 
> Thanks
> Krupa Ramakrishnan
> 
> -----Original Message-----
> From: Anshuman Khandual <anshuman.khandual@arm.com> 
> Sent: 31 August, 2021 4:58
> To: Rao, Bharata Bhasker <bharata@amd.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; kamezawa.hiroyu@jp.fujitsu.com; lee.schermerhorn@hp.com; mgorman@suse.de; Ramakrishnan, Krupa <Krupa.Ramakrishnan@amd.com>; Srinivasan, Sadagopan <Sadagopan.Srinivasan@amd.com>
> Subject: Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
> 
> 
> On 8/30/21 5:46 PM, Bharata B Rao wrote:
>> As an example, consider a 4 node system with the following distance 
>> matrix.
>>
>> Node 0  1  2  3
>> ----------------
>> 0    10 12 32 32
>> 1    12 10 32 32
>> 2    32 32 10 12
>> 3    32 32 12 10
>>
>> For this case, the node fallback list gets built like this:
>>
>> Node  Fallback list
>> ---------------------
>> 0     0 1 2 3
>> 1     1 0 3 2
>> 2     2 3 0 1
>> 3     3 2 0 1 <-- Unexpected fallback order
>>
>> In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in 
>> the same order which results in more allocations getting satisfied 
>> from node 0 compared to node 1.
>>
>> The effect of this on remote memory bandwidth as seen by stream 
>> benchmark is shown below:
>>
>> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
>>       (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>) 
>> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
>>       (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
>>
>> ----------------------------------------
>>               BANDWIDTH (MB/s)
>>     TEST      Case 1          Case 2
>> ----------------------------------------
>>     COPY      57479.6         110791.8
>>    SCALE      55372.9         105685.9
>>      ADD      50460.6         96734.2
>>   TRIADD      50397.6         97119.1
>> ----------------------------------------
>>
>> The bandwidth drop in Case 1 occurs because most of the allocations 
>> get satisfied by node 0 as it appears first in the fallback order for 
>> both nodes 2 and 3.
> 
> I am wondering what causes this performance drop here ? Would not the memory access latency be similar between {2, 3} --->  { 0 } and {2, 3} --->  { 1 }, given both these nodes {0, 1} have same distance from {2, 3} i.e 32 from the above distance matrix. Even if the preferred node order changes from { 0 } to { 1 } for the accessing node { 3 }, it should not change the latency as such.
> 
> Is the performance drop here, is caused by excessive allocation on node { 0 } resulting from page allocation latency instead.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 1/2] mm/page_alloc: Print node fallback order
  2021-08-30 12:16 ` [FIX PATCH 1/2] mm/page_alloc: Print node fallback order Bharata B Rao
  2021-08-30 12:26   ` Mel Gorman
@ 2021-09-03  4:15   ` Anshuman Khandual
  2021-09-03  4:17     ` Bharata B Rao
  2021-09-03  4:31   ` Anshuman Khandual
  2 siblings, 1 reply; 13+ messages in thread
From: Anshuman Khandual @ 2021-09-03  4:15 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan



On 8/30/21 5:46 PM, Bharata B Rao wrote:
> Print information message about the allocation fallback order
> for each NUMA node during boot.
> 
> No functional changes here. This makes it easier to illustrate
> the problem in the node fallback list generation, which the
> next patch fixes.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  mm/page_alloc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..22f7ad6ec11c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6277,6 +6277,10 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>  	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>  	build_thisnode_zonelists(pgdat);
> +	pr_info("Fallback order for Node %d: ", local_node);
> +	for (node = 0; node < nr_nodes; node++)
> +		pr_cont("%d ", node_order[node]);
> +	pr_cont("\n");
>  }
>  
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> 

A small nit: checkpatch.pl throws up a warning. Should this use
pr_info() instead?

WARNING: Avoid logging continuation uses where feasible
#29: FILE: mm/page_alloc.c:6282:
+               pr_cont("%d ", node_order[node]);

WARNING: Avoid logging continuation uses where feasible
#30: FILE: mm/page_alloc.c:6283:
+       pr_cont("\n");

total: 0 errors, 2 warnings, 10 lines checked

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 1/2] mm/page_alloc: Print node fallback order
  2021-09-03  4:15   ` Anshuman Khandual
@ 2021-09-03  4:17     ` Bharata B Rao
  0 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2021-09-03  4:17 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan


On 9/3/2021 9:45 AM, Anshuman Khandual wrote:
> 
> 
> On 8/30/21 5:46 PM, Bharata B Rao wrote:
>> Print information message about the allocation fallback order
>> for each NUMA node during boot.
>>
>> No functional changes here. This makes it easier to illustrate
>> the problem in the node fallback list generation, which the
>> next patch fixes.
>>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> ---
>>  mm/page_alloc.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index eeb3a9cb36bb..22f7ad6ec11c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6277,6 +6277,10 @@ static void build_zonelists(pg_data_t *pgdat)
>>  
>>  	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>>  	build_thisnode_zonelists(pgdat);
>> +	pr_info("Fallback order for Node %d: ", local_node);
>> +	for (node = 0; node < nr_nodes; node++)
>> +		pr_cont("%d ", node_order[node]);
>> +	pr_cont("\n");
>>  }
>>  
>>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>>
> 
> A small nit, checkpatch.pl throws up an warning. Should this use
> pr_info() instead ?
> 
> WARNING: Avoid logging continuation uses where feasible
> #29: FILE: mm/page_alloc.c:6282:
> +               pr_cont("%d ", node_order[node]);
> 
> WARNING: Avoid logging continuation uses where feasible
> #30: FILE: mm/page_alloc.c:6283:
> +       pr_cont("\n");
> 
> total: 0 errors, 2 warnings, 10 lines checked

Yes, I am aware of this, but it made sense for the fallback list to
be printed continuously on one line, hence the use of pr_cont().

Regards,
Bharata.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
  2021-08-30 12:29   ` Mel Gorman
  2021-08-31  9:58   ` Anshuman Khandual
@ 2021-09-03  4:20   ` Anshuman Khandual
  2021-09-03  4:43   ` Bharata B Rao
  3 siblings, 0 replies; 13+ messages in thread
From: Anshuman Khandual @ 2021-09-03  4:20 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan



On 8/30/21 5:46 PM, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> 
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with same distances occupying the same slot in different
> node fallback lists rather than appearing in the intended round-
> robin manner. This results in one node getting picked for allocation
> more compared to other nodes with the same distance.
> 
> As an example, consider a 4 node system with the following distance
> matrix.
> 
> Node 0  1  2  3
> ----------------
> 0    10 12 32 32
> 1    12 10 32 32
> 2    32 32 10 12
> 3    32 32 12 10
> 
> For this case, the node fallback list gets built like this:
> 
> Node	Fallback list
> ---------------------
> 0	0 1 2 3
> 1	1 0 3 2
> 2	2 3 0 1
> 3	3 2 0 1 <-- Unexpected fallback order
> 
> In the fallback list for nodes 2 and 3, the nodes 0 and 1
> appear in the same order which results in more allocations
> getting satisfied from node 0 compared to node 1.
> 
> The effect of this on remote memory bandwidth as seen by stream
> benchmark is shown below:
> 
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
> 	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
> 	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
> 
> ----------------------------------------
> 		BANDWIDTH (MB/s)
>     TEST	Case 1		Case 2
> ----------------------------------------
>     COPY	57479.6		110791.8
>    SCALE	55372.9		105685.9
>      ADD	50460.6		96734.2
>   TRIADD	50397.6		97119.1
> ----------------------------------------
> 
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0 as it appears first in the fallback order
> for both nodes 2 and 3.
> 
> This can be fixed by accumulating the node load in build_zonelists()
> rather than reinitializing it during each iteration. With this the
> nodes with the same distance rightly get assigned in the round robin
> manner. In fact this was how it was originally until the
> commit f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic") dropped the load accumulation and resorted to initializing
> the load during each iteration. While zonelist ordering was removed by
> commit c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"),
> the change to the node load accumulation in build_zonelists() remained.
> So essentially this patch reverts back to the accumulated node load
> logic.
> 
> After this fix, the fallback order gets built like this:
> 
> Node Fallback list
> ------------------
> 0    0 1 2 3
> 1    1 0 3 2
> 2    2 3 0 1
> 3    3 2 1 0 <-- Note the change here
> 
> The bandwidth in Case 1 improves and matches Case 2 as shown below.
> 
> ----------------------------------------
> 		BANDWIDTH (MB/s)
>     TEST	Case 1		Case 2
> ----------------------------------------
>     COPY	110438.9	110107.2
>    SCALE	105930.5	105817.5
>      ADD	97005.1		96159.8
>   TRIADD	97441.5		96757.1
> ----------------------------------------
> 
> The correctness of the fallback list generation has been verified
> for the above node configuration where the node 3 starts as
> memory-less node and comes up online only during memory hotplug.
> 
> [bharata@amd.com: Added changelog, review, test validation]
> 
> Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic")
> Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 22f7ad6ec11c..47f4d160971e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6268,7 +6268,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  		 */
>  		if (node_distance(local_node, node) !=
>  		    node_distance(local_node, prev_node))
> -			node_load[node] = load;
> +			node_load[node] += load;
>  
>  		node_order[nr_nodes++] = node;
>  		prev_node = node;
> 

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>


* Re: [FIX PATCH 1/2] mm/page_alloc: Print node fallback order
  2021-08-30 12:16 ` [FIX PATCH 1/2] mm/page_alloc: Print node fallback order Bharata B Rao
  2021-08-30 12:26   ` Mel Gorman
  2021-09-03  4:15   ` Anshuman Khandual
@ 2021-09-03  4:31   ` Anshuman Khandual
  2 siblings, 0 replies; 13+ messages in thread
From: Anshuman Khandual @ 2021-09-03  4:31 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, lee.schermerhorn, mgorman,
	Krupa.Ramakrishnan, Sadagopan.Srinivasan



On 8/30/21 5:46 PM, Bharata B Rao wrote:
> Print an information message about the allocation fallback order
> for each NUMA node during boot.
> 
> No functional changes here. This makes it easier to illustrate
> the problem in the node fallback list generation, which the
> next patch fixes.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  mm/page_alloc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..22f7ad6ec11c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6277,6 +6277,10 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>  	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>  	build_thisnode_zonelists(pgdat);
> +	pr_info("Fallback order for Node %d: ", local_node);
> +	for (node = 0; node < nr_nodes; node++)
> +		pr_cont("%d ", node_order[node]);
> +	pr_cont("\n");
>  }
>  
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> 

The node allocation preference sequence for a given accessing node is
important information on large systems. Printing it during boot and
during the (hotplug --> online) sequence will help the user.

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>


* Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
  2021-08-30 12:16 ` [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list Bharata B Rao
                     ` (2 preceding siblings ...)
  2021-09-03  4:20   ` Anshuman Khandual
@ 2021-09-03  4:43   ` Bharata B Rao
  3 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2021-09-03  4:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, kamezawa.hiroyu, mgorman, Krupa.Ramakrishnan, Sadagopan.Srinivasan


On 8/30/2021 5:46 PM, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> 
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with the same distance occupying the same slot in different
> node fallback lists rather than appearing in the intended round-robin
> manner. Consequently, one node gets picked for allocations more often
> than other nodes at the same distance.
> 
> As an example, consider a 4 node system with the following distance
> matrix.
> 
> Node 0  1  2  3
> ----------------
> 0    10 12 32 32
> 1    12 10 32 32
> 2    32 32 10 12
> 3    32 32 12 10
> 
> For this case, the node fallback list gets built like this:
> 
> Node	Fallback list
> ---------------------
> 0	0 1 2 3
> 1	1 0 3 2
> 2	2 3 0 1
> 3	3 2 0 1 <-- Unexpected fallback order

FWIW, for a dual-socket 8 node system with the following distance matrix,

node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

the fallback list looks like this:

Before
=======
Fallback order for Node 0: 0 1 2 3 4 5 6 7
Fallback order for Node 1: 1 2 3 0 5 6 7 4
Fallback order for Node 2: 2 3 0 1 6 7 4 5
Fallback order for Node 3: 3 0 1 2 7 4 5 6
Fallback order for Node 4: 4 5 6 7 0 1 2 3
Fallback order for Node 5: 5 6 7 4 0 1 2 3
Fallback order for Node 6: 6 7 4 5 0 1 2 3
Fallback order for Node 7: 7 4 5 6 0 1 2 3

After the fix
==============
Fallback order for Node 0: 0 1 2 3 4 5 6 7
Fallback order for Node 1: 1 2 3 0 5 6 7 4
Fallback order for Node 2: 2 3 0 1 6 7 4 5
Fallback order for Node 3: 3 0 1 2 7 4 5 6
Fallback order for Node 4: 4 5 6 7 0 1 2 3
Fallback order for Node 5: 5 6 7 4 1 2 3 0
Fallback order for Node 6: 6 7 4 5 2 3 0 1
Fallback order for Node 7: 7 4 5 6 3 0 1 2

So the problem becomes more pronounced for bigger NUMA systems.

Regards,
Bharata.

