linux-kernel.vger.kernel.org archive mirror
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Jagdish Gediya <jvgediya@linux.ibm.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
	dave.hansen@linux.intel.com, ying.huang@intel.com
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
Date: Wed, 30 Mar 2022 14:37:35 +0800	[thread overview]
Message-ID: <784aee91-6a01-6e67-389e-1e1883796894@linux.alibaba.com> (raw)
In-Reply-To: <YkMR8OY779Bcri3I@li-6e1fa1cc-351b-11b2-a85c-b897023bb5f3.ibm.com>



On 3/29/2022 10:04 PM, Jagdish Gediya wrote:
> On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
> Hi Baolin,
>> Hi Jagdish,
>>
>> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
>>> The current implementation for identifying the demotion
>>> targets limits some of the opportunities to share the
>>> demotion targets between multiple source nodes.
>>>
>>> Implement logic to identify loops in the demotion targets
>>> so that all demotion possibilities can be utilized. Don't
>>> share the used targets between all the nodes; instead,
>>> build the used targets from scratch for each individual
>>> node, based on which nodes this node is a demotion target
>>> for. This helps to share the demotion targets without
>>> missing any possible demotion path.
>>>
>>> e.g. with the below NUMA topology, where nodes 0 & 1 are
>>> cpu + dram nodes, nodes 2 & 3 are equally slow memory-only
>>> nodes, and node 4 is the slowest memory-only node,
>>>
>>> available: 5 nodes (0-4)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus:
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node 4 cpus:
>>> node 4 size: n MB
>>> node 4 free: n MB
>>> node distances:
>>> node   0   1   2   3   4
>>>     0:  10  20  40  40  80
>>>     1:  20  10  40  40  80
>>>     2:  40  40  10  40  80
>>>     3:  40  40  40  10  80
>>>     4:  80  80  80  80  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3, 2
>>>    1              4
>>>    2              X
>>>    3              X
>>>    4		X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              3, 2
>>>    1              3, 2
>>>    2              3
>>>    3              4
>>>    4		X
>>
>> Node 2 and node 3 both are slow memory and have same distance, why node 2
>> should demote cold memory to node 3? They should have the same target
>> demotion node 4, which is the slowest memory node, right?
>>
> The current demotion-target finding algorithm works on best distance: since the distance between nodes 2 & 3 is 40 while the distance between nodes 2 & 4 is 80, node 2 demotes to node 3.

If node 2 can demote to node 3, that means node 3's memory is colder 
than node 2's, right? Node 3's access time should be larger than node 
2's, so that we can demote colder memory from node 2 to node 3.

But node 2 and node 3 are the same memory type and have the same 
distance, so the access times of node 2 and node 3 should be the same 
too. Why add so much page migration between node 2 and node 3? I'm 
still not sure about the benefits.

Huang Ying and Dave, what do you think about these demotion targets?

>>>
>>> e.g. with the below NUMA topology, where nodes 0, 1 & 2 are
>>> cpu + dram nodes and node 3 is a slow memory node,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 4 5
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node   0   1   2   3
>>>     0:  10  20  20  40
>>>     1:  20  10  20  40
>>>     2:  20  20  10  40
>>>     3:  40  40  40  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              X
>>>    2              X
>>>    3              X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              3
>>>    2              3
>>>    3              X
>>
>> Sounds reasonable.
>>
>>>
>>> with the below NUMA topology, where nodes 0 & 2 are cpu + dram
>>> nodes and nodes 1 & 3 are slow memory nodes,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus:
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 2 3
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node   0   1   2   3
>>>     0:  10  40  20  80
>>>     1:  40  10  80  80
>>>     2:  20  80  10  40
>>>     3:  80  80  40  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              X
>>>    2              3
>>>    3              X
>>
>> If I understand correctly, this is not true. The demotion route should be
>> as below with the existing implementation:
>> node 0 ---> node 1
>> node 1 ---> X
>> node 2 ---> node 3
>> node 3 ---> X
>>
> It's a typo; it should be 0 -> 1. Will correct it in v2.
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              1
>>>    1              3
>>>    2              3
>>>    3              X
>>>
>>> As can be seen above, node 3 can be a demotion target for node
>>> 1, but the existing implementation doesn't configure it that
>>> way. It is better to move pages from node 1 to node 3 than to
>>> move them from node 1 to swap.
>>
>> Which means node 3 is the slowest memory node?
>>
> Nodes 1 and 3 are equally slow, but node 1 is near node 0 and node 3 is near node 2. Basically, you can think of node 1 as a slow-memory logical node near node 0, and node 3 as a slow-memory logical node near node 2.

OK.


Thread overview: 19+ messages
2022-03-29 11:52 [PATCH] mm: migrate: set demotion targets differently Jagdish Gediya
2022-03-29 12:26 ` Baolin Wang
2022-03-29 14:04   ` Jagdish Gediya
2022-03-30  6:37     ` Baolin Wang [this message]
2022-03-30  6:54       ` Huang, Ying
2022-03-29 14:31 ` Dave Hansen
2022-03-29 16:46   ` Jagdish Gediya
2022-03-29 22:40     ` Dave Hansen
2022-03-30  6:46 ` Huang, Ying
2022-03-30 16:36   ` Jagdish Gediya
2022-03-31  0:27     ` Huang, Ying
2022-03-31 11:17     ` Jonathan Cameron
2022-03-30 17:17   ` Aneesh Kumar K.V
2022-03-31  0:32     ` Huang, Ying
2022-03-31  6:45       ` Aneesh Kumar K.V
2022-03-31  7:23         ` Huang, Ying
2022-03-31  8:27           ` Aneesh Kumar K.V
2022-03-31  8:58             ` Huang, Ying
2022-03-31  9:33               ` Baolin Wang
