Re: [PATCH] mm: migrate: set demotion targets differently

From: "Huang, Ying" <ying.huang@intel.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, baolin.wang@linux.alibaba.com,
	dave.hansen@linux.intel.com, Fan Du <fan.du@intel.com>
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
Date: Thu, 31 Mar 2022 15:23:16 +0800	[thread overview]
Message-ID: <87h77ebn6j.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <87ilruy5zt.fsf@linux.ibm.com> (Aneesh Kumar K. V.'s message of "Thu, 31 Mar 2022 12:15:58 +0530")

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Hi, Jagdish,
>>>>
>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>
>>>
>>> ...
>>>
>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>
>>>>> available: 5 nodes (0-4)
>>>>> node 0 cpus: 0 1
>>>>> node 0 size: n MB
>>>>> node 0 free: n MB
>>>>> node 1 cpus: 2 3
>>>>> node 1 size: n MB
>>>>> node 1 free: n MB
>>>>> node 2 cpus:
>>>>> node 2 size: n MB
>>>>> node 2 free: n MB
>>>>> node 3 cpus:
>>>>> node 3 size: n MB
>>>>> node 3 free: n MB
>>>>> node 4 cpus:
>>>>> node 4 size: n MB
>>>>> node 4 free: n MB
>>>>> node distances:
>>>>> node   0   1   2   3   4
>>>>>   0:  10  20  40  40  80
>>>>>   1:  20  10  40  40  80
>>>>>   2:  40  40  10  40  80
>>>>>   3:  40  40  40  10  80
>>>>>   4:  80  80  80  80  10
>>>>>
>>>>> The existing implementation gives below demotion targets,
>>>>>
>>>>> node    demotion_target
>>>>>  0              3, 2
>>>>>  1              4
>>>>>  2              X
>>>>>  3              X
>>>>>  4		X
>>>>>
>>>>> With this patch applied, below are the demotion targets,
>>>>>
>>>>> node    demotion_target
>>>>>  0              3, 2
>>>>>  1              3, 2
>>>>>  2              3
>>>>>  3              4
>>>>>  4		X
>>>>
>>>> For such machine, I think the perfect demotion order is,
>>>>
>>>> node    demotion_target
>>>>  0              2, 3
>>>>  1              2, 3
>>>>  2              4
>>>>  3              4
>>>>  4              X
>>>
>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>> differentiate between them, firmware can present a single NUMA node
>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>> that there is some difference between these two memory devices. This is
>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>
>> Do you have more information about this?
>
> Not sure I follow the question there. I was checking shouldn't firmware
> do a single NUMA node if two memory devices are of the same type? How will
> optane present such a config? Both the DIMMs will have the same
> proximity domain value and hence dax kmem will add them to the same NUMA
> node?

Sorry for confusing.  I just wanted to check whether you have more
information about the machine configuration above.  The machines in my
hand have no complex NUMA topology as in the patch description.

> If you are suggesting that firmware doesn't do that, then I agree with you
> that a demotion target like the below is good. 
>
>  node    demotion_target
>   0              2, 3
>   1              2, 3
>   2              4
>   3              4
>   4              X
>
> We can also achieve that with a smiple change as below.

Glad to see the demotion order can be implemented in a simple way.

My concern is that is it necessary to do this?  If there are real
machines with the NUMA topology, then I think it's good to add the
support.  But if not, why do we make the code complex unnecessarily?

I don't have these kind of machines, do you have and will have?

> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void)
>  {
>  	nodemask_t next_pass	= NODE_MASK_NONE;
>  	nodemask_t this_pass	= NODE_MASK_NONE;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> +	nodemask_t this_pass_used_targets = NODE_MASK_NONE;
>  	int node, best_distance;
>  
>  	/*
> @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void)
>  	/*
>  	 * To avoid cycles in the migration "graph", ensure
>  	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> +	 * setting them in 'this_pass_used_targets'.  Do this only
>  	 * once per pass so that multiple source nodes can
>  	 * share a target node.
>  	 *
> -	 * 'used_targets' will become unavailable in future
> +	 * 'this_pass_used_targets' will become unavailable in future
>  	 * passes.  This limits some opportunities for
>  	 * multiple source nodes to share a destination.
>  	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> +	nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass);
>  
>  	for_each_node_mask(node, this_pass) {
> +
> +		nodemask_t used_targets = this_pass_used_targets;
> +
>  		best_distance = -1;
>  
>  		/*

Best Regards,
Huang, Ying