Date: Tue, 29 Mar 2022 20:26:05 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Jagdish Gediya <jvgediya@linux.ibm.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, dave.hansen@linux.intel.com, ying.huang@intel.com
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
In-Reply-To: <20220329115222.8923-1-jvgediya@linux.ibm.com>
References: <20220329115222.8923-1-jvgediya@linux.ibm.com>

Hi Jagdish,

On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> The current implementation to identify the demotion
> targets limits some of the opportunities to share the
> demotion targets between multiple source nodes.
>
> Implement logic to identify loops in the demotion targets
> so that all of the demotion possibilities can be utilized.
> Don't share the used targets between all the nodes; instead,
> create the used targets from scratch for each individual
> node, based on which nodes this node is a demotion target
> for. This helps to share the demotion targets without
> missing any possible way of demotion.
>
> e.g. with the below NUMA topology, where node 0 & 1 are
> cpu + dram nodes, node 2 & 3 are equally slower memory-only
> nodes, and node 4 is the slowest memory-only node,
>
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus:
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node 4 cpus:
> node 4 size: n MB
> node 4 free: n MB
> node distances:
> node   0   1   2   3   4
>    0:  10  20  40  40  80
>    1:  20  10  40  40  80
>    2:  40  40  10  40  80
>    3:  40  40  40  10  80
>    4:  80  80  80  80  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>    0          3, 2
>    1          4
>    2          X
>    3          X
>    4          X
>
> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>    0          3, 2
>    1          3, 2
>    2          3
>    3          4
>    4          X

Node 2 and node 3 are both slow memory nodes and have the same distance,
so why should node 2 demote cold memory to node 3? They should both have
the same demotion target, node 4, which is the slowest memory node, right?
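To make that expectation concrete, below is a small userspace sketch (my
own illustration, not kernel code; the tier() heuristic based on the
distance from the nearest CPU node is purely an assumption of this
sketch) that derives, from the distance matrix quoted above, the nearest
strictly slower node for each memory-only node:

/*
 * Standalone illustration (not kernel code) of the ordering I would
 * expect from the distance matrix quoted above: each memory-only node
 * demotes to the nearest node of a strictly slower tier, so node 2 and
 * node 3 both pick node 4 instead of each other.
 */
#include <stdio.h>

#define NR_NODES 5

/* node distances copied from the commit message above */
static const int dist[NR_NODES][NR_NODES] = {
        { 10, 20, 40, 40, 80 },
        { 20, 10, 40, 40, 80 },
        { 40, 40, 10, 40, 80 },
        { 40, 40, 40, 10, 80 },
        { 80, 80, 80, 80, 10 },
};

/* rough "tier" of a node: its distance from the nearest CPU node (0 or 1) */
static int tier(int node)
{
        return dist[node][0] < dist[node][1] ? dist[node][0] : dist[node][1];
}

int main(void)
{
        for (int node = 2; node < NR_NODES; node++) {
                int target = -1, best = -1;

                for (int t = 2; t < NR_NODES; t++) {
                        /* skip self and nodes of the same or a faster tier */
                        if (t == node || tier(t) <= tier(node))
                                continue;
                        if (best == -1 || dist[node][t] < best) {
                                best = dist[node][t];
                                target = t;
                        }
                }
                if (target < 0)
                        printf("node %d -> X\n", node);
                else
                        printf("node %d -> %d\n", node, target);
        }
        return 0;
}

It prints node 2 -> 4, node 3 -> 4 and node 4 -> X, which is the
layering I would expect here.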
>
> e.g. with the below NUMA topology, where node 0, 1 & 2 are
> cpu + dram nodes and node 3 is a slow memory node,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 4 5
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>    0:  10  20  20  40
>    1:  20  10  20  40
>    2:  20  20  10  40
>    3:  40  40  40  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>    0          3
>    1          X
>    2          X
>    3          X
>
> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>    0          3
>    1          3
>    2          3
>    3          X

Sounds reasonable.

>
> with the below NUMA topology, where node 0 & 2 are cpu + dram
> nodes and node 1 & 3 are slow memory nodes,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>    0:  10  40  20  80
>    1:  40  10  80  80
>    2:  20  80  10  40
>    3:  80  80  40  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>    0          3
>    1          X
>    2          3
>    3          X

If I understand correctly, this is not true. With the existing
implementation the demotion route should be:

node 0 ---> node 1
node 1 ---> X
node 2 ---> node 3
node 3 ---> X

>
> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>    0          1
>    1          3
>    2          3
>    3          X
>
> As can be seen above, node 3 can be a demotion target for node 1,
> but the existing implementation doesn't configure it that way. It
> is better to move pages from node 1 to node 3 than to move them
> from node 1 to swap.

Does that mean node 3 is the slowest memory node?
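To back up the route I listed above, here is a rough userspace sketch of
my reading of the current pass-based selection (not the kernel code
itself; the dist and has_cpu arrays are just copied from the quoted
topology), run against that distance matrix:

/*
 * Rough sketch of the *current* pass-based target selection: sources of
 * each pass are marked used, every source picks its nearest still-unused
 * node, and the chosen targets become the sources of the next pass.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

/* node distances copied from the commit message above */
static const int dist[NR_NODES][NR_NODES] = {
        { 10, 40, 20, 80 },
        { 40, 10, 80, 80 },
        { 20, 80, 10, 40 },
        { 80, 80, 40, 10 },
};

static const bool has_cpu[NR_NODES] = { true, false, true, false };

int main(void)
{
        bool this_pass[NR_NODES], next_pass[NR_NODES] = { false };
        bool used[NR_NODES] = { false };

        /* the walk starts at the nodes with CPUs */
        for (int n = 0; n < NR_NODES; n++)
                this_pass[n] = has_cpu[n];

        while (1) {
                bool more = false;

                /* sources of this pass can never become targets later */
                for (int n = 0; n < NR_NODES; n++)
                        if (this_pass[n])
                                used[n] = true;

                for (int n = 0; n < NR_NODES; n++) {
                        int target = -1, best = -1;

                        if (!this_pass[n])
                                continue;
                        /* nearest node that is still allowed as a target */
                        for (int t = 0; t < NR_NODES; t++) {
                                if (used[t])
                                        continue;
                                if (best == -1 || dist[n][t] < best) {
                                        best = dist[n][t];
                                        target = t;
                                }
                        }
                        if (target < 0) {
                                printf("node %d -> X\n", n);
                        } else {
                                printf("node %d -> %d\n", n, target);
                                next_pass[target] = true;
                                more = true;
                        }
                }
                if (!more)
                        break;
                /* targets of this pass become the sources of the next one */
                for (int n = 0; n < NR_NODES; n++) {
                        this_pass[n] = next_pass[n];
                        next_pass[n] = false;
                }
        }
        return 0;
}

Running it prints node 0 -> 1 and node 2 -> 3 in the first pass and
node 1 -> X / node 3 -> X in the second, which is the route above.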
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
>  1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3d60823afd2d..7ec8d934e706 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
>   */
>  static void __set_migration_target_nodes(void)
>  {
> -        nodemask_t next_pass = NODE_MASK_NONE;
> -        nodemask_t this_pass = NODE_MASK_NONE;
>          nodemask_t used_targets = NODE_MASK_NONE;
>          int node, best_distance;
> +        nodemask_t *src_nodes;
> +
> +        src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> +        if (!src_nodes)
> +                return;
>
>          /*
>           * Avoid any oddities like cycles that could occur
> @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
>           */
>          disable_all_migrate_targets();
>
> -        /*
> -         * Allocations go close to CPUs, first. Assume that
> -         * the migration path starts at the nodes with CPUs.
> -         */
> -        next_pass = node_states[N_CPU];
> -again:
> -        this_pass = next_pass;
> -        next_pass = NODE_MASK_NONE;
> -        /*
> -         * To avoid cycles in the migration "graph", ensure
> -         * that migration sources are not future targets by
> -         * setting them in 'used_targets'. Do this only
> -         * once per pass so that multiple source nodes can
> -         * share a target node.
> -         *
> -         * 'used_targets' will become unavailable in future
> -         * passes. This limits some opportunities for
> -         * multiple source nodes to share a destination.
> -         */
> -        nodes_or(used_targets, used_targets, this_pass);
> +        for_each_online_node(node) {
> +                int tmp_node;
>
> -        for_each_node_mask(node, this_pass) {
>                  best_distance = -1;
> +                used_targets = NODE_MASK_NONE;
> +
> +                /*
> +                 * Avoid adding same node as the demotion target.
> +                 */
> +                node_set(node, used_targets);
> +
> +                /*
> +                 * Add CPU NUMA nodes to the used target list so that it
> +                 * won't be considered a demotion target.
> +                 */
> +                nodes_or(used_targets, used_targets, node_states[N_CPU]);
> +
> +                /*
> +                 * Add all nodes that has appeared as source node of demotion
> +                 * for this target node.
> +                 *
> +                 * To avoid cycles in the migration "graph", ensure
> +                 * that migration sources are not future targets by
> +                 * setting them in 'used_targets'.
> +                 */
> +                for_each_node_mask(tmp_node, src_nodes[node])
> +                        nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> +
> +                /*
> +                 * Now update the demotion src nodes with other nodes in graph
> +                 * which got computed above.
> +                 */
> +                nodes_or(src_nodes[node], src_nodes[node], used_targets);
>
>                  /*
>                   * Try to set up the migration path for the node, and the target
> @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
>                          best_distance = node_distance(node, target_node);
>
>                          /*
> -                         * Visit targets from this pass in the next pass.
> -                         * Eventually, every node will have been part of
> -                         * a pass, and will become set in 'used_targets'.
> +                         * Add this node in the src_nodes list so that we can
> +                         * detect the looping.
>                           */
> -                        node_set(target_node, next_pass);
> +                        node_set(node, src_nodes[target_node]);
>                  } while (1);
>          }
> -        /*
> -         * 'next_pass' contains nodes which became migration
> -         * targets in this pass. Make additional passes until
> -         * no more migrations targets are available.
> -         */
> -        if (!nodes_empty(next_pass))
> -                goto again;
> +
> +        kfree(src_nodes);
>  }
>
>  /*