Date: Tue, 29 Mar 2022 19:34:32 +0530
From: Jagdish Gediya
To: Baolin Wang
Cc: linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 aneesh.kumar@linux.ibm.com, dave.hansen@linux.intel.com,
 ying.huang@intel.com
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
References: <20220329115222.8923-1-jvgediya@linux.ibm.com>

On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:

Hi Baolin,

> Hi Jagdish,
> 
> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
> > 
> > Implement a logic to identify the loop in the demotion
> > targets such that all the possibilities of demotion can
> > be utilized. Don't share the used targets between all
> > the nodes, instead create the used targets from scratch
> > for each individual node based on for what all node this
> > node is a demotion target. This helps to share the demotion
> > targets without missing any possible way of demotion.
> > 
> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> > 
> > available: 5 nodes (0-4)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus:
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node 4 cpus:
> > node 4 size: n MB
> > node 4 free: n MB
> > node distances:
> > node   0   1   2   3   4
> >   0:  10  20  40  40  80
> >   1:  20  10  40  40  80
> >   2:  40  40  10  40  80
> >   3:  40  40  40  10  80
> >   4:  80  80  80  80  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >  0              3, 2
> >  1              4
> >  2              X
> >  3              X
> >  4              X
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >  0              3, 2
> >  1              3, 2
> >  2              3
> >  3              4
> >  4              X
> 
> Node 2 and node 3 both are slow memory and have same distance, why node 2
> should demote cold memory to node 3? They should have the same target
> demotion node 4, which is the slowest memory node, right?
> 
The current demotion target finding algorithm works based on best
distance: since the distance between nodes 2 & 3 is 40 while the
distance between nodes 2 & 4 is 80, node 2 demotes to node 3.
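
To make that selection rule concrete, here is a tiny standalone
userspace sketch (purely illustrative, the table and the
best_demotion_target() helper are made up here, this is not the kernel
code) that applies the best-distance rule to the 5-node distance table
quoted above:

#include <stdio.h>

#define NR_NODES 5

/* Distance table from the first example above. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40, 80 },
	{ 20, 10, 40, 40, 80 },
	{ 40, 40, 10, 40, 80 },
	{ 40, 40, 40, 10, 80 },
	{ 80, 80, 80, 80, 10 },
};

/* Return the allowed target with the smallest distance from @node. */
static int best_demotion_target(int node, const int *allowed)
{
	int target = -1, best = -1, i;

	for (i = 0; i < NR_NODES; i++) {
		if (!allowed[i] || i == node)
			continue;
		if (best < 0 || node_distance[node][i] < best) {
			best = node_distance[node][i];
			target = i;
		}
	}
	return target;
}

int main(void)
{
	/* For node 2, only the memory-only nodes 3 and 4 are candidates. */
	int allowed[NR_NODES] = { 0, 0, 0, 1, 1 };

	/* Prints 3: distance 2->3 is 40, distance 2->4 is 80. */
	printf("node 2 demotes to node %d\n", best_demotion_target(2, allowed));
	return 0;
}

So node 2 prefers node 3 purely because it is nearer, even though node
4 is the slowest tier.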
> > 
> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> > 
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 4 5
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  20  20  40
> >   1:  20  10  20  40
> >   2:  20  20  10  40
> >   3:  40  40  40  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >  0              3
> >  1              X
> >  2              X
> >  3              X
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >  0              3
> >  1              3
> >  2              3
> >  3              X
> 
> Sounds reasonable.
> > 
> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,
> > 
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  40  20  80
> >   1:  40  10  80  80
> >   2:  20  80  10  40
> >   3:  80  80  40  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >  0              3
> >  1              X
> >  2              3
> >  3              X
> 
> If I understand correctly, this is not true. The demotion route should be as
> below with existing implementation:
> node 0 ---> node 1
> node 1 ---> X
> node 2 ---> node 3
> node 3 ---> X
> 
It's a typo, it should be 0 -> 1. Will correct it in v2.
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >  0              1
> >  1              3
> >  2              3
> >  3              X
> > 
> > As it can be seen above, node 3 can be demotion target for node
> > 1 but existing implementation doesn't configure it that way. It
> > is better to move pages from node 1 to node 3 instead of moving
> > it from node 1 to swap.
> 
> Which means node 3 is the slowest memory node?
> 
Nodes 1 and 3 are equally slow, but node 1 is near node 0 and node 3 is
near node 2. Basically, you can think of node 1 as the slow-memory
logical node near node 0 and node 3 as the slow-memory logical node
near node 2.
> > 
> > Signed-off-by: Jagdish Gediya
> > Signed-off-by: Aneesh Kumar K.V
> > ---
> >  mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
> >  1 file changed, 41 insertions(+), 34 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 3d60823afd2d..7ec8d934e706 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
> >  	 */
> >  static void __set_migration_target_nodes(void)
> >  {
> > -	nodemask_t next_pass = NODE_MASK_NONE;
> > -	nodemask_t this_pass = NODE_MASK_NONE;
> >  	nodemask_t used_targets = NODE_MASK_NONE;
> >  	int node, best_distance;
> > +	nodemask_t *src_nodes;
> > +
> > +	src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> > +	if (!src_nodes)
> > +		return;
> >  
> >  	/*
> >  	 * Avoid any oddities like cycles that could occur
> > @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
> >  	 */
> >  	disable_all_migrate_targets();
> >  
> > -	/*
> > -	 * Allocations go close to CPUs, first. Assume that
> > -	 * the migration path starts at the nodes with CPUs.
> > -	 */
> > -	next_pass = node_states[N_CPU];
> > -again:
> > -	this_pass = next_pass;
> > -	next_pass = NODE_MASK_NONE;
> > -	/*
> > -	 * To avoid cycles in the migration "graph", ensure
> > -	 * that migration sources are not future targets by
> > -	 * setting them in 'used_targets'. Do this only
> > -	 * once per pass so that multiple source nodes can
> > -	 * share a target node.
> > -	 *
> > -	 * 'used_targets' will become unavailable in future
> > -	 * passes. This limits some opportunities for
> > -	 * multiple source nodes to share a destination.
> > -	 */
> > -	nodes_or(used_targets, used_targets, this_pass);
> > +	for_each_online_node(node) {
> > +		int tmp_node;
> >  
> > -	for_each_node_mask(node, this_pass) {
> >  		best_distance = -1;
> > +		used_targets = NODE_MASK_NONE;
> > +
> > +		/*
> > +		 * Avoid adding same node as the demotion target.
> > +		 */
> > +		node_set(node, used_targets);
> > +
> > +		/*
> > +		 * Add CPU NUMA nodes to the used target list so that it
> > +		 * won't be considered a demotion target.
> > +		 */
> > +		nodes_or(used_targets, used_targets, node_states[N_CPU]);
> > +
> > +		/*
> > +		 * Add all nodes that has appeared as source node of demotion
> > +		 * for this target node.
> > +		 *
> > +		 * To avoid cycles in the migration "graph", ensure
> > +		 * that migration sources are not future targets by
> > +		 * setting them in 'used_targets'.
> > +		 */
> > +		for_each_node_mask(tmp_node, src_nodes[node])
> > +			nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> > +
> > +		/*
> > +		 * Now update the demotion src nodes with other nodes in graph
> > +		 * which got computed above.
> > +		 */
> > +		nodes_or(src_nodes[node], src_nodes[node], used_targets);
> >  
> >  		/*
> >  		 * Try to set up the migration path for the node, and the target
> > @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
> >  			best_distance = node_distance(node, target_node);
> >  
> >  			/*
> > -			 * Visit targets from this pass in the next pass.
> > -			 * Eventually, every node will have been part of
> > -			 * a pass, and will become set in 'used_targets'.
> > +			 * Add this node in the src_nodes list so that we can
> > +			 * detect the looping.
> >  			 */
> > -			node_set(target_node, next_pass);
> > +			node_set(node, src_nodes[target_node]);
> >  		} while (1);
> >  	}
> > -	/*
> > -	 * 'next_pass' contains nodes which became migration
> > -	 * targets in this pass. Make additional passes until
> > -	 * no more migrations targets are available.
> > -	 */
> > -	if (!nodes_empty(next_pass))
> > -		goto again;
> > +
> > +	kfree(src_nodes);
> >  }
> >  
> >  /*
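
In case it helps review, the net effect of the per-node pass above can
be reproduced with this small userspace sketch (plain unsigned bitmasks
stand in for nodemask_t, and the names only loosely mirror the patch,
so treat it as an illustration rather than the kernel code):

#include <stdio.h>

#define NR_NODES 5

/* Distance table and CPU placement from the first example topology. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40, 80 },
	{ 20, 10, 40, 40, 80 },
	{ 40, 40, 10, 40, 80 },
	{ 40, 40, 40, 10, 80 },
	{ 80, 80, 80, 80, 10 },
};

static const int has_cpu[NR_NODES] = { 1, 1, 0, 0, 0 };

int main(void)
{
	/*
	 * src_nodes[t] is the set of nodes that demote (directly or
	 * transitively) to node t.
	 */
	unsigned int src_nodes[NR_NODES] = { 0 };
	int node;

	for (node = 0; node < NR_NODES; node++) {
		unsigned int used_targets = 1u << node;
		int i, target, best = -1;

		/* CPU nodes are never demotion targets. */
		for (i = 0; i < NR_NODES; i++)
			if (has_cpu[i])
				used_targets |= 1u << i;

		/*
		 * Nodes that already demote to us must not become our
		 * targets, otherwise we would build a cycle.
		 */
		for (i = 0; i < NR_NODES; i++)
			if (src_nodes[node] & (1u << i))
				used_targets |= src_nodes[i];
		src_nodes[node] |= used_targets;

		/* Pick every remaining node at the best distance. */
		printf("node %d ->", node);
		for (;;) {
			target = -1;
			for (i = 0; i < NR_NODES; i++) {
				if (used_targets & (1u << i))
					continue;
				if (target == -1 ||
				    node_distance[node][i] < node_distance[node][target])
					target = i;
			}
			if (target == -1 ||
			    (best >= 0 && node_distance[node][target] != best))
				break;
			best = node_distance[node][target];
			used_targets |= 1u << target;
			src_nodes[target] |= 1u << node;
			printf(" %d", target);
		}
		printf("\n");
	}
	return 0;
}

On the first example topology it prints 0 -> 2 3, 1 -> 2 3, 2 -> 3,
3 -> 4 and nothing for node 4, i.e. the same targets as in the
changelog table (ordering aside).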