From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Yang Shi
Cc: Zi Yan, Dave Hansen, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Michal Hocko, Wei Xu, David Rientjes, Dan Williams,
 David Hildenbrand, osalvador
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
References: <20210618061537.434999-1-ying.huang@intel.com>
 <20210618061537.434999-3-ying.huang@intel.com>
 <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com>
 <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 22 Jun 2021 08:55:18 +0800
In-Reply-To: (Yang Shi's message of "Mon, 21 Jun 2021 12:51:30 -0700")
Message-ID: <87wnqmn2fd.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
X-Mailing-List: linux-kernel@vger.kernel.org

Yang Shi writes:

> On Sat, Jun 19, 2021 at 1:19 AM Huang, Ying wrote:
>>
>> Zi Yan writes:
>>
>> > On 18 Jun 2021, at 2:15, Huang Ying wrote:
>> >
>> >> From: Dave Hansen
>> >>
>> >> When memory fills up on a node, memory contents can be
>> >> automatically migrated to another node.  The biggest problems are
>> >> knowing when to migrate and to where the migration should be
>> >> targeted.
>> >>
>> >> The most straightforward way to generate the "to where" list would
>> >> be to follow the page allocator fallback lists.  Those lists
>> >> already tell us, if memory is full, where to look next.  It would
>> >> also be logical to move memory in that order.
>> >>
>> >> But, the allocator fallback lists have a fatal flaw: most nodes
>> >> appear in all the lists.  This would potentially lead to migration
>> >> cycles (A->B, B->A, A->B, ...).
>> >>
>> >> Instead of using the allocator fallback lists directly, keep a
>> >> separate node migration ordering.  But, reuse the same data used
>> >> to generate page allocator fallback in the first place:
>> >> find_next_best_node().
>> >>
>> >> This means that the firmware data used to populate node distances
>> >> essentially dictates the ordering for now.  It should also be
>> >> architecture-neutral since all NUMA architectures have a working
>> >> find_next_best_node().
>> >>
>> >> The protocol for node_demotion[] access and writing is not
>> >> standard.  It has no specific locking and is intended to be read
>> >> locklessly.  Readers must take care to avoid observing changes
>> >> that appear incoherent.  This was done so that node_demotion[]
>> >> locking has no chance of becoming a bottleneck on large systems
>> >> with lots of CPUs in direct reclaim.
>> >>
>> >> This code is unused for now.  It will be called later in the
>> >> series.
>> >>
>> >> Signed-off-by: Dave Hansen
>> >> Signed-off-by: "Huang, Ying"
>> >> Reviewed-by: Yang Shi
>> >> Cc: Michal Hocko
>> >> Cc: Wei Xu
>> >> Cc: David Rientjes
>> >> Cc: Dan Williams
>> >> Cc: David Hildenbrand
>> >> Cc: osalvador
>> >>
>> >> --
>> >>
>> >> Changes from 20200122:
>> >>  * Add big node_demotion[] comment
>> >> Changes from 20210302:
>> >>  * Fix typo in node_demotion[] comment
>> >> ---
>> >>  mm/internal.h   |   5 ++
>> >>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>> >>  mm/page_alloc.c |   2 +-
>> >>  3 files changed, 180 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/mm/internal.h b/mm/internal.h
>> >> index 2f1182948aa6..0344cd78e170 100644
>> >> --- a/mm/internal.h
>> >> +++ b/mm/internal.h
>> >> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>> >>
>> >>  #ifdef CONFIG_NUMA
>> >>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>> >> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>> >>  #else
>> >>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>> >>  				unsigned int order)
>> >>  {
>> >>  	return NODE_RECLAIM_NOSCAN;
>> >>  }
>> >> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>> >> +{
>> >> +	return NUMA_NO_NODE;
>> >> +}
>> >>  #endif
>> >>
>> >>  extern int hwpoison_filter(struct page *p);
>> >> diff --git a/mm/migrate.c b/mm/migrate.c
>> >> index 6cab668132f9..111f8565f75d 100644
>> >> --- a/mm/migrate.c
>> >> +++ b/mm/migrate.c
>> >> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >>  	return rc;
>> >>  }
>> >>
>> >> +
>> >> +/*
>> >> + * node_demotion[] example:
>> >> + *
>> >> + * Consider a system with two sockets.  Each socket has
>> >> + * three classes of memory attached: fast, medium and slow.
>> >> + * Each memory class is placed in its own NUMA node.  The
>> >> + * CPUs are placed in the node with the "fast" memory.  The
>> >> + * 6 NUMA nodes (0-5) might be split among the sockets like
>> >> + * this:
>> >> + *
>> >> + *	Socket A: 0, 1, 2
>> >> + *	Socket B: 3, 4, 5
>> >> + *
>> >> + * When Node 0 fills up, its memory should be migrated to
>> >> + * Node 1.  When Node 1 fills up, it should be migrated to
>> >> + * Node 2.
>> >> + * The migration paths start on the nodes with the
>> >> + * processors (since allocations default to this node) and
>> >> + * fast memory, progress through medium and end with the
>> >> + * slow memory:
>> >> + *
>> >> + *	0 -> 1 -> 2 -> stop
>> >> + *	3 -> 4 -> 5 -> stop
>> >> + *
>> >> + * This is represented in the node_demotion[] like this:
>> >> + *
>> >> + *	{  1, // Node 0 migrates to 1
>> >> + *	   2, // Node 1 migrates to 2
>> >> + *	  -1, // Node 2 does not migrate
>> >> + *	   4, // Node 3 migrates to 4
>> >> + *	   5, // Node 4 migrates to 5
>> >> + *	  -1} // Node 5 does not migrate
>> >> + */
>> >> +
>> >> +/*
>> >> + * Writes to this array occur without locking.  READ_ONCE()
>> >> + * is recommended for readers to ensure consistent reads.
>> >> + */
>> >>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>> >>  	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> >>
>> >> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>> >>   */
>> >>  int next_demotion_node(int node)
>> >>  {
>> >> -	return node_demotion[node];
>> >> +	/*
>> >> +	 * node_demotion[] is updated without excluding
>> >> +	 * this function from running.  READ_ONCE() avoids
>> >> +	 * reading multiple, inconsistent 'node' values
>> >> +	 * during an update.
>> >> +	 */
>> >> +	return READ_ONCE(node_demotion[node]);
>> >>  }
>> >
>> > Is it necessary to have two separate patches to add node_demotion and
>> > next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>> >
>> > Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>> > I guess it might be much simpler to just introduce node_demotion with RCU
>> > in this patch and Patch 3 only takes care of hotplug events.
>>
>> Hi, Dave,
>>
>> What do you think about this?
>
> Squashing patches #1 and #2 was mentioned in the previous review
> and it seems Dave agreed.
> https://lore.kernel.org/linux-mm/4573cb9a-31ca-3b3d-96bc-5d94876b9709@intel.com/

Thanks a lot for your information!
Best Regards,
Huang, Ying

[snip]
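For readers who want to experiment with the pass-by-pass demotion-order
construction discussed in this patch, below is a minimal user-space
sketch. It is an illustration under assumed inputs, not the kernel
implementation: the distance table encodes the two-socket example from
the node_demotion[] comment, next_best_node() is a hypothetical,
simplified stand-in for the kernel's find_next_best_node(), and
READ_ONCE() is reduced to a plain volatile read.

        /*
         * demotion_sketch.c - hypothetical user-space sketch, not kernel code.
         * Builds per-socket demotion chains pass by pass from an assumed NUMA
         * distance table, then reads the result locklessly.
         */
        #include <stdio.h>

        #define NR_NODES 6
        #define NO_NODE  -1

        /* Simplified stand-in for the kernel's READ_ONCE() macro. */
        #define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

        /* Assumed distances: fast/medium/slow nodes 0-2 on socket A, 3-5 on B. */
        static const int distance[NR_NODES][NR_NODES] = {
        	{ 10, 20, 30, 40, 50, 60 },
        	{ 20, 10, 20, 50, 40, 50 },
        	{ 30, 20, 10, 60, 50, 40 },
        	{ 40, 50, 60, 10, 20, 30 },
        	{ 50, 40, 50, 20, 10, 20 },
        	{ 60, 50, 40, 30, 20, 10 },
        };

        static int node_demotion[NR_NODES] = {
        	[0 ... NR_NODES - 1] = NO_NODE
        };

        /* Nearest node not yet claimed; crude find_next_best_node() stand-in. */
        static int next_best_node(int node, const int used[])
        {
        	int best = NO_NODE;

        	for (int n = 0; n < NR_NODES; n++) {
        		if (used[n])
        			continue;
        		if (best == NO_NODE || distance[node][n] < distance[node][best])
        			best = n;
        	}
        	return best;
        }

        int main(void)
        {
        	int used[NR_NODES] = { 0 };
        	int this_pass[NR_NODES] = { 0, 3 };	/* CPU-bearing "fast" nodes */
        	int this_cnt = 2;

        	used[0] = used[3] = 1;

        	/*
        	 * Each pass extends every chain by one hop into the next memory
        	 * tier; a node whose search comes up empty terminates its chain.
        	 */
        	while (this_cnt) {
        		int next_pass[NR_NODES], next_cnt = 0;

        		for (int i = 0; i < this_cnt; i++) {
        			int node = this_pass[i];
        			int target = next_best_node(node, used);

        			if (target == NO_NODE)
        				continue;
        			node_demotion[node] = target;
        			used[target] = 1;
        			next_pass[next_cnt++] = target;
        		}
        		for (int i = 0; i < next_cnt; i++)
        			this_pass[i] = next_pass[i];
        		this_cnt = next_cnt;
        	}

        	/* Lockless read, as next_demotion_node() does in the patch. */
        	for (int n = 0; n < NR_NODES; n++)
        		printf("node %d -> %d\n", n, READ_ONCE(node_demotion[n]));
        	return 0;
        }

Built with gcc (it uses the typeof and range-designated-initializer GNU
extensions, like the kernel code it imitates), this prints the chains
from the node_demotion[] comment above, 0 -> 1 -> 2 and 3 -> 4 -> 5,
with -1 for the terminal nodes.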