From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=YCYJ=LP=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-13.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	HTML_MESSAGE,INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,
	MSGID_FROM_MTA_HEADER,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 36D0BC4743C
	for <linux-mm@archiver.kernel.org>; Mon, 21 Jun 2021 14:50:28 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 8C46E60E0B
	for <linux-mm@archiver.kernel.org>; Mon, 21 Jun 2021 14:50:27 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8C46E60E0B
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=nvidia.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id EE89A6B006C; Mon, 21 Jun 2021 10:50:26 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id EBF4D6B0070; Mon, 21 Jun 2021 10:50:26 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id CC46C6B0072; Mon, 21 Jun 2021 10:50:26 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0171.hostedemail.com [216.40.44.171])
	by kanga.kvack.org (Postfix) with ESMTP id 893836B006C
	for <linux-mm@kvack.org>; Mon, 21 Jun 2021 10:50:26 -0400 (EDT)
Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 197F81263F
	for <linux-mm@kvack.org>; Mon, 21 Jun 2021 14:50:26 +0000 (UTC)
X-FDA: 78278016852.13.9FB1109
Received: from NAM10-MW2-obe.outbound.protection.outlook.com (mail-mw2nam10on2078.outbound.protection.outlook.com [40.107.94.78])
	by imf18.hostedemail.com (Postfix) with ESMTP id 5FB6120015E7
	for <linux-mm@kvack.org>; Mon, 21 Jun 2021 14:50:25 +0000 (UTC)
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=OKC/dFJUg1VK+msiIEuDpJGTEkJlDnaYAuSTdRG/zD4pmq1M/UlJ3H2LKP7dZu2rOs70peX96u1pjBmKaF6C0eAWjBQH3C76M6T/V2XhZ1Bn+qpROEe1p6ZxqKvg8ovQ7yMaVIZRwE7gB8z7l7rTrCFQBgTCpOCW47wpfbg4EL+fNdpp63CQmaeTR5X1V6RQ6idX6cBgEDOg1u4iKWsvGEiBIMgn545BO0CMtLhFnAdk8Z8VuiAj1bN52yPCFoQJu2oRAZcLdQ6nroAs6T7W62y13cEdRzfosEm5Q2+vH8XqINwXib3ibfpU2Qr688ny60hTqGPYnoRUgc6DfCnoTA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=YmGYS/kOe7O5n5yEK83G/ePg1q3FcobzrJeaJEFd7Y0=;
 b=NeM/dzCZHJaO81aM2ylmZSZiLnOjH8OfqvnACJdw9L6rgngFr1AQRnBMQhsw2i3bSio67E0WxQnMURDiK/0P2YpBMn/nkXF2UmxNw9tKjvrb5TeFIDPmR36QanEYXAvvTX6eiRqoVePF5NtrllmMb6zkxvLXutQsu8+Cq99NRgHbza/p1T0luXn8wSDc5ZVL3hXE0gWn7mCYGOX2vp5BtIf9k2f3mIbTZcPY3FJDTJea9JsHwdo2hsF9nDB82AQ9J3IL/poukIYDIJU+mhmsqAjoOMgbXsQG7o3elDUB/0TB8OIu7KJMduPLrJOUKGeKeUkl8BKviEx5Ifxga/FCsA==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass
 smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com;
 dkim=pass header.d=nvidia.com; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com;
 s=selector2;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=YmGYS/kOe7O5n5yEK83G/ePg1q3FcobzrJeaJEFd7Y0=;
 b=e0qkpJWO6EF2VqZCBhFmt75mJY42eC41geD0hudkYly533J3HNo+I/xC+NEkI6pbOz+vz3WNes4c+F8q06O0dHYh5ZgaBf5wT9ysRgSuaX40awNRfL3K9GTLtipoQgjRmjvu7uvvFRzWBT/SnT6BIkqC8RxAjpvHBYSgaE5DKYWvxaQg334uU7FdmqjGKiMuD37uQE2GH92zqcdWUX20Y1v67z01vcpaGXijCKT9sen+x3HeBTrowFjP68cyGbLahQSKKaGh41s6WOWhZkNixCTFD78CCF6EBu8GA5O4nYuwhVLEDUbZrKc4TGYqEG8I8BiYFVA+kbdi6iv4MAvZXg==
Received: from MN2PR12MB3823.namprd12.prod.outlook.com (2603:10b6:208:168::26)
 by BL0PR12MB4724.namprd12.prod.outlook.com (2603:10b6:208:87::23) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4242.21; Mon, 21 Jun
 2021 14:50:21 +0000
Received: from MN2PR12MB3823.namprd12.prod.outlook.com
 ([fe80::dcee:535c:30e:95f4]) by MN2PR12MB3823.namprd12.prod.outlook.com
 ([fe80::dcee:535c:30e:95f4%6]) with mapi id 15.20.4242.023; Mon, 21 Jun 2021
 14:50:21 +0000
From: Zi Yan <ziy@nvidia.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Yang Shi <shy828301@gmail.com>,
 Michal Hocko <mhocko@suse.com>, Wei Xu <weixugc@google.com>,
 David Rientjes <rientjes@google.com>,
 Dan Williams <dan.j.williams@intel.com>,
 David Hildenbrand <david@redhat.com>, osalvador <osalvador@suse.de>
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
Date: Mon, 21 Jun 2021 10:50:14 -0400
X-Mailer: MailMate (1.14r5812)
Message-ID: <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com>
In-Reply-To: <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20210618061537.434999-1-ying.huang@intel.com>
 <20210618061537.434999-3-ying.huang@intel.com>
 <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com>
 <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: multipart/signed;
 boundary="=_MailMate_B213593D-221A-421A-B151-2B8E71B15048_=";
 micalg=pgp-sha512; protocol="application/pgp-signature"
X-Originating-IP: [216.228.112.21]
X-ClientProxiedBy: BL1PR13CA0008.namprd13.prod.outlook.com
 (2603:10b6:208:256::13) To MN2PR12MB3823.namprd12.prod.outlook.com
 (2603:10b6:208:168::26)
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
Received: from [10.2.58.56] (216.228.112.21) by BL1PR13CA0008.namprd13.prod.outlook.com (2603:10b6:208:256::13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4264.7 via Frontend Transport; Mon, 21 Jun 2021 14:50:18 +0000
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: 05597fdc-4b24-472f-e501-08d934c3e4d2
X-MS-TrafficTypeDiagnostic: BL0PR12MB4724:
X-MS-Exchange-Transport-Forked: True
X-Microsoft-Antispam-PRVS:
	<BL0PR12MB4724B22DA378C8A59972BBF8C20A9@BL0PR12MB4724.namprd12.prod.outlook.com>
X-MS-Oob-TLC-OOBClassifiers: OLM:10000;
X-MS-Exchange-SenderADCheck: 1
X-Microsoft-Antispam: BCL:0;
X-OriginatorOrg: Nvidia.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 05597fdc-4b24-472f-e501-08d934c3e4d2
X-MS-Exchange-CrossTenant-AuthSource: MN2PR12MB3823.namprd12.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 21 Jun 2021 14:50:21.7186
 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a
X-MS-Exchange-CrossTenant-MailboxType: HOSTED
X-MS-Exchange-CrossTenant-UserPrincipalName: Ule+ORHCaoaeqBQB/GkSIkqH21ALpgOignOaphBU2LBMI2oyVwNKE/f9U2/7ekg5
X-MS-Exchange-Transport-CrossTenantHeadersStamped: BL0PR12MB4724
X-Rspamd-Server: rspam01
X-Rspamd-Queue-Id: 5FB6120015E7
Authentication-Results: imf18.hostedemail.com;
	dkim=pass header.d=Nvidia.com header.s=selector2 header.b=e0qkpJWO;
	spf=none (imf18.hostedemail.com: domain of ziy@nvidia.com has no SPF policy when checking 40.107.94.78) smtp.mailfrom=ziy@nvidia.com;
	dmarc=pass (policy=none) header.from=nvidia.com
X-Stat-Signature: he6arcs1uw55ba9chdkenpihcth6pyzi
X-HE-Tag: 1624287025-824446
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

--=_MailMate_B213593D-221A-421A-B151-2B8E71B15048_=
Content-Type: multipart/alternative;
 boundary="=_MailMate_EE747415-A08F-42A4-80A0-0ACA9A83928A_="


--=_MailMate_EE747415-A08F-42A4-80A0-0ACA9A83928A_=
Content-Type: text/plain; charset=UTF-8; markup=markdown
Content-Transfer-Encoding: quoted-printable

On 19 Jun 2021, at 4:18, Huang, Ying wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 18 Jun 2021, at 2:15, Huang Ying wrote:
>>
>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>
>>> When memory fills up on a node, memory contents can be
>>> automatically migrated to another node.  The biggest problems are
>>> knowing when to migrate and to where the migration should be
>>> targeted.
>>>
>>> The most straightforward way to generate the "to where" list would
>>> be to follow the page allocator fallback lists.  Those lists
>>> already tell us if memory is full where to look next.  It would
>>> also be logical to move memory in that order.
>>>
>>> But, the allocator fallback lists have a fatal flaw: most nodes
>>> appear in all the lists.  This would potentially lead to migration
>>> cycles (A->B, B->A, A->B, ...).
>>>
>>> Instead of using the allocator fallback lists directly, keep a
>>> separate node migration ordering.  But, reuse the same data used
>>> to generate page allocator fallback in the first place:
>>> find_next_best_node().
>>>
>>> This means that the firmware data used to populate node distances
>>> essentially dictates the ordering for now.  It should also be
>>> architecture-neutral since all NUMA architectures have a working
>>> find_next_best_node().
>>>
>>> The protocol for node_demotion[] access and writing is not
>>> standard.  It has no specific locking and is intended to be read
>>> locklessly.  Readers must take care to avoid observing changes
>>> that appear incoherent.  This was done so that node_demotion[]
>>> locking has no chance of becoming a bottleneck on large systems
>>> with lots of CPUs in direct reclaim.
>>>
>>> This code is unused for now.  It will be called later in the
>>> series.
>>>
>>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Reviewed-by: Yang Shi <shy828301@gmail.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Wei Xu <weixugc@google.com>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: osalvador <osalvador@suse.de>
>>>
>>> --
>>>
>>> Changes from 20200122:
>>>  * Add big node_demotion[] comment
>>> Changes from 20210302:
>>>  * Fix typo in node_demotion[] comment
>>> ---
>>>  mm/internal.h   |   5 ++
>>>  mm/migrate.c    | 175 ++++++++++++++++++++++++++++++++++++++++++++++=
+-
>>>  mm/page_alloc.c |   2 +-
>>>  3 files changed, 180 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 2f1182948aa6..0344cd78e170 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_lim=
its(unsigned long *start_pfn,
>>>
>>>  #ifdef CONFIG_NUMA
>>>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>>> +extern int find_next_best_node(int node, nodemask_t *used_node_mask)=
;
>>>  #else
>>>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask=
,
>>>  				unsigned int order)
>>>  {
>>>  	return NODE_RECLAIM_NOSCAN;
>>>  }
>>> +static inline int find_next_best_node(int node, nodemask_t *used_nod=
e_mask)
>>> +{
>>> +	return NUMA_NO_NODE;
>>> +}
>>>  #endif
>>>
>>>  extern int hwpoison_filter(struct page *p);
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6cab668132f9..111f8565f75d 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page,=
 struct page *newpage,
>>>  	return rc;
>>>  }
>>>
>>> +
>>> +/*
>>> + * node_demotion[] example:
>>> + *
>>> + * Consider a system with two sockets.  Each socket has
>>> + * three classes of memory attached: fast, medium and slow.
>>> + * Each memory class is placed in its own NUMA node.  The
>>> + * CPUs are placed in the node with the "fast" memory.  The
>>> + * 6 NUMA nodes (0-5) might be split among the sockets like
>>> + * this:
>>> + *
>>> + *	Socket A: 0, 1, 2
>>> + *	Socket B: 3, 4, 5
>>> + *
>>> + * When Node 0 fills up, its memory should be migrated to
>>> + * Node 1.  When Node 1 fills up, it should be migrated to
>>> + * Node 2.  The migration path starts on the nodes with the
>>> + * processors (since allocations default to this node) and
>>> + * fast memory, progress through medium and end with the
>>> + * slow memory:
>>> + *
>>> + *	0 -> 1 -> 2 -> stop
>>> + *	3 -> 4 -> 5 -> stop
>>> + *
>>> + * This is represented in the node_demotion[] like this:
>>> + *
>>> + *	{  1, // Node 0 migrates to 1
>>> + *	   2, // Node 1 migrates to 2
>>> + *	  -1, // Node 2 does not migrate
>>> + *	   4, // Node 3 migrates to 4
>>> + *	   5, // Node 4 migrates to 5
>>> + *	  -1} // Node 5 does not migrate
>>> + */
>>> +
>>> +/*
>>> + * Writes to this array occur without locking.  READ_ONCE()
>>> + * is recommended for readers to ensure consistent reads.
>>> + */
>>>  static int node_demotion[MAX_NUMNODES] __read_mostly =3D
>>>  	{[0 ...  MAX_NUMNODES - 1] =3D NUMA_NO_NODE};
>>>
>>> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_=
mostly =3D
>>>   */
>>>  int next_demotion_node(int node)
>>>  {
>>> -	return node_demotion[node];
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding
>>> +	 * this function from running.  READ_ONCE() avoids
>>> +	 * reading multiple, inconsistent 'node' values
>>> +	 * during an update.
>>> +	 */
>>> +	return READ_ONCE(node_demotion[node]);
>>>  }
>>
>> Is it necessary to have two separate patches to add node_demotion and
>> next_demotion_node() then modify it immediately? Maybe merge Patch 1 i=
nto 2?
>>
>> Hmm, I just checked Patch 3 and it changes node_demotion again and use=
s RCU.
>> I guess it might be much simpler to just introduce node_demotion with =
RCU
>> in this patch and Patch 3 only takes care of hotplug events.
>
> Hi, Dave,
>
> What do you think about this?
>
>>>
>>>  /*
>>> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma =
*migrate)
>>>  }
>>>  EXPORT_SYMBOL(migrate_vma_finalize);
>>>  #endif /* CONFIG_DEVICE_PRIVATE */
>>> +
>>> +/* Disable reclaim-based migration. */
>>> +static void disable_all_migrate_targets(void)
>>> +{
>>> +	int node;
>>> +
>>> +	for_each_online_node(node)
>>> +		node_demotion[node] =3D NUMA_NO_NODE;
>>> +}
>>> +
>>> +/*
>>> + * Find an automatic demotion target for 'node'.
>>> + * Failing here is OK.  It might just indicate
>>> + * being at the end of a chain.
>>> + */
>>> +static int establish_migrate_target(int node, nodemask_t *used)
>>> +{
>>> +	int migration_target;
>>> +
>>> +	/*
>>> +	 * Can not set a migration target on a
>>> +	 * node with it already set.
>>> +	 *
>>> +	 * No need for READ_ONCE() here since this
>>> +	 * is in the write path for node_demotion[].
>>> +	 * This should be the only thread writing.
>>> +	 */
>>> +	if (node_demotion[node] !=3D NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	migration_target =3D find_next_best_node(node, used);
>>> +	if (migration_target =3D=3D NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	node_demotion[node] =3D migration_target;
>>> +
>>> +	return migration_target;
>>> +}
>>> +
>>> +/*
>>> + * When memory fills up on a node, memory contents can be
>>> + * automatically migrated to another node instead of
>>> + * discarded at reclaim.
>>> + *
>>> + * Establish a "migration path" which will start at nodes
>>> + * with CPUs and will follow the priorities used to build the
>>> + * page allocator zonelists.
>>> + *
>>> + * The difference here is that cycles must be avoided.  If
>>> + * node0 migrates to node1, then neither node1, nor anything
>>> + * node1 migrates to can migrate to node0.
>>> + *
>>> + * This function can run simultaneously with readers of
>>> + * node_demotion[].  However, it can not run simultaneously
>>> + * with itself.  Exclusion is provided by memory hotplug events
>>> + * being single-threaded.
>>> + */
>>> +static void __set_migration_target_nodes(void)
>>> +{
>>> +	nodemask_t next_pass	=3D NODE_MASK_NONE;
>>> +	nodemask_t this_pass	=3D NODE_MASK_NONE;
>>> +	nodemask_t used_targets =3D NODE_MASK_NONE;
>>> +	int node;
>>> +
>>> +	/*
>>> +	 * Avoid any oddities like cycles that could occur
>>> +	 * from changes in the topology.  This will leave
>>> +	 * a momentary gap when migration is disabled.
>>> +	 */
>>> +	disable_all_migrate_targets();
>>> +
>>> +	/*
>>> +	 * Ensure that the "disable" is visible across the system.
>>> +	 * Readers will see either a combination of before+disable
>>> +	 * state or disable+after.  They will never see before and
>>> +	 * after state together.
>>> +	 *
>>> +	 * The before+after state together might have cycles and
>>> +	 * could cause readers to do things like loop until this
>>> +	 * function finishes.  This ensures they can only see a
>>> +	 * single "bad" read and would, for instance, only loop
>>> +	 * once.
>>> +	 */
>>> +	smp_wmb();
>>> +
>>> +	/*
>>> +	 * Allocations go close to CPUs, first.  Assume that
>>> +	 * the migration path starts at the nodes with CPUs.
>>> +	 */
>>> +	next_pass =3D node_states[N_CPU];
>>
>> Is there a plan of allowing user to change where the migration
>> path starts? Or maybe one step further providing an interface
>> to allow user to specify the demotion path. Something like
>> /sys/devices/system/node/node*/node_demotion.
>
> I don't think that's necessary at least for now.  Do you know any real
> world use case for this?

On our P9+Volta system, GPU memory is exposed as a NUMA node.
For GPU workloads whose data size exceeds the GPU memory size,
it would be very helpful to allow pages in GPU memory to be demoted
to CPU memory. With your current assumption that the migration path
starts at nodes with CPUs, GPU memory -> CPU memory demotion does not
seem possible, right? The same applies to any system that exposes
device memory as a NUMA node and runs workloads on the device, using
CPU memory as a lower tier than the device memory.
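
To make the concern concrete, here is a rough user-space model of how
I read the path construction. It is only a sketch, not the kernel
code: the pass loop is not part of the hunk quoted above, so I am
assuming a greedy loop seeded from the CPU nodes in which every node
that has acted as a source is excluded from later target selection.
The three-node topology, the distance table, and the names has_cpu,
next_best() and build_paths() are all made up for illustration.

```c
#define NR_NODES 3
#define NO_NODE (-1)

/* Hypothetical topology: nodes 0 and 1 have CPUs, node 2 is GPU memory. */
static const int has_cpu[NR_NODES] = { 1, 1, 0 };

/* Made-up distances; the GPU node is equally far from both CPU nodes. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40 },
	{ 20, 10, 40 },
	{ 40, 40, 10 },
};

static int demotion[NR_NODES];

/* Nearest node not yet used; a stand-in for find_next_best_node(). */
static int next_best(int node, int *used)
{
	int best = NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n] || n == node)
			continue;
		if (best == NO_NODE || distance[node][n] < distance[node][best])
			best = n;
	}
	if (best != NO_NODE)
		used[best] = 1;
	return best;
}

/* Seed from CPU nodes, then extend each path one hop per pass. */
static void build_paths(void)
{
	int used[NR_NODES] = { 0 };
	int this_pass[NR_NODES];
	int next_pass[NR_NODES];
	int pending = 0;

	for (int n = 0; n < NR_NODES; n++) {
		demotion[n] = NO_NODE;
		next_pass[n] = has_cpu[n];
		pending |= has_cpu[n];
	}

	while (pending) {
		pending = 0;
		/* Sources of this pass may not become targets later (assumed). */
		for (int n = 0; n < NR_NODES; n++) {
			this_pass[n] = next_pass[n];
			next_pass[n] = 0;
			if (this_pass[n])
				used[n] = 1;
		}
		for (int n = 0; n < NR_NODES; n++) {
			int target;

			if (!this_pass[n] || demotion[n] != NO_NODE)
				continue;
			target = next_best(n, used);
			if (target == NO_NODE)
				continue;
			demotion[n] = target;
			next_pass[target] = 1;
			pending = 1;
		}
	}
}
```

With this table, calling build_paths() leaves demotion[] as
{ 2, -1, -1 }: CPU node 0 demotes into the GPU node, and the GPU node
itself has nowhere to go, because both CPU nodes were already consumed
as sources in the first pass. That is the opposite of what a GPU
workload would want.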


=E2=80=94
Best Regards,
Yan, Zi

--=_MailMate_EE747415-A08F-42A4-80A0-0ACA9A83928A_=
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/xhtml; charset=3Dutf-8"=
>
</head>
<body>
<div style=3D"font-family:sans-serif"><div style=3D"white-space:normal">
<p dir=3D"auto">On 19 Jun 2021, at 4:18, Huang, Ying wrote:</p>
<blockquote style=3D"border-left:2px solid #777; color:#777; margin:0 0 5=
px; padding-left:5px">
<p dir=3D"auto">Zi Yan <a href=3D"mailto:ziy@nvidia.com" style=3D"color:#=
777">ziy@nvidia.com</a> writes:</p>
<blockquote style=3D"border-left:2px solid #777; color:#999; margin:0 0 5=
px; padding-left:5px; border-left-color:#999">
<p dir=3D"auto">On 18 Jun 2021, at 2:15, Huang Ying wrote:</p>
<blockquote style=3D"border-left:2px solid #777; color:#BBB; margin:0 0 5=
px; padding-left:5px; border-left-color:#BBB">
<p dir=3D"auto">From: Dave Hansen <a href=3D"mailto:dave.hansen@linux.int=
el.com" style=3D"color:#BBB">dave.hansen@linux.intel.com</a></p>
<p dir=3D"auto">When memory fills up on a node, memory contents can be<br=
>
automatically migrated to another node.  The biggest problems are<br>
knowing when to migrate and to where the migration should be<br>
targeted.</p>
<p dir=3D"auto">The most straightforward way to generate the "to where" l=
ist would<br>
be to follow the page allocator fallback lists.  Those lists<br>
already tell us if memory is full where to look next.  It would<br>
also be logical to move memory in that order.</p>
<p dir=3D"auto">But, the allocator fallback lists have a fatal flaw: most=
 nodes<br>
appear in all the lists.  This would potentially lead to migration<br>
cycles (A-&gt;B, B-&gt;A, A-&gt;B, ...).</p>
<p dir=3D"auto">Instead of using the allocator fallback lists directly, k=
eep a<br>
separate node migration ordering.  But, reuse the same data used<br>
to generate page allocator fallback in the first place:<br>
find_next_best_node().</p>
<p dir=3D"auto">This means that the firmware data used to populate node d=
istances<br>
essentially dictates the ordering for now.  It should also be<br>
architecture-neutral since all NUMA architectures have a working<br>
find_next_best_node().</p>
<p dir=3D"auto">The protocol for node_demotion[] access and writing is no=
t<br>
standard.  It has no specific locking and is intended to be read<br>
locklessly.  Readers must take care to avoid observing changes<br>
that appear incoherent.  This was done so that node_demotion[]<br>
locking has no chance of becoming a bottleneck on large systems<br>
with lots of CPUs in direct reclaim.</p>
<p dir=3D"auto">This code is unused for now.  It will be called later in =
the<br>
series.</p>
<p dir=3D"auto">Signed-off-by: Dave Hansen <a href=3D"mailto:dave.hansen@=
linux.intel.com" style=3D"color:#BBB">dave.hansen@linux.intel.com</a><br>=

Signed-off-by: "Huang, Ying" <a href=3D"mailto:ying.huang@intel.com" styl=
e=3D"color:#BBB">ying.huang@intel.com</a><br>
Reviewed-by: Yang Shi <a href=3D"mailto:shy828301@gmail.com" style=3D"col=
or:#BBB">shy828301@gmail.com</a><br>
Cc: Michal Hocko <a href=3D"mailto:mhocko@suse.com" style=3D"color:#BBB">=
mhocko@suse.com</a><br>
Cc: Wei Xu <a href=3D"mailto:weixugc@google.com" style=3D"color:#BBB">wei=
xugc@google.com</a><br>
Cc: David Rientjes <a href=3D"mailto:rientjes@google.com" style=3D"color:=
#BBB">rientjes@google.com</a><br>
Cc: Dan Williams <a href=3D"mailto:dan.j.williams@intel.com" style=3D"col=
or:#BBB">dan.j.williams@intel.com</a><br>
Cc: David Hildenbrand <a href=3D"mailto:david@redhat.com" style=3D"color:=
#BBB">david@redhat.com</a><br>
Cc: osalvador <a href=3D"mailto:osalvador@suse.de" style=3D"color:#BBB">o=
salvador@suse.de</a></p>
<p dir=3D"auto">--</p>
<p dir=3D"auto">Changes from 20200122:</p>
<ul>
<li>Add big node_demotion[] comment</li>
</ul>
<p dir=3D"auto">Changes from 20210302:</p>
<ul>
<li>Fix typo in node_demotion[] comment</li>
</ul>
<hr style=3D"background:#333; background-image:linear-gradient(to right, =
#ccc, #333, #ccc); border:0; height:1px" height=3D"1">
<p dir=3D"auto">mm/internal.h   |   5 ++<br>
mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-<br=
>
mm/page_alloc.c |   2 +-<br>
3 files changed, 180 insertions(+), 2 deletions(-)</p>
<p dir=3D"auto">diff --git a/mm/internal.h b/mm/internal.h<br>
index 2f1182948aa6..0344cd78e170 100644<br>
--- a/mm/internal.h<br>
+++ b/mm/internal.h<br>
@@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(=
unsigned long *start_pfn,</p>
<p dir=3D"auto">#ifdef CONFIG_NUMA<br>
extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);<br>
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);<br=
>
#else<br>
static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,<br>=

unsigned int order)<br>
{<br>
return NODE_RECLAIM_NOSCAN;<br>
}<br>
+static inline int find_next_best_node(int node, nodemask_t *used_node_ma=
sk)<br>
+{</p>
<ul>
<li>return NUMA_NO_NODE;</li>
</ul>
<p dir=3D"auto">+}<br>
#endif</p>
<p dir=3D"auto">extern int hwpoison_filter(struct page *p);<br>
diff --git a/mm/migrate.c b/mm/migrate.c<br>
index 6cab668132f9..111f8565f75d 100644<br>
--- a/mm/migrate.c<br>
+++ b/mm/migrate.c<br>
@@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, str=
uct page *newpage,<br>
return rc;<br>
}</p>
<ul>
<li>
</ul>
<p dir=3D"auto">+/*</p>
<ul>
<li>
<ul>
<li>node_demotion[] example:</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>Consider a system with two sockets.  Each socket has</li>
</ul>
</li>
<li>
<ul>
<li>three classes of memory attached: fast, medium and slow.</li>
</ul>
</li>
<li>
<ul>
<li>Each memory class is placed in its own NUMA node.  The</li>
</ul>
</li>
<li>
<ul>
<li>CPUs are placed in the node with the "fast" memory.  The</li>
</ul>
</li>
<li>
<ul>
<li>6 NUMA nodes (0-5) might be split among the sockets like</li>
</ul>
</li>
<li>
<ul>
<li>this:</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>Socket A: 0, 1, 2</li>
</ul>
</li>
<li>
<ul>
<li>Socket B: 3, 4, 5</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>When Node 0 fills up, its memory should be migrated to</li>
</ul>
</li>
<li>
<ul>
<li>Node 1.  When Node 1 fills up, it should be migrated to</li>
</ul>
</li>
<li>
<ul>
<li>Node 2.  The migration path starts on the nodes with the</li>
</ul>
</li>
<li>
<ul>
<li>processors (since allocations default to this node) and</li>
</ul>
</li>
<li>
<ul>
<li>fast memory, progress through medium and end with the</li>
</ul>
</li>
<li>
<ul>
<li>slow memory:</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>0 -&gt; 1 -&gt; 2 -&gt; stop</li>
</ul>
</li>
<li>
<ul>
<li>3 -&gt; 4 -&gt; 5 -&gt; stop</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>This is represented in the node_demotion[] like this:</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>{  1, // Node 0 migrates to 1</li>
</ul>
</li>
<li>
<ul>
<li>2, // Node 1 migrates to 2</li>
</ul>
</li>
<li>
<ul>
<li>-1, // Node 2 does not migrate</li>
</ul>
</li>
<li>
<ul>
<li>4, // Node 3 migrates to 4</li>
</ul>
</li>
<li>
<ul>
<li>5, // Node 4 migrates to 5</li>
</ul>
</li>
<li>
<ul>
<li>-1} // Node 5 does not migrate</li>
</ul>
</li>
<li>*/</li>
<li>
</ul>
<p dir=3D"auto">+/*</p>
<ul>
<li>
<ul>
<li>Writes to this array occur without locking.  READ_ONCE()</li>
</ul>
</li>
<li>
<ul>
<li>is recommended for readers to ensure consistent reads.</li>
</ul>
</li>
<li>*/</li>
</ul>
<p dir=3D"auto">static int node_demotion[MAX_NUMNODES] __read_mostly =3D<=
br>
{[0 ...  MAX_NUMNODES - 1] =3D NUMA_NO_NODE};</p>
<p dir=3D"auto">@@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNO=
DES] __read_mostly =3D<br>
*/<br>
int next_demotion_node(int node)<br>
{</p>
<ul>
<li>return node_demotion[node];</li>
</ul>
<ul>
<li>/*</li>
<li>
<ul>
<li>node_demotion[] is updated without excluding</li>
</ul>
</li>
<li>
<ul>
<li>this function from running.  READ_ONCE() avoids</li>
</ul>
</li>
<li>
<ul>
<li>reading multiple, inconsistent 'node' values</li>
</ul>
</li>
<li>
<ul>
<li>during an update.</li>
</ul>
</li>
<li>*/</li>
<li>return READ_ONCE(node_demotion[node]);</li>
</ul>
<p dir=3D"auto">}</p>
</blockquote>
<p dir=3D"auto">Is it necessary to have two separate patches to add node_=
demotion and<br>
next_demotion_node() then modify it immediately? Maybe merge Patch 1 into=
 2?</p>
<p dir=3D"auto">Hmm, I just checked Patch 3 and it changes node_demotion =
again and uses RCU.<br>
I guess it might be much simpler to just introduce node_demotion with RCU=
<br>
in this patch and Patch 3 only takes care of hotplug events.</p>
</blockquote>
<p dir=3D"auto">Hi, Dave,</p>
<p dir=3D"auto">What do you think about this?</p>
<blockquote style=3D"border-left:2px solid #777; color:#999; margin:0 0 5=
px; padding-left:5px; border-left-color:#999">
<blockquote style=3D"border-left:2px solid #777; color:#BBB; margin:0 0 5=
px; padding-left:5px; border-left-color:#BBB">
<p dir=3D"auto">/*<br>
@@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma <em>=
migrate)<br>
}<br>
EXPORT_SYMBOL(migrate_vma_finalize);<br>
#endif /</em> CONFIG_DEVICE_PRIVATE <em>/<br>
+<br>
+/</em> Disable reclaim-based migration. */<br>
+static void disable_all_migrate_targets(void)<br>
+{</p>
<ul>
<li>int node;</li>
<li>
<li>for_each_online_node(node)</li>
<li>
<pre style=3D"background-color:#F7F7F7; border-radius:5px 5px 5px 5px; ma=
rgin-left:15px; margin-right:15px; max-width:90vw; overflow-x:auto; paddi=
ng:5px" bgcolor=3D"#F7F7F7"><code style=3D"background-color:#F7F7F7; bord=
er-radius:3px; margin:0; padding:0" bgcolor=3D"#F7F7F7">  node_demotion[n=
ode] =3D NUMA_NO_NODE;
</code></pre>
</li>
</ul>
<p dir=3D"auto">+}<br>
+<br>
+/*</p>
<ul>
<li>
<ul>
<li>Find an automatic demotion target for 'node'.</li>
</ul>
</li>
<li>
<ul>
<li>Failing here is OK.  It might just indicate</li>
</ul>
</li>
<li>
<ul>
<li>being at the end of a chain.</li>
</ul>
</li>
<li>*/</li>
</ul>
<p dir=3D"auto">+static int establish_migrate_target(int node, nodemask_t=
 *used)<br>
+{</p>
<ul>
<li>int migration_target;</li>
<li>
<li>/*</li>
<li>
<ul>
<li>Can not set a migration target on a</li>
</ul>
</li>
<li>
<ul>
<li>node with it already set.</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>No need for READ_ONCE() here since this</li>
</ul>
</li>
<li>
<ul>
<li>is in the write path for node_demotion[].</li>
</ul>
</li>
<li>
<ul>
<li>This should be the only thread writing.</li>
</ul>
</li>
<li>*/</li>
<li>if (node_demotion[node] !=3D NUMA_NO_NODE)</li>
<li>
<pre style=3D"background-color:#F7F7F7; border-radius:5px 5px 5px 5px; ma=
rgin-left:15px; margin-right:15px; max-width:90vw; overflow-x:auto; paddi=
ng:5px" bgcolor=3D"#F7F7F7"><code style=3D"background-color:#F7F7F7; bord=
er-radius:3px; margin:0; padding:0" bgcolor=3D"#F7F7F7">  return NUMA_NO_=
NODE;
</code></pre>
</li>
<li>
<li>migration_target =3D find_next_best_node(node, used);</li>
<li>if (migration_target =3D=3D NUMA_NO_NODE)</li>
<li>
<pre style=3D"background-color:#F7F7F7; border-radius:5px 5px 5px 5px; ma=
rgin-left:15px; margin-right:15px; max-width:90vw; overflow-x:auto; paddi=
ng:5px" bgcolor=3D"#F7F7F7"><code style=3D"background-color:#F7F7F7; bord=
er-radius:3px; margin:0; padding:0" bgcolor=3D"#F7F7F7">  return NUMA_NO_=
NODE;
</code></pre>
</li>
<li>
<li>node_demotion[node] =3D migration_target;</li>
<li>
<li>return migration_target;</li>
</ul>
<p dir=3D"auto">+}<br>
+<br>
+/*</p>
<ul>
<li>
<ul>
<li>When memory fills up on a node, memory contents can be</li>
</ul>
</li>
<li>
<ul>
<li>automatically migrated to another node instead of</li>
</ul>
</li>
<li>
<ul>
<li>discarded at reclaim.</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>Establish a "migration path" which will start at nodes</li>
</ul>
</li>
<li>
<ul>
<li>with CPUs and will follow the priorities used to build the</li>
</ul>
</li>
<li>
<ul>
<li>page allocator zonelists.</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>The difference here is that cycles must be avoided.  If</li>
</ul>
</li>
<li>
<ul>
<li>node0 migrates to node1, then neither node1, nor anything</li>
</ul>
</li>
<li>
<ul>
<li>node1 migrates to can migrate to node0.</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>This function can run simultaneously with readers of</li>
</ul>
</li>
<li>
<ul>
<li>node_demotion[].  However, it can not run simultaneously</li>
</ul>
</li>
<li>
<ul>
<li>with itself.  Exclusion is provided by memory hotplug events</li>
</ul>
</li>
<li>
<ul>
<li>being single-threaded.</li>
</ul>
</li>
<li>*/</li>
</ul>
<p dir=3D"auto">+static void __set_migration_target_nodes(void)<br>
+{</p>
<ul>
<li>nodemask_t next_pass	=3D NODE_MASK_NONE;</li>
<li>nodemask_t this_pass	=3D NODE_MASK_NONE;</li>
<li>nodemask_t used_targets =3D NODE_MASK_NONE;</li>
<li>int node;</li>
<li>
<li>/*</li>
<li>
<ul>
<li>Avoid any oddities like cycles that could occur</li>
</ul>
</li>
<li>
<ul>
<li>from changes in the topology.  This will leave</li>
</ul>
</li>
<li>
<ul>
<li>a momentary gap when migration is disabled.</li>
</ul>
</li>
<li>*/</li>
<li>disable_all_migrate_targets();</li>
<li>
<li>/*</li>
<li>
<ul>
<li>Ensure that the "disable" is visible across the system.</li>
</ul>
</li>
<li>
<ul>
<li>Readers will see either a combination of before+disable</li>
</ul>
</li>
<li>
<ul>
<li>state or disable+after.  They will never see before and</li>
</ul>
</li>
<li>
<ul>
<li>after state together.</li>
</ul>
</li>
<li>
<ul>
<li>
</ul>
</li>
<li>
<ul>
<li>The before+after state together might have cycles and</li>
</ul>
</li>
<li>
<ul>
<li>could cause readers to do things like loop until this</li>
</ul>
</li>
<li>
<ul>
<li>function finishes.  This ensures they can only see a</li>
</ul>
</li>
<li>
<ul>
<li>single "bad" read and would, for instance, only loop</li>
</ul>
</li>
<li>
<ul>
<li>once.</li>
</ul>
</li>
<li>*/</li>
<li>smp_wmb();</li>
<li>
<li>/*</li>
<li>
<ul>
<li>Allocations go close to CPUs, first.  Assume that</li>
</ul>
</li>
<li>
<ul>
<li>the migration path starts at the nodes with CPUs.</li>
</ul>
</li>
<li>*/</li>
<li>next_pass =3D node_states[N_CPU];</li>
</ul>
</blockquote>
<p dir=3D"auto">Is there a plan to let users change where the migration<br>
path starts? Or, going one step further, to provide an interface<br>
that lets users specify the demotion path, something like<br>
/sys/devices/system/node/node*/node_demotion?</p>
</blockquote>
<p dir=3D"auto">I don't think that's necessary at least for now.  Do you =
know any real<br>
world use case for this?</p>
</blockquote>
<p dir=3D"auto">In our P9+volta system, GPU memory is exposed as a NUMA n=
ode.<br>
For GPU workloads whose data size exceeds the GPU memory size,<br>
it would be very helpful to allow pages in GPU memory to be<br>
migrated/demoted to CPU memory. With your current assumption,<br>
GPU memory -&gt; CPU memory demotion does not seem possible, right?<br>
The same applies to any system that exposes device memory as a<br>
NUMA node, where workloads run on the device and use CPU memory<br>
as a lower memory tier than the device memory.</p>
<p dir=3D"auto">=E2=80=94<br>
Best Regards,<br>
Yan, Zi</p>

</div></div>
</body>
</html>

--=_MailMate_EE747415-A08F-42A4-80A0-0ACA9A83928A_=--

--=_MailMate_B213593D-221A-421A-B151-2B8E71B15048_=
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQJDBAEBCgAtFiEEh7yFAW3gwjwQ4C9anbJR82th+ooFAmDQpycPHHppeUBudmlk
aWEuY29tAAoJEJ2yUfNrYfqKXBwQAInh/fb9llQpNNIsGWQ5OWXGNw7Om2QVNRUX
ftMx0eQWXCNo3jqP7Fn5kNHeop6vlIILVooCu+Q6AynoPHIEsEMuS4McFNY+s+zT
NKeAY+3FsdWWMJKr8n1SPAV0XZeVuQP2cyDeJGA6R8bP51+oNbt4WdV0NJqr1Vvw
bA8eLja4JVqqTbrcpKFttMKo+kqU8ooFfXvxbespFfnxofokXMV+os2fleVfbDHH
Wc6vmyD4hJ/FuZw5oONq1i4e7DQ9scnUYbj/zy2JLcoLSbkbWzjo5DiuyWs2+G3p
wm8y/5M1TgbJ1q5BUmwE7FtjVT5U2Kt6vbdKsy9TIIdiJ7nbcSRlhCqXe6f20CWH
4SN4ocGXWNLelAcGpFbsmR5Un8D9dseOKIs0MkdztlfG6rrLq6xGUg/LcSD4z9vb
GaWCzZm2/NMN7SjRRbrgmJXHeyk/1f4z9MvyaR54MjbS4QG7RQ+SaK8Kb2w57ay3
a7h5METrg47rpJMsAip80sp40JLeM92ns3+gpH0QgbUNL8fM5xqqSLJnlS6MgPqp
qOX8Y9Fl5HCYVKaGkpJXJARBAWzlX2DzP3cNdooTenNjOw0SlbnjLqZo4e4gCAjc
cNUhOhl4ACoyWZupeLmc6I7WmKZxeYWEdhimhxEUz4LIGx7GAVG7KwhIiQ7b7/ld
H+4gp2VS
=5Sfx
-----END PGP SIGNATURE-----

--=_MailMate_B213593D-221A-421A-B151-2B8E71B15048_=--