Subject: Re: RFC: Memory Tiering Kernel Interfaces
From: "ying.huang@intel.com"
To: Dan Williams, Yang Shi
Cc: Wei Xu, Andrew Morton, Dave Hansen, Linux MM, Greg Thelen,
    "Aneesh Kumar K.V", Jagdish Gediya, Linux Kernel Mailing List,
    Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
    Brice Goglin, Feng Tang, Jonathan Cameron
Date: Sat, 07 May 2022 15:56:12 +0800

Hi, Dan,

On Sun, 2022-05-01 at 11:35 -0700, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
> >
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed. The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier. But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.
> > >   This can make the tiering hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the use cases
> > IMHO.
>
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the performance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.
>

Does this mean that we will eventually need some kind of in-kernel
memory latency measurement mechanism to determine the tier of a memory
device?

Best Regards,
Huang, Ying

> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.
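
For readers following the thread, the tier-building behaviour described in the quoted RFC text (CPU nodes go into the top tier, then each tier's demotion targets are chosen by distance and form the next tier) can be sketched in user space. This is a simplified illustration under stated assumptions, not the kernel's code: `build_tiers`, the `distance` matrix and the `has_cpu` flags are hypothetical names chosen for this example, and each node is given a single nearest demotion target.

```python
def build_tiers(distance, has_cpu):
    """Assign each NUMA node a tier index (0 = top tier).

    distance: square matrix, distance[a][b] is the NUMA distance from a to b.
    has_cpu:  list of booleans, True if the node has CPUs.
    Returns (tier, demotion_target), both dicts keyed by node id.
    Simplified sketch: every node gets exactly one demotion target.
    """
    n = len(distance)
    # All nodes with CPUs start in the top tier, as the RFC text describes.
    tier = {node: 0 for node in range(n) if has_cpu[node]}
    demotion_target = {}
    current = sorted(tier)
    level = 0
    while current and len(tier) < n:
        unassigned = [node for node in range(n) if node not in tier]
        next_tier = set()
        for src in current:
            # The nearest node that has no tier yet becomes the demotion target.
            best = min(unassigned, key=lambda dst: distance[src][dst])
            demotion_target[src] = best
            next_tier.add(best)
        level += 1
        for node in next_tier:
            tier[node] = level
        current = sorted(next_tier)
    return tier, demotion_target


# Hypothetical two-socket machine: nodes 0 and 1 have CPUs,
# nodes 2 and 3 are memory-only.
distance = [
    [10, 20, 17, 28],
    [20, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
tier, target = build_tiers(distance, [True, True, False, False])
```

In this sketch a memory-only node can never land in tier 0, whatever its actual performance, which is the limitation the first two use cases in the quoted text point at. Flipping one `has_cpu` flag also reshuffles the result, which mirrors the stability concern in the third.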