From: Ying Huang <ying.huang@intel.com>
To: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
Date: Wed, 08 Jun 2022 14:18:16 +0800
Message-ID: <9f9506fbf35c4ec4b253d53914973be6e1f5a893.camel@intel.com>
In-Reply-To: <2b4f053b-de25-986c-f764-5cc6a28f4953@linux.ibm.com>

On Wed, 2022-06-08 at 10:28 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 3:02 AM, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > > 
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based
> > > on the distances between nodes.
> > > 
> > > The current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > > 
> > > The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > > 
> > > The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM or GPU devices, the
> > > memory-only NUMA nodes mapping these devices should be in the
> > > top tier, and DRAM nodes with CPUs are better placed in the
> > > next lower tier.
> > > 
> > > With the current kernel, a higher-tier node can only be demoted to
> > > selected nodes on the next lower tier as defined by the demotion
> > > path, not to any other node in any lower tier.  This strict,
> > > hard-coded demotion order does not work in all use cases (e.g. some
> > > use cases may want to allow cross-socket demotion to another node in
> > > the same demotion tier as a fallback when the preferred demotion
> > > node is out of space).  This demotion order is also inconsistent
> > > with the page allocation fallback order when all the nodes in a
> > > higher tier are out of space: page allocation can fall back to any
> > > node in any lower tier, whereas the demotion order doesn't allow that.
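A minimal user-space sketch of the fallback behaviour described above, under stated assumptions: tier index 0 is the top tier (as with MEMORY_TIER_HBM_GPU in this patch), and node distance is ignored, although a real implementation would also weigh it. The demotion target is the first node with free space in the nearest lower tier, and any node in any lower tier may serve as a fallback. `struct node_model` and `pick_demotion_target()` are hypothetical names, not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical node model: tier index (0 = top tier) and free pages. */
struct node_model {
	int tier;
	long free_pages;
};

/*
 * Scan the tiers below @src_tier, nearest first, and return the index of
 * the first node with free pages; return -1 if every lower tier is full.
 * Any node in any lower tier may serve as a fallback, as with page
 * allocation.
 */
static int pick_demotion_target(const struct node_model *nodes, size_t n,
				int src_tier, int max_tiers)
{
	for (int t = src_tier + 1; t < max_tiers; t++)
		for (size_t i = 0; i < n; i++)
			if (nodes[i].tier == t && nodes[i].free_pages > 0)
				return (int)i;
	return -1;
}
```

With one full and one non-full node in the next tier, the non-full one is picked; once the whole next tier is full, the search continues into the tier below instead of failing.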
> > > 
> > > The current kernel also doesn't provide any interfaces for
> > > userspace to learn about the memory tier hierarchy in order to
> > > optimize its memory allocations.
> > > 
> > > This patch series addresses the above by defining memory tiers explicitly.
> > > 
> > > This patch introduces explicit memory tiers with ranks. The rank
> > > value of a memory tier is used to derive the demotion order between
> > > NUMA nodes. The memory tiers present in a system can be found at
> > > 
> > > /sys/devices/system/memtier/memtierN/
> > > 
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
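A small sketch for consuming that file from user space. It assumes nodelist uses the kernel's usual range-list format (e.g. "0-1,4", as in /sys/devices/system/node/online). `parse_nodelist()` is a hypothetical helper, and it only handles node IDs 0..63.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a range list such as "0-1,4\n" into a bitmask
 * of node IDs. Only nodes 0..63 are supported. Returns 0 on success and
 * -1 on malformed input (in which case *mask is left unchanged).
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi;

		if (end == s || lo < 0 || lo > 63)
			return -1;
		hi = lo;
		s = end;
		if (*s == '-') {	/* range "lo-hi" */
			hi = strtol(s + 1, &end, 10);
			if (end == s + 1 || hi < lo || hi > 63)
				return -1;
			s = end;
		}
		for (long n = lo; n <= hi; n++)
			m |= UINT64_C(1) << n;
		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	*mask = m;
	return 0;
}
```

For a tier containing nodes 0, 1 and 4, the file content "0-1,4\n" yields the mask 0x13.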
> > > 
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > > 
> > > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier1 -> memtier2, where memtier0 is the highest tier
> > > and memtier2 is the lowest tier.
> > > 
> > > The rank value of each memtier should be unique.
> > > 
> > > A higher-rank memory tier appears earlier in the demotion order
> > > than a lower-rank memory tier, i.e. during reclaim, a node in a
> > > higher-rank memory tier is chosen as the demotion target in
> > > preference to a node in a lower-rank memory tier.
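The rank comparison can be sketched in user space as follows. `struct memtier_model`, `cmp_rank_desc()` and `order_tiers()` are hypothetical names, not the kernel implementation. Tiers are sorted by descending rank to get the highest-to-lowest order, and a duplicate rank is rejected because each memtier's rank should be unique.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model of a memory tier, not the kernel's struct. */
struct memtier_model {
	int id;		/* N in memtierN */
	int rank;	/* opaque; only comparisons between tiers matter */
};

/* qsort() comparator: a higher rank sorts first (i.e. is a higher tier). */
static int cmp_rank_desc(const void *a, const void *b)
{
	const struct memtier_model *x = a, *y = b;

	return (y->rank > x->rank) - (y->rank < x->rank);
}

/*
 * Sort @tiers from highest to lowest tier. Returns 0, or -1 if two tiers
 * share a rank, since the rank of each memtier must be unique.
 */
static int order_tiers(struct memtier_model *tiers, size_t n)
{
	qsort(tiers, n, sizeof(*tiers), cmp_rank_desc);
	for (size_t i = 1; i < n; i++)
		if (tiers[i].rank == tiers[i - 1].rank)
			return -1;
	return 0;
}
```

Applied to the ranks 300, 200 and 100 from the example, this yields the order memtier0, memtier1, memtier2.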
> > > 
> > > For now, we are not adding support for a dynamic number of memory
> > > tiers, but a future series could add it. Currently, the number of
> > > supported tiers is limited to MAX_MEMORY_TIERS (3). On memory
> > > hotplug, a NUMA node that has not been added to a memory tier is
> > > placed in DEFAULT_MEMORY_TIER (1).
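A minimal sketch of that placement rule, reusing the tier constants that this patch adds in include/linux/memory-tiers.h. The per-node table and `model_assign_tier()` are hypothetical, and a negative `requested` value stands for "no tier chosen".

```c
/* Tier constants as defined in include/linux/memory-tiers.h by this patch. */
#define MEMORY_TIER_HBM_GPU    0
#define MEMORY_TIER_DRAM       1
#define MEMORY_TIER_PMEM       2
#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
#define MAX_MEMORY_TIERS       3

/*
 * Hypothetical hotplug helper: place @node in @requested, or in
 * DEFAULT_MEMORY_TIER when @requested is negative (no tier chosen).
 * Returns the tier used, or -1 (table untouched) if it is out of range.
 */
static int model_assign_tier(int *node_tier, int node, int requested)
{
	int tier = requested < 0 ? DEFAULT_MEMORY_TIER : requested;

	if (tier >= MAX_MEMORY_TIERS)
		return -1;
	node_tier[node] = tier;
	return tier;
}
```

A node onlined without an explicit tier ends up in MEMORY_TIER_DRAM, while a request for a tier index at or beyond MAX_MEMORY_TIERS is rejected.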
> > > 
> > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > > 
> > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > > 
> > > Suggested-by: Wei Xu <weixugc@google.com>
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >   include/linux/memory-tiers.h |  20 ++++
> > >   mm/Kconfig                   |  11 ++
> > >   mm/Makefile                  |   1 +
> > >   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > >   4 files changed, 220 insertions(+)
> > >   create mode 100644 include/linux/memory-tiers.h
> > >   create mode 100644 mm/memory-tiers.c
> > > 
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU    0
> > > +#define MEMORY_TIER_DRAM       1
> > > +#define MEMORY_TIER_PMEM       2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU    300
> > > +#define MEMORY_RANK_DRAM       200
> > > +#define MEMORY_RANK_PMEM       100
> > > +
> > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..08a3d330740b 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >   config ARCH_ENABLE_THP_MIGRATION
> > >          bool
> > > 
> > > +config TIERED_MEMORY
> > > +       bool "Support for explicit memory tiers"
> > > +       def_bool n
> > > +       depends on MIGRATION && NUMA
> > > +       help
> > > +         Support to split nodes into memory tiers explicitly and
> > > +         to demote pages on reclaim to lower tiers. This option
> > > +         also exposes sysfs interface to read nodes available in
> > > +         specific tier and to move specific node among different
> > > +         possible tiers.
> > 
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration, the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
> > 
> 
> This was added so that we could avoid doing multiple
> 
> #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> 
> Initially I had that as def_bool y and depends on MIGRATION && NUMA. But 
> it was later suggested that def_bool is not recommended for newer config.
> 
> How about
> 
>   config TIERED_MEMORY
>   	bool "Support for explicit memory tiers"

Do we need to remove this line too, to make the option invisible to users?

Best Regards,
Huang, Ying

> -	def_bool n
> -	depends on MIGRATION && NUMA
> -	help
> -	  Support to split nodes into memory tiers explicitly and
> -	  to demote pages on reclaim to lower tiers. This option
> -	  also exposes sysfs interface to read nodes available in
> -	  specific tier and to move specific node among different
> -	  possible tiers.
> +	def_bool MIGRATION && NUMA
> 
>   config HUGETLB_PAGE_SIZE_VARIABLE
>   	def_bool n
> 
> i.e., we just make it a Kconfig variable without exposing it to the user?
> 
> -aneesh


