From: "ying.huang@intel.com" <ying.huang@intel.com>
To: Wei Xu <weixugc@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Michal Hocko <mhocko@kernel.org>,
	Tim C Chen <tim.c.chen@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dan Williams <dan.j.williams@intel.com>,
	David Rientjes <rientjes@google.com>,
	Linux MM <linux-mm@kvack.org>,
	Brice Goglin <brice.goglin@gmail.com>,
	Hesham Almatary <hesham.almatary@huawei.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
Date: Mon, 16 May 2022 09:57:10 +0800
Message-ID: <83b2229ce198d446ace6112b39ceaa34c0864b41.camel@intel.com>
In-Reply-To: <CAAPL-u-vPgCKVYOLqiScYu5Q_jApPsoHZnNGNfN+k2YuFFR_nw@mail.gmail.com>

On Fri, 2022-05-13 at 00:00 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 7:53 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-12 at 16:37 +0800, ying.huang@intel.com wrote:
> > > On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> > > > On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
> > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > 
> > > > > Wei Xu <weixugc@google.com> writes:
> > > > > 
> > > > > > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > 
> > > > > > > On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > > > > > > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > > > > > > > Sysfs Interfaces
> > > > > > > > > ================
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > 
> > > > > > > > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > 
> > > > > > > > >    Format: node_list
> > > > > > > > > 
> > > > > > > > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > 
> > > > > > > > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > 
> > > > > > > > >    The absolute value of a tier id number has no specific meaning.
> > > > > > > > >    What matters is the relative order of the tier id numbers.
> > > > > > > > > 
> > > > > > > > >    When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > >    sysfs files.
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > 
> > > > > > > > >    where N = 0, 1, ...
> > > > > > > > > 
> > > > > > > > >    Format: int or empty
> > > > > > > > > 
> > > > > > > > >    When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > >    is empty for a CPU-only NUMA node.
> > > > > > > > > 
> > > > > > > > > >    When written, the kernel moves the node into the specified memory
> > > > > > > > > >    tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > >    is not affected.
> > > > > > > > > 
> > > > > > > > >    Initially, we can make this interface read-only.
> > > > > > > > 
> > > > > > > > > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > > > > > > > > the information we need.  Do we really need
> > > > > > > > > "/sys/devices/system/memtier/memtierN/nodelist"?
> > > > > > > > 
> > > > > > > > That can be gotten via a simple shell command line,
> > > > > > > > 
> > > > > > > > > $ grep . /sys/devices/system/node/node*/memtier | sort -n -k 2 -t ':'
> > > > > > > > 
> > > > > > > 
> > > > > > > > It will be really useful to fetch the memory tier node list in an easy
> > > > > > > > fashion rather than reading multiple sysfs directories. If we don't have
> > > > > > > > other attributes for a memory tier, we could make
> > > > > > > > "/sys/devices/system/memtier/memtierN" itself a file holding the NUMA
> > > > > > > > node list, avoiding /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > > -aneesh
> > > > > > 
> > > > > > It is harder to implement memtierN as just a file and doesn't follow
> > > > > > the existing sysfs pattern, either.  Besides, it is extensible to have
> > > > > > memtierN as a directory.
> > > > > 
> > > > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > > > index 6248326f944d..251f38ec3816 100644
> > > > > --- a/drivers/base/node.c
> > > > > +++ b/drivers/base/node.c
> > > > > @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
> > > > >         NULL
> > > > >  };
> > > > > 
> > > > > +#define MAX_TIER 3
> > > > > +nodemask_t memory_tier[MAX_TIER];
> > > > > +
> > > > > +#define _TIER_ATTR_RO(name, tier_index)                                        \
> > > > > +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> > > > > +
> > > > > +struct memory_tier_attr {
> > > > > +       struct device_attribute attr;
> > > > > +       int tier_index;
> > > > > +       int (*write)(nodemask_t nodes);
> > > > > +};
> > > > > +
> > > > > +static ssize_t show_tier(struct device *dev,
> > > > > +                        struct device_attribute *attr, char *buf)
> > > > > +{
> > > > > +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> > > > > +
> > > > > +       return sysfs_emit(buf, "%*pbl\n",
> > > > > +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> > > > > +}
> > > > > +
> > > > >  static const struct attribute_group memory_root_attr_group = {
> > > > >         .attrs = node_state_attrs,
> > > > >  };
> > > > > 
> > > > > +
> > > > > +#define TOP_TIER 0
> > > > > +static struct memory_tier_attr memory_tiers[] = {
> > > > > +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> > > > > +};
> > > > > +
> > > > > +static struct attribute *memory_tier_attrs[] = {
> > > > > +       &memory_tiers[0].attr.attr,
> > > > > +       NULL
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group memory_tier_attr_group = {
> > > > > +       .attrs = memory_tier_attrs,
> > > > > +};
> > > > > +
> > > > >  static const struct attribute_group *cpu_root_attr_groups[] = {
> > > > >         &memory_root_attr_group,
> > > > > +       &memory_tier_attr_group,
> > > > >         NULL,
> > > > >  };
> > > > > 
> > > > > 
> > > > > As long as we have the ability to see the nodelist, I am good with the
> > > > > proposal.
> > > > > 
> > > > > -aneesh
> > > > 
> > > > I am OK with moving back the memory tier nodelist into node/.  When
> > > > there are more memory tier attributes needed, we can then create the
> > > > memory tier subtree and replace the tier nodelist in node/ with
> > > > symlinks.
> > > 
> > > What attributes do you imagine that we may put in memory_tierX/ sysfs
> > > directory?  If we have good candidates in mind, we may just do that.
> > > What I can imagine now is "demote", like "memory_reclaim" in nodeX/ or
> > > node/ directory you proposed before.  Is it necessary to show something
> > > like "meminfo", "vmstat" there?
> > 
> > My words may be confusing, so let me say it in another way.
> 
> I can understand. :)
> 
> > Just to brainstorm, if we have
> > 
> >   /sys/devices/system/memtier/memtierN/
> > 
> > What can we put in it in addition to "nodelist" or links to the nodes?
> > For example,
> > 
> >   /sys/devices/system/memtier/memtierN/demote
> > 
> > When a page count is written to it, the specified number of pages will
> > be demoted from memtierN to memtierN+1, like the
> > /sys/devices/system/node/memory_reclaim interface you proposed before.
> 
> "demote" might be fine to add there.  Just to clarify, we (Google)
> currently don't yet have the need for an interface to do system-wide
> demotion from one tier to another.  What we need is memory.demote
> (similar to memory.reclaim) for memory cgroup based demotions.
> 
> Other things that might be added include tier-specific properties
> (e.g. expected latency and bandwidth when available) and tier-specific
> stats.
> 
> Under /sys/devices/system/memtier/, we may add global properties about
> memory tiers, e.g. max number of tiers, min/max tier ids (which might
> be useful if we hide unpopulated memory tiers).
> 
> > Or, is it necessary to add
> > 
> >   /sys/devices/system/memtier/memtierN/meminfo
> >   /sys/devices/system/memtier/memtierN/vmstat
> 
> The userspace can aggregate such data from node/nodeN/{meminfo,
> vmstat} based on the memory tier nodelist. But I am not against adding
> these files to memtierN/ for user convenience.
> 
> > I don't mean to propose these.  I just want to know whether there's a
> > requirement for this kind of thing, and what else may be required.
> 
> This sounds good.  I think a memtier directory may eventually become a
> necessity, though I don't feel too strongly about adding it right now.

If a memtier directory may eventually become a necessity and we really
want a convenient nodelist somewhere, I'm OK with adding the memtier
directory now.

Best Regards,
Huang, Ying

> > Best Regards,
> > Huang, Ying
> > 
> > > > 
> > > > So the revised sysfs interfaces are:
> > > > 
> > > > * /sys/devices/system/node/memory_tierN (read-only)
> > > > 
> > > >   where N = 0, 1, 2
> > > > 
> > > >   Format: node_list
> > > > 
> > > > * /sys/devices/system/node/nodeN/memory_tier (read/write)
> > > > 
> > > >   where N = 0, 1, ...
> > > > 
> > > >   Format: int or empty
> > > 
> > 
> > 
> > 
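
The point debated above, that the per-node tier file alone is enough to
rebuild a per-tier nodelist in userspace, can be sketched roughly as
follows.  This is only an illustration under stated assumptions: it
assumes the proposed /sys/devices/system/node/nodeN/memtier attribute
(an RFC-stage proposal in this thread) holding "int or empty", and the
function names nodes_by_tier and format_node_list are hypothetical.
The output uses the same range-list style that the kernel's "%*pbl"
format produces.

```python
import os
import re
from collections import defaultdict


def nodes_by_tier(node_root="/sys/devices/system/node", attr="memtier"):
    """Group NUMA node ids by the tier id read from node_root/nodeN/<attr>.

    The attribute name and its "int or empty" format follow the RFC text;
    they are assumptions, not an established kernel ABI.
    """
    tiers = defaultdict(list)
    for entry in os.listdir(node_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip non-node entries such as "possible" or "power"
        path = os.path.join(node_root, entry, attr)
        try:
            with open(path) as f:
                text = f.read().strip()
        except OSError:
            continue  # attribute not present on this kernel
        if not text:
            continue  # empty value: CPU-only node in the proposal
        tiers[int(text)].append(int(match.group(1)))
    return {tier: sorted(nodes) for tier, nodes in sorted(tiers.items())}


def format_node_list(nodes):
    """Render sorted node ids as a range list, e.g. [0, 1, 4] -> "0-1,4"."""
    parts = []
    start = prev = None
    for n in nodes:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

Calling format_node_list() on each value returned by nodes_by_tier()
gives one line per populated tier, which is roughly what a read of the
proposed memtierN/nodelist file would return.  A dedicated kernel file
would still save callers from walking every nodeN directory.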


