All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	Wei Xu <weixugc@google.com>, Huang Ying <ying.huang@intel.com>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
Date: Mon, 13 Jun 2022 10:05:06 -0400	[thread overview]
Message-ID: <YqdEEhJFr3SlfvSJ@cmpxchg.org> (raw)
In-Reply-To: <20220610105708.0000679b@Huawei.com>

On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> On Thu, 9 Jun 2022 16:41:04 -0400
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > Would it make more sense to have the platform/devicetree/driver
> > provide more fine-grained distance values similar to NUMA distances,
> > and have a driver-scope tunable to override/correct? And then have the
> > distance value function as the unique tier ID and rank in one.
> 
> Absolutely a good thing to provide that information, but it's black
> magic. There are too many contradicting metrics (latency vs bandwidth etc)
> even not including a more complex system model like Jerome Glisse proposed
> a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
> CXL 2.0 got this more right than anything else I've seen as provides
> discoverable topology along with details like latency to cross between
> particular switch ports.  Actually using that data (other than by throwing
> it to userspace controls for HPC apps etc) is going to take some figuring out.
> Even the question of what + how we expose this info to userspace is non
> obvious.

Right, I don't think those would be scientifically accurate - but
neither is a number between 1 and 3. The way I look at it is more
about spreading out the address space a bit, to allow expressing
nuanced differences without risking conflicts and overlaps. Hopefully
this results in the shipped values stabilizing over time and thus
requiring less and less intervention and overriding from userspace.

> > Going further, it could be useful to separate the business of hardware
> > properties (and configuring quirks) from the business of configuring
> > MM policies that should be applied to the resulting tier hierarchy.
> > They're somewhat orthogonal tuning tasks, and one of them might become
> > obsolete before the other (if the quality of distance values provided
> > by drivers improves before the quality of MM heuristics ;). Separating
> > them might help clarify the interface for both designers and users.
> > 
> > E.g. a memdev class scope with a driver-wide distance value, and a
> > memdev scope for per-device values that default to "inherit driver
> > value". The memtier subtree would then have an r/o structure, but
> > allow tuning per-tier interleaving ratio[1], demotion rules etc.
> 
> Ok that makes sense.  I'm not sure if that ends up as an implementation
> detail, or effects the userspace interface of this particular element.
> 
> I'm not sure completely read only is flexible enough (though mostly RO is fine)
> as we keep sketching out cases where any attempt to do things automatically
> does the wrong thing and where we need to add an extra tier to get
> everything to work.  Short of having a lot of tiers I'm not sure how
> we could have the default work well.  Maybe a lot of "tiers" is fine
> though perhaps we need to rename them if going this way and then they
> don't really work as current concept of tier.
> 
> Imagine a system with subtle difference between different memories such
> as 10% latency increase for same bandwidth.  To get an advantage from
> demoting to such a tier will require really stable usage and long
> run times. Whilst you could design a demotion scheme that takes that
> into account, I think we are a long way from that today.

Good point: there can be a clear hardware difference, but it's a
policy choice whether the MM should treat them as one or two tiers.

What do you think of a per-driver/per-device (overridable) distance
number, combined with a configurable distance cutoff for what
constitutes separate tiers. E.g. cutoff=20 means two devices with
distances of 10 and 20 respectively would be in the same tier, devices
with 10 and 100 would be in separate ones. The kernel then generates
and populates the tiers based on distances and grouping cutoff, and
populates the memtier directory tree and nodemasks in sysfs.

It could be simple tier0, tier1, tier2 numbering again, but the
numbers now would mean something to the user. A rank tunable is no
longer necessary.

I think even the nodemasks in the memtier tree could be read-only
then, since corrections should only be necessary when either the
device distance is wrong or the tier grouping cutoff.

Can you think of scenarios where that scheme would fall apart?

  reply	other threads:[~2022-06-13 16:13 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-06-07 18:43   ` Tim Chen
2022-06-07 20:18     ` Wei Xu
2022-06-08  4:30     ` Aneesh Kumar K V
2022-06-08  6:06       ` Ying Huang
2022-06-08  4:37     ` Aneesh Kumar K V
2022-06-08  6:10       ` Ying Huang
2022-06-08  8:04         ` Aneesh Kumar K V
2022-06-07 21:32   ` Yang Shi
2022-06-08  1:34     ` Ying Huang
2022-06-08 16:37       ` Yang Shi
2022-06-09  6:52         ` Ying Huang
2022-06-08  4:58     ` Aneesh Kumar K V
2022-06-08  6:18       ` Ying Huang
2022-06-08 16:42       ` Yang Shi
2022-06-09  8:17         ` Aneesh Kumar K V
2022-06-09 16:04           ` Yang Shi
2022-06-08 14:11   ` Johannes Weiner
2022-06-08 14:21     ` Aneesh Kumar K V
2022-06-08 15:55     ` Johannes Weiner
2022-06-08 16:13       ` Aneesh Kumar K V
2022-06-08 18:16         ` Johannes Weiner
2022-06-09  2:33           ` Aneesh Kumar K V
2022-06-09 13:55             ` Johannes Weiner
2022-06-09 14:22               ` Jonathan Cameron
2022-06-09 20:41                 ` Johannes Weiner
2022-06-10  6:15                   ` Ying Huang
2022-06-10  9:57                   ` Jonathan Cameron
2022-06-13 14:05                     ` Johannes Weiner [this message]
2022-06-13 14:23                       ` Aneesh Kumar K V
2022-06-13 15:50                         ` Johannes Weiner
2022-06-14  6:48                           ` Ying Huang
2022-06-14  8:01                           ` Aneesh Kumar K V
2022-06-14 18:56                             ` Johannes Weiner
2022-06-15  6:23                               ` Aneesh Kumar K V
2022-06-16  1:11                               ` Ying Huang
2022-06-16  3:45                                 ` Wei Xu
2022-06-16  4:47                                   ` Aneesh Kumar K V
2022-06-16  5:51                                     ` Ying Huang
2022-06-17 10:41                                 ` Jonathan Cameron
2022-06-20  1:54                                   ` Huang, Ying
2022-06-14 16:45                       ` Jonathan Cameron
2022-06-21  8:27                         ` Aneesh Kumar K V
2022-06-03 13:42 ` [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
2022-06-07 20:15   ` Tim Chen
2022-06-08  4:55     ` Aneesh Kumar K V
2022-06-08  6:42       ` Ying Huang
2022-06-08 16:06       ` Tim Chen
2022-06-08 16:15         ` Aneesh Kumar K V
2022-06-03 13:42 ` [PATCH v5 3/9] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-06-06 13:39   ` Bharata B Rao
2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-06-07 22:51   ` Tim Chen
2022-06-08  5:02     ` Aneesh Kumar K V
2022-06-08  6:52     ` Ying Huang
2022-06-08  6:50   ` Ying Huang
2022-06-08  8:19     ` Aneesh Kumar K V
2022-06-08  8:00   ` Ying Huang
2022-06-03 13:42 ` [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
2022-06-07 23:40   ` Tim Chen
2022-06-08  6:59   ` Ying Huang
2022-06-08  8:20     ` Aneesh Kumar K V
2022-06-08  8:23       ` Ying Huang
2022-06-08  8:29         ` Aneesh Kumar K V
2022-06-08  8:34           ` Ying Huang
2022-06-03 13:42 ` [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
2022-06-06  3:11   ` Ying Huang
2022-06-06  3:52     ` Aneesh Kumar K V
2022-06-06  7:24       ` Ying Huang
2022-06-06  8:33         ` Aneesh Kumar K V
2022-06-08  7:26           ` Ying Huang
2022-06-08  8:28             ` Aneesh Kumar K V
2022-06-08  8:32               ` Ying Huang
2022-06-08 14:37                 ` Aneesh Kumar K.V
2022-06-08 20:14                   ` Tim Chen
2022-06-10  6:04                   ` Ying Huang
2022-06-06  4:53 ` [PATCH] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
2022-06-08 13:57 ` [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Johannes Weiner
2022-06-08 14:20   ` Aneesh Kumar K V
2022-06-09  8:53     ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YqdEEhJFr3SlfvSJ@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=apopple@nvidia.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=brice.goglin@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=feng.tang@intel.com \
    --cc=gthelen@google.com \
    --cc=hesham.almatary@huawei.com \
    --cc=jvgediya@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=tim.c.chen@intel.com \
    --cc=weixugc@google.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.