Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers

From: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
To: Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	Wei Xu <weixugc@google.com>, Huang Ying <ying.huang@intel.com>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
Date: Mon, 13 Jun 2022 19:53:03 +0530
Message-ID: <4297bd21-e984-9d78-2bca-e70c11749a72@linux.ibm.com>
In-Reply-To: <YqdEEhJFr3SlfvSJ@cmpxchg.org>

On 6/13/22 7:35 PM, Johannes Weiner wrote:
> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
>>

....

>> I'm not sure completely read only is flexible enough (though mostly RO is fine)
>> as we keep sketching out cases where any attempt to do things automatically
>> does the wrong thing and where we need to add an extra tier to get
>> everything to work.  Short of having a lot of tiers I'm not sure how
>> we could have the default work well.  Maybe a lot of "tiers" is fine
>> though perhaps we need to rename them if going this way and then they
>> don't really work as current concept of tier.
>>
>> Imagine a system with subtle difference between different memories such
>> as 10% latency increase for same bandwidth.  To get an advantage from
>> demoting to such a tier will require really stable usage and long
>> run times. Whilst you could design a demotion scheme that takes that
>> into account, I think we are a long way from that today.
> 
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
> 
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.
> 
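
To make the cutoff idea concrete, here is a rough sketch of one possible
reading of it. It is not from the patch series. The function name, the
device names, and the grouping rule are all assumptions: devices are
sorted by distance, and a device stays in the current tier while its
distance is less than `cutoff` above the first device of that tier. What
happens when the gap is exactly equal to the cutoff is not specified in
the proposal; the sketch arbitrarily starts a new tier in that case.

```python
# Hypothetical sketch: group devices into tiers by a distance cutoff.
# Assumption: a device joins the current tier while its distance is
# less than `cutoff` above the tier's first (smallest-distance) device.
# Other readings of the proposal are possible.

def build_tiers(distances, cutoff):
    """Return a list of tiers; each tier is a list of device names."""
    tiers = []
    tier_base = None  # distance of the first device in the current tier
    # Sort by distance, then by name so ties are ordered deterministically.
    for name, dist in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        if tier_base is None or dist - tier_base >= cutoff:
            tiers.append([])  # gap too large: open a new tier
            tier_base = dist
        tiers[-1].append(name)
    return tiers


if __name__ == "__main__":
    # Illustrative distances only; "dram", "cxl" and "pmem" are made-up labels.
    print(build_tiers({"dram": 10, "cxl": 20, "pmem": 100}, cutoff=20))
```

With cutoff=20 this puts the devices at distances 10 and 20 in one tier
and the device at distance 100 in a separate one, matching the example
in the quoted text.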

Right now core/generic code doesn't get involved in building tiers. It
just defines three tiers where drivers can place the respective
devices they manage. Wouldn't the above suggestion imply that we are
moving quite a lot of policy decision logic into the generic code?

At some point, we will have to depend on attributes other than
distance (maybe HMAT?), and each driver should have the flexibility to
place the device it manages in a specific tier. By then we may
decide to support more than the 3 static tiers that the core kernel
currently defines.

If the kernel still can't make the right decision, userspace could
rearrange the tiers in any order using rank values. Without something
like rank, it gets hard for userspace to fix things up when devices are
hotplugged. For example, the userspace policy could be that any new PMEM
tier device that is hotplugged is parked with a very low rank value, and
hence is lowest in demotion order by default
(echo 10 > /sys/devices/system/memtier/memtier2/rank). After that,
userspace could selectively move the new devices to the correct memory
tier.
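
As a rough illustration of that policy, the sketch below models the
rank-based ordering in plain Python. It does not touch sysfs. The helper
names and the rank values 300, 200 and 250 are invented for the example;
only the parking rank of 10 comes from the echo command above. It assumes
that a higher rank places a tier earlier in the demotion order.

```python
# Hypothetical model of rank-based tier ordering. Helper names and most
# rank values are illustrative and not taken from the patch series.

PARK_RANK = 10  # low rank used to park a newly hotplugged tier


def demotion_order(ranks):
    """Return tier names from highest rank to lowest rank."""
    return sorted(ranks, key=lambda tier: ranks[tier], reverse=True)


def hotplug_tier(ranks, tier, rank=PARK_RANK):
    """Return a new mapping with the hotplugged tier parked at a low rank."""
    updated = dict(ranks)
    updated[tier] = rank
    return updated


if __name__ == "__main__":
    ranks = {"memtier0": 300, "memtier1": 200}   # assumed existing tiers
    ranks = hotplug_tier(ranks, "memtier2")      # new tier parked at rank 10
    print(demotion_order(ranks))                 # new tier sorts last

    ranks["memtier2"] = 250                      # userspace later fixes it up
    print(demotion_order(ranks))                 # new tier now sits in between
```

The point of the sketch is that the parked tier always lands at the end
of the order, whatever tiers already exist, and a single rank write is
enough to move it afterwards.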

> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.
> 
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
> 
> Can you think of scenarios where that scheme would fall apart?

-aneesh