From: Alistair Popple <apopple@nvidia.com>
To: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Linux MM <linux-mm@kvack.org>, Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Michal Hocko <mhocko@kernel.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Tim Chen <tim.c.chen@linux.intel.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Tue, 10 May 2022 15:37:33 +1000	[thread overview]
Message-ID: <875ymerl81.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <CAAPL-u-0HwL6p1SA73LPfFyywG55QqE9O+q=83fhShoJAVVxyQ@mail.gmail.com>


Wei Xu <weixugc@google.com> writes:

> On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> [...]
>>
>> >> >
>> >> >
>> >> > Tiering Hierarchy Initialization
>> >> > `=============================='
>> >> >
>> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>> >> >
>> >> > A device driver can remove its memory nodes from the top tier, e.g.
>> >> > a dax driver can remove PMEM nodes from the top tier.
>> >>
>> >> With the topology built by firmware we should not need this.
>>
>> I agree that in an ideal world the hierarchy should be built by firmware based
>> on something like the HMAT. But I also think being able to override this will be
>> useful in getting there. Therefore a way of overriding the generated hierarchy
>> would be good, either via sysfs or kernel boot parameter if we don't want to
>> commit to a particular user interface now.
>>
>> However I'm less sure letting device drivers override this is a good idea. How,
>> for example, would a GPU driver make sure its node is in the top tier? By moving
>> every node that the driver does not know about out of N_TOPTIER_MEMORY? That
>> could get messy if, say, there were two drivers both of which wanted their nodes
>> to be in the top tier.
>
> The suggestion is to allow a device driver to opt out its memory
> devices from the top-tier, not the other way around.

So how would demotion work in the case of accelerators then? In that
case we would want GPU memory to demote to DRAM, but that won't happen
if both DRAM and GPU memory are in N_TOPTIER_MEMORY, and it seems the
only override available with this proposal would be to move GPU memory
into a lower tier, which is the opposite of what's needed there.
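
To make that concrete, what the accelerator case needs is essentially the
layout from Example 5 further down, with the GPU node in the top tier
(purely illustrative, reusing the memory_tiers / node_demotion[] notation
from the RFC; the node numbering is made up):

/sys/devices/system/node/memory_tiers
1        (GPU)
0        (DRAM)
2        (PMEM)

node_demotion[]:
  1: [0], [0]
  0: [2], [2]
  2: [],  []

An interface that only lets a driver opt its nodes *out* of
N_TOPTIER_MEMORY can't produce that layout - it can push the GPU node
down, but it can never lift it above the DRAM nodes.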

>
> I agree that the kernel should still be responsible for the final
> node-tier assignment by taking into account all factors: the firmware
> tables, device driver requests, and user-overrides (kernel argument or
> sysfs).
>
>> > I agree. But before we have such firmware, the kernel needs to do
>> > its best to initialize memory tiers.
>> >
>> > Given that we know PMEM is slower than DRAM, but a dax device might
>> > not be PMEM, a better place to set the tier for PMEM nodes would be the
>> > ACPI code, e.g. acpi_numa_memory_affinity_init(), where we can examine
>> > the ACPI_SRAT_MEM_NON_VOLATILE bit.
>> >
>> >> >
>> >> > The kernel builds the memory tiering hierarchy and per-node demotion
>> >> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> >> > best distance nodes in the next lower tier are assigned to
>> >> > node_demotion[N].preferred and all the nodes in the next lower tier
>> >> > are assigned to node_demotion[N].allowed.
>> >>
>> >> I'm not sure whether it should be allowed to demote to multiple lower
>> >> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>> >> figure out a good way to define demotion targets, it could be extended
>> >> to support this easily.
>> >
>> > You mean to only support MAX_TIERS=2 for now.  I am fine with that.
>> > There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
>> > not clear yet whether we want to enable transparent memory tiering to
>> > all the 3 tiers on such systems.
>>
>> At some point I think we will need to deal with 3 tiers but I'd be ok with
>> limiting it to 2 for now if it makes things simpler.
>>
>> - Alistair
>>
>> >> >
>> >> > node_demotion[N].preferred can be empty if no preferred demotion node
>> >> > is available for node N.
>> >> >
>> >> > If userspace overrides the tiers via the memory_tiers sysfs
>> >> > interface, the kernel then only rebuilds the per-node demotion order
>> >> > accordingly.
>> >> >
>> >> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>> >> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>> >> > node.
>> >> >
>> >> >
>> >> > Memory Allocation for Demotion
>> >> > `============================'
>> >> >
>> >> > When allocating a new demotion target page, both a preferred node
>> >> > and the allowed nodemask are provided to the allocation function.
>> >> > The default kernel allocation fallback order is used to allocate the
>> >> > page from the specified node and nodemask.
>> >> >
>> >> > The mempolicy of the cpuset, vma and owner task of the source page can
>> >> > be set to refine the demotion nodemask, e.g. to prevent demotion or
>> >> > to select a particular allowed node as the demotion target.
>> >> >
>> >> >
>> >> > Examples
>> >> > `======'
>> >> >
>> >> > * Example 1:
>> >> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>> >> >
>> >> >   Node 0 has node 2 as the preferred demotion target and can also
>> >> >   fallback demotion to node 3.
>> >> >
>> >> >   Node 1 has node 3 as the preferred demotion target and can also
>> >> >   fallback demotion to node 2.
>> >> >
>> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >> >   e.g. cpuset.mems=0,2
>> >> >
>> >> > node distances:
>> >> > node   0    1    2    3
>> >> >    0  10   20   30   40
>> >> >    1  20   10   40   30
>> >> >    2  30   40   10   40
>> >> >    3  40   30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2-3
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2-3]
>> >> >   1: [3], [2-3]
>> >> >   2: [],  []
>> >> >   3: [],  []
>> >> >
>> >> > * Example 2:
>> >> >   Node 0 & 1 are DRAM nodes.
>> >> >   Node 2 is a PMEM node and closer to node 0.
>> >> >
>> >> >   Node 0 has node 2 as the preferred and only demotion target.
>> >> >
>> >> >   Node 1 has no preferred demotion target, but can still demote
>> >> >   to node 2.
>> >> >
>> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >> >   e.g. cpuset.mems=0,2
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   40
>> >> >    2  30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [],  [2]
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 3:
>> >> >   Node 0 & 1 are DRAM nodes.
>> >> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>> >> >
>> >> >   Node 0 has node 2 as the preferred and only demotion target.
>> >> >
>> >> >   Node 1 has node 2 as the preferred and only demotion target.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   30
>> >> >    2  30   30   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [2], [2]
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 4:
>> >> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>> >> >
>> >> >   All nodes are top-tier.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   30
>> >> >    2  30   30   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-2
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [],  []
>> >> >   1: [],  []
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 5:
>> >> >   Node 0 is a DRAM node with CPU.
>> >> >   Node 1 is an HBM node.
>> >> >   Node 2 is a PMEM node.
>> >> >
>> >> >   With userspace override, node 1 is the top tier and has node 0 as
>> >> >   the preferred and only demotion target.
>> >> >
>> >> >   Node 0 is in the second tier, tier 1, and has node 2 as the
>> >> >   preferred and only demotion target.
>> >> >
>> >> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   21   30
>> >> >    1  21   10   40
>> >> >    2  30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers (userspace override)
>> >> > 1
>> >> > 0
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [0], [0]
>> >> >   2: [],  []
>> >> >
>> >> > -- Wei
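
Coming back to the "Memory Allocation for Demotion" section above, my
reading of the proposed per-node state is roughly the following. This is
just a sketch to check my understanding - the struct and function names
are made up and the allocation call is simplified, it is not the existing
demotion code in mm/migrate.c:

/* Hypothetical per-node demotion state as described in the RFC. */
struct demotion_targets {
        nodemask_t preferred;   /* best-distance nodes in the next lower tier */
        nodemask_t allowed;     /* all nodes in the next lower tier */
};
static struct demotion_targets node_demotion[MAX_NUMNODES];

/*
 * On the demotion path both the preferred node and the allowed nodemask
 * would be handed to the page allocator, which then falls back in the
 * usual zonelist order within the allowed mask.
 */
static struct page *alloc_demote_target(int src_nid, unsigned int order)
{
        int nid = first_node(node_demotion[src_nid].preferred);

        if (nid >= MAX_NUMNODES)
                nid = src_nid;  /* no preferred target, rely on the mask */

        return __alloc_pages(GFP_NOWAIT | __GFP_NOWARN, order, nid,
                             &node_demotion[src_nid].allowed);
}

If that reading is right, then restricting things to two tiers for now
just means the allowed mask never spans more than one lower tier, which
looks easy enough to relax later.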

