linux-mm.kvack.org archive mirror
From: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
To: Hesham Almatary <hesham.almatary@huawei.com>,
	Yang Shi <shy828301@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Linux MM <linux-mm@kvack.org>, Greg Thelen <gthelen@google.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alistair Popple <apopple@nvidia.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Michal Hocko <mhocko@kernel.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Wei Xu <weixugc@google.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Tue, 10 May 2022 17:40:10 +0530
Message-ID: <e1bf6346-fd93-13ee-0b38-c1d956df0e99@linux.ibm.com>
In-Reply-To: <c272e43d-47c5-d7d4-cb17-95dc6f28f5cd@huawei.com>

On 5/10/22 3:29 PM, Hesham Almatary wrote:
> Hello Yang,
> 
> On 5/10/2022 4:24 AM, Yang Shi wrote:
>> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
>> <hesham.almatary@huawei.com> wrote:


...

>>>
>>> Node 0 has a CPU and DDR memory in tier 0, node 1 has a GPU and DDR
>>> memory in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some
>>> sort of bigger memory (could be a bigger DDR or something) in tier 2.
>>> The distances are as follows:
>>>
>>> --------------          --------------
>>> |   Node 0   |          |   Node 1   |
>>> |  -------   |          |  -------   |
>>> | |  DDR  |  |          | |  DDR  |  |
>>> |  -------   |          |  -------   |
>>> |            |          |            |
>>> --------------          --------------
>>>          | 20               | 120    |
>>>          v                  v        |
>>> ----------------------------       |
>>> | Node 2     PMEM          |       | 100
>>> ----------------------------       |
>>>          | 100                       |
>>>          v                           v
>>> --------------------------------------
>>> | Node 3    Large mem                |
>>> --------------------------------------
>>>
>>> node distances:
>>> node   0    1    2    3
>>>    0  10   20   20  120
>>>    1  20   10  120  100
>>>    2  20  120   10  100
>>>    3 120  100  100   10
>>>
>>> /sys/devices/system/node/memory_tiers
>>> 0-1
>>> 2
>>> 3
>>>
>>> N_TOPTIER_MEMORY: 0-1
>>>
>>>
>>> In this case, we want to be able to "skip" the demotion path from Node 1
>>> to Node 2 and make demotion go directly to Node 3, as it is closer
>>> distance-wise. How can we accommodate this scenario (or at least not
>>> rule it out as future work) with the current RFC?
>> If I remember correctly, NUMA distance is hardcoded in the SLIT by the
>> firmware and is supposed to reflect the latency. So I suppose it is
>> the firmware's responsibility to have correct information. And the RFC
>> assumes higher tier memory has better performance than lower tier
>> memory (latency, bandwidth, throughput, etc.), so it sounds like
>> buggy firmware to have lower tier memory with a shorter distance than
>> higher tier memory IMHO.
> 
> You are correct if you're assuming the topology is all hierarchically
> symmetric, but unfortunately, in real hardware (e.g., my example above)
> it is not. The distance/latency from two nodes in the same tier to a
> third node can be different. The firmware still provides the correct
> latency, but putting a node in a tier is up to the kernel/user, and
> is relative: e.g., Node 3 could belong to tier 1 from Node 1's
> perspective, but to tier 2 from Node 0's.
> 
> 
> A more detailed example (building on my previous one) is when having
> the GPU connected to a switch:
> 
> ----------------------------
> | Node 2     PMEM          |
> ----------------------------
>        ^
>        |
> --------------          --------------
> |   Node 0   |          |   Node 1   |
> |  -------   |          |  -------   |
> | |  DDR  |  |          | |  DDR  |  |
> |  -------   |          |  -------   |
> |    CPU     |          |    GPU     |
> --------------          --------------
>         |                  |
>         v                  v
> ----------------------------
> |         Switch           |
> ----------------------------
>         |
>         v
> --------------------------------------
> | Node 3    Large mem                |
> --------------------------------------
> 
> Here, demoting from Node 1 to Node 3 directly would be faster as
> it only has to go through one hub, compared to demoting from Node 1
> to Node 2, where it goes through two hubs. I hope that example
> clarifies things a little bit.
> 
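
The point made in the quoted text can be checked against the distance table
from the first example. The following is only a rough sketch, not anything
proposed in the RFC: it assumes a hypothetical policy of "demote to the
closest node in any lower tier", and the names DISTANCE, NODE_TIER and
nearest_lower_tier_node are made up for illustration.

```python
# Distance table and tier assignment copied from the first quoted example.
# All names below are illustrative; none of them exist in the kernel.
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]
NODE_TIER = {0: 0, 1: 0, 2: 1, 3: 2}  # node id -> tier number (0 is the top tier)


def nearest_lower_tier_node(src):
    """Return the closest node in any lower tier than src, or None."""
    candidates = [n for n, tier in NODE_TIER.items() if tier > NODE_TIER[src]]
    if not candidates:
        return None
    return min(candidates, key=lambda n: DISTANCE[src][n])


for node in sorted(NODE_TIER):
    print(node, nearest_lower_tier_node(node))
```

Under that assumed policy, Node 0 picks Node 2 (distance 20) while Node 1
picks Node 3 (distance 100 rather than 120 to Node 2), which is the "skip"
described in the quoted text.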

Alistair mentioned that we want to consider GPU memory to be expensive
and want to demote from GPU to regular DRAM. In that case, for the
example above, we should end up with:

tier 0 -> Node3
tier 1 -> Node0, Node1
tier 2 -> Node2

Hence:

  node 0: allowed = 2
  node 1: allowed = 2
  node 2: allowed = empty
  node 3: allowed = 0-1, based on fallback order 1, 0
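
As a rough sketch of how the values above could be derived (this is not
kernel code): it assumes that a node's allowed targets are the nodes of the
next lower tier, ordered by distance, and that the distance table from the
first quoted example still applies. Both are assumptions chosen because they
reproduce the listed values; TIERS, DISTANCE and demotion_targets are
hypothetical names.

```python
# Tier assignment from the list above; distance table from the first
# quoted example. Names are illustrative only.
TIERS = {0: [3], 1: [0, 1], 2: [2]}  # tier number -> node ids
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]


def demotion_targets(tiers, distance):
    """Map each node to the next-lower-tier nodes, nearest first."""
    order = sorted(tiers)
    targets = {}
    for i, tier in enumerate(order):
        # The lowest tier has nowhere to demote to, so its list is empty.
        next_nodes = tiers[order[i + 1]] if i + 1 < len(order) else []
        for node in tiers[tier]:
            targets[node] = sorted(next_nodes, key=lambda n: distance[node][n])
    return targets


for node, allowed in sorted(demotion_targets(TIERS, DISTANCE).items()):
    print(f"node {node}: allowed = {allowed}")
```

With these inputs, node 3 gets [1, 0] because its distance to node 1 (100)
is shorter than its distance to node 0 (120), nodes 0 and 1 get [2], and
node 2 gets an empty list.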

-aneesh
