From: "Huang, Ying" <ying.huang@intel.com>
To: Hillf Danton <hdanton@sina.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@suse.de>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [RFC -V5 1/6] NUMA balancing: optimize page placement for memory tiering system
Date: Fri, 05 Feb 2021 16:37:48 +0800	[thread overview]
Message-ID: <87pn1ekizn.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <20210205075312.2515-1-hdanton@sina.com> (Hillf Danton's message of "Fri, 5 Feb 2021 15:53:12 +0800")

Hillf Danton <hdanton@sina.com> writes:

> On Thu,  4 Feb 2021 18:10:51 +0800 Huang Ying wrote:
>> With the advent of various new memory types, some machines will have
>> multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
>> memory subsystem of such machines can be called a memory tiering
>> system, because the performance of the different types of memory is
>> usually different.
>> 
>> In such a system, because the memory access pattern changes over
>> time, some pages in the slow memory may become globally hot.  So in
>> this patch, the NUMA balancing mechanism is enhanced to optimize the
>> page placement among the different memory types according to their
>> dynamic hot/cold state.
>> 
>> In a typical memory tiering system, there are CPUs, fast memory and
>> slow memory in each physical NUMA node.  The CPUs and the fast memory
>> will be put in one logical node (called the fast memory node), while
>> the slow memory will be put in another (fake) logical node (called
>> the slow memory node).  That is, the fast memory is regarded as
>> local, while the
>> slow memory is regarded as remote.  So it's possible for the recently
>> accessed pages in the slow memory node to be promoted to the fast
>> memory node via the existing NUMA balancing mechanism.
>> 
>> The original NUMA balancing mechanism stops migrating pages if the
>> free memory of the target node would drop below the high watermark.
>> This is a reasonable policy if there's only one memory type.  But it
>> makes the original NUMA balancing mechanism almost useless for
>> optimizing page placement among different memory types.  Details are
>> as follows.
>> 
>> It's common for the working-set size of the workload to be larger
>> than the size of the fast memory nodes.  Otherwise, there would be
>> no need to use the slow memory at all.  So in the common case, there
>> are almost never enough free pages in the fast memory nodes, and the
>> globally hot pages in the slow memory node cannot be promoted to the
>> fast memory node.
>
> Under assumptions like
>
> 1/ the workload's working set size is 1.5x larger than one DRAM node,
> 2/ PMEM is 10x (or 5x) larger than DRAM,
>
> what difference would it make if the spinning hard disk used for
> swap were replaced with PMEM?  With PMEM swap, page demotion is just
> swapout, and we pay nothing for page promotion.

Per my understanding, this is the difference between PMEM as swap and
accessing PMEM directly + promotion.

PMEM as swap:

- PMEM will not be accessed directly; any DRAM miss will trigger
  swapping in.  That is, one cache line access is inflated into a 4KB
  page access (4096 / 64 = 64 cache lines).  And direct page reclaim
  may be triggered, so the access latency is almost unbounded (see the
  back-of-envelope sketch below).

- The good part is that if the PMEM page is very hot, we will put the
  page in DRAM on the first access.
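
As a back-of-envelope illustration of how much the swap path inflates
a single access: the latencies below are made-up placeholders, not
measurements, and this is just a toy userspace program.

  /* Hypothetical numbers for illustration only. */
  #include <stdio.h>

  int main(void)
  {
          const double dram_ns   = 100;    /* assumed DRAM access latency */
          const double pmem_ns   = 300;    /* assumed direct PMEM access latency */
          const double swapin_ns = 10000;  /* assumed fault + 4KB swap-in cost */

          /* One cache-line miss under "PMEM as swap" drags in a whole
           * page: 4096 / 64 = 64 cache lines of traffic for 1 line used. */
          printf("traffic inflation:      %d cache lines\n", 4096 / 64);
          printf("swap-in vs direct PMEM: %.0fx\n", swapin_ns / pmem_ns);
          printf("direct PMEM vs DRAM:    %.0fx\n", pmem_ns / dram_ns);
          return 0;
  }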

promotion + accessing PMEM directly:

- PMEM may be accessed directly.  The latency of PMEM is longer than
  that of DRAM, but much shorter than that of swapping in.  And we
  avoid triggering direct reclaim for page promotion.

- The bad part is that a very hot PMEM page may be accessed directly
  for a while before being promoted to DRAM, because it takes some
  time to identify whether a page is hot or not (see the sketch
  below).
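
The identification delay above comes from the hot page selection used
in this series ([RFC -V5 3/6]), which is based on the hint page fault
latency.  Very roughly, the idea looks like the sketch below; this is
a simplified sketch of the approach, not the actual patch code, and
page_scan_time() is a hypothetical stand-in for the real bookkeeping:

  static bool page_is_hot(struct page *page, unsigned long threshold_ms)
  {
          /* Elapsed time between the NUMA balancing scan unmapping
           * the page and the hint fault that re-touches it.  A short
           * gap means the page was accessed soon after the scan,
           * i.e. it is likely hot. */
          unsigned long latency =
                  jiffies_to_msecs(jiffies - page_scan_time(page));

          return latency < threshold_ms;
  }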

In other words, swap guarantees that the very hot pages are always
accessed in DRAM, while the promotion + direct PMEM access solution
avoids moving very cold pages to DRAM, so that page thrashing can be
avoided.

If the pages we put in PMEM will almost never be accessed, then PMEM
as swap may be a suitable solution too.  But if not, promotion +
accessing PMEM directly generally works better.
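
For reference, the watermark gate described in the quoted patch text
works roughly like the sketch below (modeled on
migrate_balanced_pgdat() in mm/migrate.c; simplified, and the details
vary across kernel versions):

  /* Allow migration to a node only if some zone would still be above
   * its high watermark after receiving nr_migrate_pages pages. */
  static bool can_migrate_to(struct pglist_data *pgdat,
                             unsigned long nr_migrate_pages)
  {
          int z;

          for (z = pgdat->nr_zones - 1; z >= 0; z--) {
                  struct zone *zone = pgdat->node_zones + z;

                  if (!populated_zone(zone))
                          continue;

                  if (zone_watermark_ok(zone, 0,
                                        high_wmark_pages(zone) +
                                                nr_migrate_pages,
                                        ZONE_MOVABLE, 0))
                          return true;
          }
          return false;
  }

This is the check that almost always fails for a fast memory node in
the common case described above, which is why hot pages in the slow
memory node are rarely promoted without this series.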

Best Regards,
Huang, Ying

[snip]
