From: Wei Xu <weixugc@google.com>
To: "ying.huang@intel.com" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Michal Hocko <mhocko@kernel.org>,
	Tim C Chen <tim.c.chen@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dan Williams <dan.j.williams@intel.com>,
	David Rientjes <rientjes@google.com>,
	Linux MM <linux-mm@kvack.org>,
	Brice Goglin <brice.goglin@gmail.com>,
	Hesham Almatary <hesham.almatary@huawei.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
Date: Thu, 12 May 2022 23:36:56 -0700
Message-ID: <CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com>
In-Reply-To: <69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com>

On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node selection
> > then follows the allocation fallback order that the kernel has
> > already defined.
> >
> > The pseudo code looks like:
> >
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = node_tier_map[src_nid];
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tiers[i]);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> >
> > The mempolicy of the cpuset, vma, and owner task of the source page
> > can be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
>
> Consider a system with 3 tiers.  If we want to demote some pages from
> tier 0, the desired behavior is:
>
> - Allocate pages from tier 1.
> - If there are not enough free pages in tier 1, wake up kswapd of
>   tier 1 to demote some pages from tier 1 to tier 2.
> - If there are still not enough free pages in tier 1, allocate pages
>   from tier 2.
>
> In this way, tier 0 will have the hottest pages, while tier 1 will have
> the coldest pages.

When we are already in the allocation path for demoting a page from
tier 0, I think we should not block the allocation to wait for kswapd
to demote pages from tier 1 to tier 2.  Instead, we should allocate
directly from tier 2.  Meanwhile, this demotion can wake up kswapd to
demote from tier 1 to tier 2 in the background.

> With your proposed method, the behavior when demoting from tier 0 is:
>
> - Allocate pages from tier 1.
> - If there are not enough free pages in tier 1, allocate pages from
>   tier 2.
>
> The kswapd of tier 1 will not be woken up until there are not enough
> free pages in tier 2.  For quite a long time, there will not be much
> hot/cold differentiation between tier 1 and tier 2.

This is true with the current allocation code. But I think we can make
some changes for demotion allocations. For example, we can add a
GFP_DEMOTE flag and update the allocation function to wake up kswapd
when this flag is set and we need to fall back to another node.

> This isn't hard to fix: just call __alloc_pages_nodemask() for each
> tier one by one, following the page allocation fallback order.

That would work, except that an earlier example shows a case where
some nodes actually prefer to demote to tier + 2 rather than
tier + 1.

More specifically, the example is:

                 20
   Node 0 (DRAM) -- Node 1 (DRAM)
    |   |           |    |
    |   | 30    120 |    |
    |   v           v    | 100
100 |  Node 2 (PMEM)     |
    |    |               |
    |    | 100           |
     \   v               v
      -> Node 3 (Large Mem)

Node distances:
node   0    1    2    3
   0  10   20   30  100
   1  20   10  120  100
   2  30  120   10  100
   3 100  100  100   10

3 memory tiers are defined:
tier 0: 0-1
tier 1: 2
tier 2: 3

The demotion fallback order is:
node 0: 2, 3
node 1: 3, 2
node 2: 3
node 3: empty

Note that even though node 3 is in tier 2 and node 2 is in tier 1,
node 1 (tier 0) still prefers node 3 as its first demotion target, not
node 2.
