From: Yang Shi <shy828301@gmail.com>
To: Wei Xu <weixugc@google.com>
Cc: "ying.huang@intel.com" <ying.huang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
Date: Thu, 21 Apr 2022 10:56:30 -0700
Message-ID: <CAHbLzkqrxTpWT9q9xavGF+HZQNeNp13OATj248fb1rfCGKTu8A@mail.gmail.com>
In-Reply-To: <CAAPL-u9=-OHuUk=ZkNRDf3Dm_+3cBd2APL5MQpQr3_sVk_voJg@mail.gmail.com>

On Wed, Apr 20, 2022 at 10:41 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > The current implementation finds demotion targets based on
> > > > > the node state N_MEMORY. However, some systems may have
> > > > > DRAM-only NUMA nodes which are N_MEMORY but are not the
> > > > > right choice as demotion targets.
> > > >
> > > > > This patch series introduces a new node state,
> > > > > N_DEMOTION_TARGETS, which distinguishes the nodes that can be
> > > > > used as demotion targets. node_states[N_DEMOTION_TARGETS]
> > > > > holds the list of such nodes. Support is also added to set
> > > > > the demotion target list from user space so that the default
> > > > > behavior can be overridden.
> > >
> > > > It appears that your proposed user space interface cannot solve all
> > > > problems.  For example, for a system as follows,
> > > >
> > > > Nodes 0 & 2 are cpu + dram nodes and node 1 is a slow memory node near
> > > > node 0,
> > >
> > > available: 3 nodes (0-2)
> > > node 0 cpus: 0 1
> > > node 0 size: n MB
> > > node 0 free: n MB
> > > node 1 cpus:
> > > node 1 size: n MB
> > > node 1 free: n MB
> > > node 2 cpus: 2 3
> > > node 2 size: n MB
> > > node 2 free: n MB
> > > node distances:
> > > node   0   1   2
> > >   0:  10  40  20
> > >   1:  40  10  80
> > >   2:  20  80  10
> > >
> > > Demotion order 1:
> > >
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              X
> > >
> > > Demotion order 2:
> > >
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              1
> > >
> > > > Demotion order 1 is preferred if we want to reduce cross-socket
> > > > traffic, while demotion order 2 is preferred if we want to take
> > > > full advantage of the slow memory node.  We can take either choice as
> > > > the automatically generated order, while making the other choice
> > > > possible via a user space override.
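
To make the two orders quoted above concrete, here is a minimal Python
sketch that derives both from the distance table. It is only an
illustration: the function names and the rule "a slow node belongs to
its nearest CPU node" are assumptions made for this sketch, not the
kernel's algorithm and not something proposed in this thread.

```python
# Distances copied from the node distance table quoted above.
DISTANCE = [
    [10, 40, 20],
    [40, 10, 80],
    [20, 80, 10],
]
CPU_NODES = {0, 2}   # cpu + dram nodes (fast tier)
SLOW_NODES = {1}     # CPU-less slow memory node


def nearest(src, candidates):
    """Return the candidate closest to src; ties go to the lowest node id."""
    return min(sorted(candidates), key=lambda n: DISTANCE[src][n])


def order_local_only():
    """Order 1 (assumed rule): a CPU node demotes only to slow nodes
    whose nearest CPU node is that CPU node, avoiding cross-socket traffic."""
    order = {n: None for n in range(len(DISTANCE))}
    for cpu in CPU_NODES:
        local = [s for s in SLOW_NODES if nearest(s, CPU_NODES) == cpu]
        if local:
            order[cpu] = nearest(cpu, local)
    return order


def order_any_slow():
    """Order 2 (assumed rule): every CPU node demotes to its nearest slow
    node, even if that node sits on another socket."""
    order = {n: None for n in range(len(DISTANCE))}
    for cpu in CPU_NODES:
        order[cpu] = nearest(cpu, SLOW_NODES)
    return order
```

With the table above, order_local_only() gives node 0 the target 1 and
leaves node 2 without a target, while order_any_slow() gives both node 0
and node 2 the target 1. None stands for the "X" entries.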
> > >
> > > I don't know how to implement this via your proposed user space
> > > interface.  How about the following user space interface?
> > >
> > > 1. Add a file "demotion_order_override" in
> > >         /sys/devices/system/node/
> > >
> > > 2. When read, "1" is output if the demotion order of the system has been
> > > overridden; "0" is output if not.
> > >
> > > > 3. When "1" is written, the demotion order of the system switches to
> > > > overridden mode.  When "0" is written, the demotion order of the system
> > > > switches to automatic mode and the demotion order is re-generated.
> > >
> > > 4. Add a file "demotion_targets" for each node in
> > >         /sys/devices/system/node/nodeX/
> > >
> > > 5. When read, the demotion targets of nodeX will be output.
> > >
> > > > 6. When a node list is written to the file, the demotion targets of
> > > > nodeX are set to the written nodes, and the demotion order of the
> > > > system switches to overridden mode.
> >
> > TBH I don't think overriding demotion targets from userspace is
> > very useful in real life for now (it might become useful in the
> > future, I can't tell). Imagine you manage hundreds of thousands of
> > machines, which may come from different vendors, have different
> > generations of hardware, and have different versions of firmware. It
> > would be a nightmare for the users to configure the demotion targets
> > properly. So it would be great to have the kernel configure them
> > properly *without* intervention from the users.
> >
> > So we should pick a proper default policy and stick with that
> > policy unless it doesn't work well for most workloads. I do
> > understand it is hard to make everyone happy. My proposal is that
> > every node in the fast tier has a demotion target (at least one) if
> > the slow tier exists, which sounds like a reasonable default policy.
> > I think this is also the current implementation.
> >
>
> This is reasonable.  I agree that with a decent default policy, the
> overriding of per-node demotion targets can be deferred.  The most
> important problem here is that we should allow the configurations
> where memory-only nodes are not used as demotion targets, which this
> patch set has already addressed.

Yes, I agree. Fixing the bug and allowing override by userspace are
two totally separate things.

>
> > >
> > > To reduce the complexity, the demotion order of the system is either in
> > > overridden mode or automatic mode.  When converting from the automatic
> > > mode to the overridden mode, the existing demotion targets of all nodes
> > > will be retained before being changed.  When converting from overridden
> > > mode to automatic mode, the demotion order of the system will be re-
> > > generated automatically.
> > >
> > > > In overridden mode, the demotion targets of a hot-added or
> > > > hot-removed node will be set to empty.  And the hot-removed node will
> > > > be removed from the demotion targets of any node.
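
The quoted interface and its mode-switching rules can be modeled in a
few lines of Python. This is a toy model only: "demotion_order_override"
and "demotion_targets" are the files proposed in the quoted text, not an
existing kernel ABI, and the class and method names are hypothetical.
Hot-plug in automatic mode is not described in the proposal, so it is
left unmodeled.

```python
class DemotionModel:
    """Toy model of the proposed sysfs semantics; not kernel code."""

    def __init__(self, auto_order):
        # auto_order: callable returning a fresh {node: set_of_targets}
        # dict, standing in for the kernel's automatic order generation.
        self._auto_order = auto_order
        self.overridden = False        # models demotion_order_override
        self.targets = auto_order()    # models nodeX/demotion_targets

    def read_override(self):
        return "1" if self.overridden else "0"

    def write_override(self, value):
        if value == "1":
            # Existing targets are retained when entering overridden mode.
            self.overridden = True
        elif value == "0":
            # Back to automatic mode: the order is re-generated.
            self.overridden = False
            self.targets = self._auto_order()
        else:
            raise ValueError("expected '0' or '1'")

    def write_targets(self, node, nodes):
        # Writing a node list also switches the system to overridden mode.
        self.targets[node] = set(nodes)
        self.overridden = True

    def hot_remove(self, node):
        if not self.overridden:
            raise NotImplementedError("automatic mode is not modeled")
        # The removed node gets an empty target list and is dropped
        # from the target lists of all other nodes.
        self.targets[node] = set()
        for node_targets in self.targets.values():
            node_targets.discard(node)
```

For the three-node example, starting from demotion order 1 and calling
write_targets(2, [1]) yields demotion order 2 and leaves the model in
overridden mode; write_override("0") restores the automatic order.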
> > >
> > > > This is an extension of the interface used in the following patch,
> > >
> > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
> > >
> > > What do you think about this?
> > >
> > > > > The node state N_DEMOTION_TARGETS is also set from the dax kmem
> > > > > driver. Certain types of memory which register through dax kmem
> > > > > (e.g. HBM) may not be the right choice for demotion, so in future
> > > > > they should be distinguished based on certain attributes and the
> > > > > dax kmem driver should avoid setting them as N_DEMOTION_TARGETS.
> > > > > However, the current implementation doesn't distinguish any
> > > > > such memory either and considers all N_MEMORY nodes as demotion
> > > > > targets, so this patch series doesn't modify the current behavior.
> > > >
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > [snip]
> > >
