From: Alistair Popple <apopple@nvidia.com>
To: Wei Xu <weixugc@google.com>,
	"ying.huang@intel.com" <ying.huang@intel.com>
Cc: Yang Shi <shy828301@gmail.com>,
	Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Greg Thelen <gthelen@google.com>, Michal Hocko <mhocko@kernel.org>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
Date: Fri, 29 Apr 2022 14:45:27 +1000	[thread overview]
Message-ID: <6564912.CuQ3haaViz@nvdebian> (raw)
In-Reply-To: <cd496b0854d963064e0ae4e2d219d1ed63c13b68.camel@intel.com>

On Friday, 29 April 2022 1:27:36 PM AEST ying.huang@intel.com wrote:
> On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote:
> > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > > 
> > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > > > > 
> > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > > 
> > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > > > 
> > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ....
> > > > > > > > > 
> > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > > > > 
> > > > > > > > > > > Nodes 0 & 2 are cpu + dram nodes and node 1 is a slow
> > > > > > > > > > > memory node near node 0,
> > > > > > > > > > > 
> > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > node distances:
> > > > > > > > > > > node   0   1   2
> > > > > > > > > > >    0:  10  40  20
> > > > > > > > > > >    1:  40  10  80
> > > > > > > > > > >    2:  20  80  10
> > > > > > > > > > > 
> > > > > > > > > > > We have 2 choices,
> > > > > > > > > > > 
> > > > > > > > > > > a)
> > > > > > > > > > > node    demotion targets
> > > > > > > > > > > 0       1
> > > > > > > > > > > 2       1
> > > > > > > > > > > 
> > > > > > > > > > > b)
> > > > > > > > > > > node    demotion targets
> > > > > > > > > > > 0       1
> > > > > > > > > > > 2       X
> > > > > > > > > > > 
> > > > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > > > traffic.  Both are OK as default configuration.  But some users may
> > > > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > > > default configuration.
> > > > > > > > > > 
> > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > > > > 
> > > > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > > > > 
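
For illustration, the initialisation Wei describes here - every fast node
demotes to the nodes in the next lower tier, sorted by distance per source
node - could be built roughly as in the sketch below, reusing the 3-node
distance table quoted above. This is only a userspace-style sketch of the
idea, not the kernel's demotion-order code.

/* Illustrative sketch only (not the kernel's demotion code): list every
 * node in the next lower tier in increasing-distance order from each
 * fast node.  The table matches the 3-node example quoted above, where
 * node 1 is the only slow (PMEM) node. */
#include <stdio.h>

#define NR_NODES 3

static const int distance[NR_NODES][NR_NODES] = {
        { 10, 40, 20 },
        { 40, 10, 80 },
        { 20, 80, 10 },
};
static const int slow_node[NR_NODES] = { 0, 1, 0 };     /* node 1 is PMEM */

int main(void)
{
        for (int src = 0; src < NR_NODES; src++) {
                if (slow_node[src])
                        continue;       /* lowest tier: nothing to demote to */
                printf("node %d demotion order:", src);
                /* SLIT distances are bounded by 255, so a simple
                 * distance sweep yields a sorted target list */
                for (int d = 0; d <= 255; d++)
                        for (int t = 0; t < NR_NODES; t++)
                                if (slow_node[t] && distance[src][t] == d)
                                        printf(" %d", t);
                printf("\n");
        }
        return 0;
}

With this table the sketch reproduces choice (a) above (nodes 0 and 2 both
demote to node 1); getting (b) would need the per-node override or a
mempolicy restriction.
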
> > > > > > > > > 
> > > > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > > > by that application
> > > > > > > > > to Node10? The alternative to that demotion is swapping. So from
> > > > > > > > > the page's point of view, we either demote to slow memory or page out to
> > > > > > > > > swap. But then if we demote, we are also breaking the MPOL_BIND rule.
> > > > > > > > 
> > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > > > tool for applications to override and control their memory placement
> > > > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > > > > 
> > > > > > > > 
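
For what it's worth, the case being discussed corresponds to something like
this in userspace (node numbers follow Aneesh's example above; mbind() comes
from libnuma's <numaif.h> and error handling is trimmed):

#include <numaif.h>             /* mbind(), MPOL_BIND (libnuma, link with -lnuma) */
#include <sys/mman.h>

int main(void)
{
        unsigned long len = 64UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = (1UL << 0) | (1UL << 1);       /* nodes 0 and 1 */

        /* Bind the range to nodes 0-1.  With the behaviour described
         * above, reclaim would skip demoting these pages to any node
         * outside the mask (e.g. a slow node 10) and use swap instead. */
        mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
        /* ... fault in memory and run the workload ... */
        return 0;
}
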
> > > > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > > > demotion path.
> > > > > > > > 
> > > > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > > > page_referenced() similar to vm_flags.
> > > > > > > 
> > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > > > struct task_struct.  Multiple threads in a process may have different
> > > > > > > mempolicies.
> > > > > > 
> > > > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > > > task_struct, which has the process mempolicy.
> > > > > > 
> > > > > > It is indeed a problem when a page is shared by different threads or
> > > > > > different processes that have different thread default mempolicy
> > > > > > values.
> > > > > 
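
A rough kernel-style sketch of the chain Wei describes, purely for
illustration: it assumes CONFIG_MEMCG (so mm->owner exists) and CONFIG_NUMA,
omits the RCU/locking a real implementation would need, and inherits the
limitation just mentioned - it finds one owning task, not per-thread policies.

static struct mempolicy *owner_task_mempolicy(struct vm_area_struct *vma)
{
        struct task_struct *owner;

        if (!vma->vm_mm)
                return NULL;
        /* mm->owner is only available with CONFIG_MEMCG */
        owner = rcu_dereference(vma->vm_mm->owner);
        return owner ? owner->mempolicy : NULL;
}
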
> > > > > Sorry for chiming in late, this is a known issue when we were working
> > > > > on demotion. Yes, it is hard to handle shared pages and multiple
> > > > > threads, since mempolicy is applied per thread and each thread may
> > > > > have a different mempolicy. And I don't think this case is rare. Not
> > > > > only mempolicy but also cpuset settings can cause a similar problem:
> > > > > different threads may have different cpuset settings under cgroup v1.
> > > > > 
> > > > > If this is really a problem for real life workloads, we may consider
> > > > > tackling it for exclusively owned pages first. Thanks to David's
> > > > > patches, now we have dedicated flags to identify exclusively owned pages.
> > > > 
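
If it helps, I imagine the "exclusively owned pages first" idea as something
like the check below. PageAnonExclusive() is from David's anon-exclusive work
mentioned above; whether demotion would hook in exactly like this is an open
question in this thread, not existing code.

/* Sketch only: restrict any mempolicy-aware demotion decision to pages
 * a single task exclusively owns, so that task's policy can be trusted. */
static bool single_owner_policy_applies(struct folio *folio)
{
        return folio_test_anon(folio) && PageAnonExclusive(&folio->page);
}
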
> > > > One of the problems with demotion when I last looked was that it does almost exactly
> > > > the opposite of what we want on systems like POWER9 where GPU memory is a
> > > > CPU-less memory node.
> > > > 
> > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > > > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > > > allocations to the CPU node and finally other slow memory nodes or swap.
> > > > 
> > > > Currently though, demotion considers the GPU node to be slow memory (because it is
> > > > CPU-less) and so will demote CPU memory to GPU memory, which is a limited resource.
> > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > > > everything to disk rather than demote to CPU memory (which would be preferred).
> > > > 
> > > > I'm still looking at this series but, as I understand it, it will help somewhat
> > > > because we could make GPU memory the top-tier so nothing gets demoted to it.
> > > 
> > > Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> > > CPU+DRAM in tier 1, your requirement can be satisfied.  One way is to
> > > override the auto-generated demotion order via some user space tool.
> > > Another way is to change the GPU driver (I guess where the GPU memory is
> > > enumerated and onlined?) to change the tier of the GPU memory node.

Yes, although I think in this case it would be firmware that determines memory
tiers (similar to ACPI HMAT, which I saw discussed somewhere here). I agree,
though, that it's a system-level property that in an ideal world shouldn't need
overriding from userspace. However, being able to override it with a user-space
tool could still be useful.

> > > > However I wouldn't want to see demotion skipped entirely when a memory policy
> > > > such as MPOL_BIND is specified. For example, most memory on a GPU node will have
> > > > some kind of policy specified and IMHO it would be better to demote to another
> > > > node in the mempolicy nodemask rather than going straight to swap, particularly
> > > > as GPU memory capacity tends to be limited in comparison to CPU memory
> > > > capacity.
> > > > > 
> > > 
> > > Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> > > possible, we will not stop demoting from GPU to DRAM with
> > > MPOL_PREFERRED.  And in addition to demotion, allocation fallback can
> > > also be used to avoid the allocation latency caused by demotion.

I think so. It's been a little while since I last looked at this, but I was
under the impression that MPOL_PREFERRED didn't do direct reclaim (and
therefore wouldn't trigger demotion, so once GPU memory was full it effectively
became a no-op). However, looking at the source I don't think that's the case
now - if I'm understanding correctly, MPOL_PREFERRED will do reclaim/demotion.
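
As an aside, the sort of policy I have in mind for the GPU case is just this
(the node number is made up for the example; set_mempolicy() is the
libnuma/<numaif.h> wrapper and error handling is trimmed):

#include <numaif.h>             /* set_mempolicy(), MPOL_PREFERRED */

/* Prefer the (CPU-less) GPU node for this thread's allocations while
 * still allowing fallback - and, ideally, demotion - to other nodes. */
static int prefer_gpu_node(int gpu_node)
{
        unsigned long nodemask = 1UL << gpu_node;

        return set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8);
}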

The other problem with MPOL_PREFERRED is that it doesn't allow the fallback
nodes to be specified. I was hoping the new MPOL_PREFERRED_MANY and
set_mempolicy_home_node() would help here, but currently they disable
reclaim (and therefore demotion) in the first allocation pass.
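
For reference, this is roughly how I would expect those two newer interfaces
to be combined for a range. The MPOL_PREFERRED_MANY value and the
set_mempolicy_home_node() syscall number below are my understanding of the
current x86_64 numbers, so treat them as assumptions; there is no glibc
wrapper for the latter yet:

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5                   /* Linux 5.15+ */
#endif
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450        /* x86_64, Linux 5.17+ */
#endif

/* Prefer a set of nodes for [addr, addr+len) and pick which of them is
 * tried first ("home") - e.g. prefer {GPU, CPU} with the GPU as home. */
static int prefer_many_with_home(void *addr, unsigned long len,
                                 unsigned long nodemask, int home_node)
{
        if (mbind(addr, len, MPOL_PREFERRED_MANY, &nodemask,
                  sizeof(nodemask) * 8, 0))
                return -1;
        return syscall(__NR_set_mempolicy_home_node,
                       (unsigned long)addr, len, home_node, 0);
}
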

However, that problem is tangential to this series and I can look at it
separately. My main aim here, given you were looking at requirements, was just
to raise this as a slightly different use case (one where the CPU isn't the top
tier).
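
For concreteness, the behaviour I'm arguing for - demote within the
mempolicy's nodemask and only fall back to swap when no allowed lower-tier
node exists - could look something like the sketch below. It is not real
kernel code: next_demotion_node() is the existing helper, while the nodemask
filtering and the use of this series' N_DEMOTION_TARGETS are hypothetical.

static int allowed_demotion_target(int src_nid, const nodemask_t *allowed)
{
        int target = next_demotion_node(src_nid);
        int nid;

        if (target == NUMA_NO_NODE)
                return NUMA_NO_NODE;            /* no lower tier: reclaim/swap */
        if (!allowed || node_isset(target, *allowed))
                return target;                  /* default target is permitted */

        /* default target excluded by policy: try any allowed lower-tier node */
        for_each_node_mask(nid, *allowed)
                if (node_state(nid, N_DEMOTION_TARGETS))
                        return nid;

        return NUMA_NO_NODE;                    /* nothing allowed: reclaim/swap */
}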

Thanks for looking into all this.

 - Alistair

> > I expect that MPOL_BIND can be used to either prevent demotion or
> > select a particular demotion node/nodemask. It all depends on the
> > mempolicy nodemask specified by MPOL_BIND.
> 
> Yes.  I think so too.
> 
> Best Regards,
> Huang, Ying
> 
> > > This is another example of a system with 3 tiers if PMEM is installed in
> > > this machine too.
> > > 
> > > Best Regards,
> > > Huang, Ying
> > > 
> > > > > > On the other hand, it can already support most interesting use cases
> > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > > > demotion) by respecting cpuset and vma mempolicies.
> > > > > > 
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > > > node (i.e. local writes to the target node and remote reads from the
> > > > > > > > > > source node).  The bigger issue is cross-socket memory access to the
> > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > > > important here.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > -aneesh

Thread overview: 69+ messages
2022-04-13  9:22 [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS Jagdish Gediya
2022-04-13  9:22 ` [PATCH v2 1/5] mm: demotion: Set demotion list differently Jagdish Gediya
2022-04-14  7:09   ` ying.huang
2022-04-14  8:48     ` Jagdish Gediya
2022-04-14  8:57       ` ying.huang
2022-04-14  8:55   ` Baolin Wang
2022-04-14  9:02   ` Jonathan Cameron
2022-04-14 10:40     ` Jagdish Gediya
2022-04-21  6:13   ` ying.huang
2022-04-13  9:22 ` [PATCH v2 2/5] mm: demotion: Add new node state N_DEMOTION_TARGETS Jagdish Gediya
2022-04-21  4:33   ` Wei Xu
2022-04-13  9:22 ` [PATCH v2 3/5] mm: demotion: Add support to set targets from userspace Jagdish Gediya
2022-04-21  4:26   ` Wei Xu
2022-04-22  9:13     ` Jagdish Gediya
2022-04-21  5:31   ` Wei Xu
2022-04-13  9:22 ` [PATCH v2 4/5] device-dax/kmem: Set node state as N_DEMOTION_TARGETS Jagdish Gediya
2022-04-13  9:22 ` [PATCH v2 5/5] mm: demotion: Build demotion list based on N_DEMOTION_TARGETS Jagdish Gediya
2022-04-13 21:44 ` [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS Andrew Morton
2022-04-14 10:16   ` Jagdish Gediya
2022-04-14  7:00 ` ying.huang
2022-04-14 10:19   ` Jagdish Gediya
2022-04-21  3:11   ` Yang Shi
2022-04-21  5:41     ` Wei Xu
2022-04-21  6:24       ` ying.huang
2022-04-21  6:49         ` Wei Xu
2022-04-21  7:08           ` ying.huang
2022-04-21  7:29             ` Wei Xu
2022-04-21  7:45               ` ying.huang
2022-04-21 18:26                 ` Wei Xu
2022-04-22  0:58                   ` ying.huang
2022-04-22  4:46                     ` Wei Xu
2022-04-22  5:40                       ` ying.huang
2022-04-22  6:11                         ` Wei Xu
2022-04-22  6:13                         ` Wei Xu
2022-04-22  6:21                           ` ying.huang
2022-04-22 11:00                             ` Jagdish Gediya
2022-04-22 16:43                               ` Wei Xu
2022-04-22 17:29                                 ` Yang Shi
2022-04-24  3:02                               ` ying.huang
2022-04-25  3:50                                 ` Aneesh Kumar K.V
2022-04-25  6:10                                   ` ying.huang
2022-04-25  8:09                                     ` Aneesh Kumar K V
2022-04-25  8:54                                       ` Aneesh Kumar K V
2022-04-25 20:17                                       ` Davidlohr Bueso
2022-04-26  8:42                                       ` ying.huang
2022-04-26  9:02                                         ` Aneesh Kumar K V
2022-04-26  9:44                                           ` ying.huang
2022-04-27  4:27                                         ` Wei Xu
2022-04-25  7:26                                 ` Jagdish Gediya
2022-04-25 16:56                                 ` Wei Xu
2022-04-27  5:06                                   ` Aneesh Kumar K V
2022-04-27 18:27                                     ` Wei Xu
2022-04-28  0:56                                       ` ying.huang
2022-04-28  4:11                                         ` Wei Xu
2022-04-28 17:14                                           ` Yang Shi
2022-04-29  1:27                                             ` Alistair Popple
2022-04-29  2:21                                               ` ying.huang
2022-04-29  2:58                                                 ` Wei Xu
2022-04-29  3:27                                                   ` ying.huang
2022-04-29  4:45                                                     ` Alistair Popple [this message]
2022-04-29 18:53                                                       ` Yang Shi
2022-04-29 18:52                                                   ` Yang Shi
2022-04-27  7:11                                   ` ying.huang
2022-04-27 16:27                                     ` Wei Xu
2022-04-28  8:37                                       ` ying.huang
2022-04-28 19:30                                         ` Chen, Tim C
2022-04-30  2:21                                           ` Wei Xu
2022-04-21 17:56       ` Yang Shi
2022-04-21 23:48         ` ying.huang
