linux-mm.kvack.org archive mirror
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Fan Du <fan.du@intel.com>, Yao Yuan <yuan.yao@intel.com>,
	Peng Dong <dongx.peng@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Liu Jingqi <jingqi.liu@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Zhang Yi <yi.z.zhang@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Mel  Gorman <mgorman@suse.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-accelerators@lists.ozlabs.org
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Wed, 2 Jan 2019 12:21:10 +0000	[thread overview]
Message-ID: <20190102122110.00000206@huawei.com> (raw)
In-Reply-To: <20181228195224.GY16738@dhcp22.suse.cz>

On Fri, 28 Dec 2018 20:52:24 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.  
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.  
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)  
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.  
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.
> 
> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.

This is indeed a very interesting direction.  I'm coming at this from a CCIX
point of view.  Ignore the next bit if you are already familiar with CCIX :)

The main thing CCIX brings is that memory can be fully coherent anywhere in
the system, including out near accelerators, all via a shared physical address
space, leveraging ATS / IOMMUs / MMUs to do the translations.  The result is a
big and possibly extremely heterogeneous NUMA system.  All the setup is done in
firmware, so by the time the kernel sees it everything is already described in
SRAT / SLIT / NFIT / HMAT etc.
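
As an aside (nothing to do with this series): the easiest way to see what
those tables ended up describing is the distance information the kernel
already exports under sysfs.  A minimal sketch, assuming nothing beyond a
standard NUMA-enabled kernel:

/* dump the SLIT-derived distances for every node present */
#include <stdio.h>

int main(void)
{
	char path[64], buf[256];
	FILE *f;
	int node;

	for (node = 0; node < 1024; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */
		if (fgets(buf, sizeof(buf), f))
			printf("node%d distances: %s", node, buf);
		fclose(f);
	}
	return 0;
}

The far nodes simply show up with large distances here, which is what a
distance-ordered fallback would key off.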

We have a few use cases that need some more fine-grained control combined with
automated balancing.  So far we've been messing with nasty tricks like
hotplugging the distant memory after boot, or carrying the original CDM zone
patches (knowing they weren't likely to go anywhere!).  Userspace is all
hand-tuned, which is not great in the long run...

Use cases (I've probably missed some):

* Storage Class Memory near to the host CPU / DRAM controllers (pretty much
  the same case as this series is considering).  Note that there isn't
  necessarily any 'pairing' with host DRAM as seen in this RFC.  A typical
  system might have a single large pool with similar access characteristics
  from each host SoC, although the paired approach is probably going to be
  common in early systems.  It also isn't necessarily non-volatile; it could
  just be a big DDR expansion board.

* RAM out near an accelerator.  The aim would be to migrate data to that RAM
  if the access patterns from the accelerator justify it being there rather
  than near any of the host CPUs.  Under memory pressure on the host, anything
  could be pushed out there, as that is probably still better than swapping.
  (A userspace sketch of this sort of migration follows after this list.)
  Note that this would require some knowledge of 'who' is doing the accessing,
  which isn't needed for what this RFC is doing.

* Hot pages may not be hot just because the host is using them a lot.  It
  would be very useful to have a means of adding in information available
  from accelerators, beyond simple accessed bits (dreaming ;)  One problem
  here is translation caches (ATCs), as hits in them won't normally result in
  any updates to the page accessed bits.  The Arm SMMUv3 spec, for example,
  makes it clear (though it's kind of obvious) that the ATS request is the
  only opportunity to update the accessed bit.  The nasty option here would be
  to periodically flush the ATC to force access bit updates via repeats of the
  ATS request (ouch).  That option only works if the IOMMU supports updating
  the accessed flag (optional on SMMUv3, for example).
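
To make the accelerator-RAM point above a bit more concrete, this is the kind
of thing userspace can already do today with move_pages(2) once it has decided
(by whatever means) that a buffer belongs out on the accelerator's node.  Only
an illustrative sketch: the node id is an assumption, and the hard part of
knowing 'who' is accessing is exactly what it doesn't solve.

#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TARGET_NODE 2		/* assumption: node id of the accelerator RAM */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	enum { NPAGES = 16 };
	char *buf = aligned_alloc(page_size, NPAGES * page_size);
	void *pages[NPAGES];
	int nodes[NPAGES], status[NPAGES];
	int i;

	if (!buf)
		return 1;

	for (i = 0; i < NPAGES; i++) {
		buf[i * page_size] = 1;		/* fault the page in somewhere */
		pages[i] = buf + i * page_size;
		nodes[i] = TARGET_NODE;
	}

	/* pid 0 == this process; MPOL_MF_MOVE only moves pages mapped once */
	if (move_pages(0, NPAGES, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		for (i = 0; i < NPAGES; i++)
			printf("page %d now on node %d\n", i, status[i]);

	free(buf);
	return 0;
}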

We need the explicit placement, but we can get that from existing NUMA
controls.  More of a concern is persuading the kernel that it really doesn't
want to put its own data structures in the distant memory, as it can be very,
very distant.
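
For reference, the sort of existing control I mean is just mbind(2) /
set_mempolicy(2) style placement.  A hedged sketch, with the far node id
picked arbitrarily:

#include <numaif.h>		/* mbind(); link with -lnuma */
#include <stdio.h>
#include <sys/mman.h>

#define FAR_NODE 1		/* assumption: node id of the far/slow memory */

int main(void)
{
	size_t len = 64UL << 20;		/* 64 MiB anonymous buffer */
	unsigned long nodemask = 1UL << FAR_NODE;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* all future faults in [buf, buf + len) must come from FAR_NODE */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	printf("buffer placed on node %d\n", FAR_NODE);
	return 0;
}

That works for userspace; it is the kernel's own allocations that have no
equivalent knob today.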

So ideally I'd love this set to head in a direction that helps me tick off
at least some of the above use cases, and hopefully gives some visibility on
how to address the others moving forwards.

Good to see some new thoughts in this area!

Jonathan
> 
> > On the other hand, there will also be page allocations that don't care
> > about the exact memory level. So it looks reasonable to expect
> > different kind of fallback zonelists that can be selected by NUMA policy.
> > 
> > Thanks,
> > Fengguang  
> 
