From: Jerome Glisse <jglisse@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Fan Du <fan.du@intel.com>, Yao Yuan <yuan.yao@intel.com>,
	Peng Dong <dongx.peng@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Liu Jingqi <jingqi.liu@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Zhang Yi <yi.z.zhang@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Mel Gorman <mgorman@suse.de>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Thu, 10 Jan 2019 13:02:15 -0500
Message-ID: <20190110180215.GE4394@redhat.com>
In-Reply-To: <20190110165001.GP31793@dhcp22.suse.cz>

On Thu, Jan 10, 2019 at 05:50:01PM +0100, Michal Hocko wrote:
> On Thu 10-01-19 11:25:56, Jerome Glisse wrote:
> > On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> > > [Ccing Mel and Andrea]
> > > 
> > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > > > Memory) was trying to do two years ago and there was quite some
> > > > > > > skepticism in the approach.
> > > > > > 
> > > > > > > It looks like we are pretty different from CDM. :)
> > > > > > > We are creating new NUMA nodes rather than CDM's new ZONE.
> > > > > > The zonelists modification is just to make PMEM nodes more separated.
> > > > > 
> > > > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > > > reachable without explicit request AFAIR. So no, I do not think you are
> > > > > too different, you just use a different terminology ;)
> > > > 
> > > > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > > > 
> > > > In long term POV, Linux should be prepared for multi-level memory.
> > > > Then there will arise the need to "allocate from this level memory".
> > > > So it looks good to have separated zonelists for each level of memory.
> > > 
> > > Well, I do not have a good answer for you here. We do not have good
> > > experiences with those systems, I am afraid. NUMA is with us for more
> > > than a decade yet our APIs are coarse to say the least and broken at so
> > > many times as well. Starting a new API just based on PMEM sounds like a
> > > ticket to another disaster to me.
> > > 
> > > I would like to see solid arguments why the current model of numa nodes
> > > with fallback in distances order cannot be used for those new
> > > technologies in the beginning and develop something better based on our
> > > experiences that we gain on the way.
> > 
> > I see several issues with distance. First, it does fully abstract the
> > underlying topology and this might be problematic. For instance, if
> > you have memory with different characteristics in the same node, like
> > persistent memory connected to some CPU, then it might be faster for
> > that CPU to access that persistent memory, as it has a dedicated link
> > to it, than to access some other remote memory for which the CPU might
> > have to share the link with other CPUs or devices.
> > 
> > Second, distance is no longer easy to compute when you are not trying
> > to answer what is the fastest memory for CPU-N but rather asking what
> > is the fastest memory for CPU-N and device-M, i.e. when you are trying to
> > find the best memory for a group of CPUs/devices. The answer can
> > change drastically depending on the members of the group.
> 
> While you might be right, I would _really_ appreciate starting with a
> simpler model and going to a more complex one based on real HW and real
> experiences rather than starting with an overly complicated and
> over-engineered approach from scratch.
> 
> > Some advanced programmers already do graph matching, i.e. they match
> > the graph of their program's dataset/computation with the topology graph
> > of the computer they run on to determine the best placement both
> > for threads and memory.
> 
> And those can still use our mempolicy API to describe their needs. If
> existing API is not sufficient then let's talk about which pieces are
> missing.

I understand people don't want the full topology thing, but device memory
cannot be exposed as a NUMA node. Hence at the very least we need something
that is not NUMA-node only, and most likely an API that does not use a
bitmask as the front-facing userspace API. So some kind of UID for memory,
one for each type of memory on each node (and also for each device memory).
It can be a 1-to-1 match with the NUMA node id for all regular NUMA node
memory, with extra ids for device memory (for instance by setting the high
bit on the UID for device memory).
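
A minimal sketch of such a UID encoding, purely for illustration. Nothing
below is an existing kernel or userspace API: the type mem_uid_t, the macro
MEM_UID_DEVICE_BIT, the helper names and the 32-bit width are all
assumptions. Regular NUMA memory keeps its node id as the UID, and device
memory gets the high bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding, not an existing ABI: the high bit marks device
 * memory, the low 31 bits carry a NUMA node id or a device-memory index. */
#define MEM_UID_DEVICE_BIT (UINT32_C(1) << 31)

typedef uint32_t mem_uid_t;

/* Regular NUMA memory: the UID is a 1-to-1 match with the node id. */
static inline mem_uid_t mem_uid_from_node(uint32_t nid)
{
    assert((nid & MEM_UID_DEVICE_BIT) == 0);
    return nid;
}

/* Device memory: same index space, distinguished by the high bit. */
static inline mem_uid_t mem_uid_from_device(uint32_t index)
{
    assert((index & MEM_UID_DEVICE_BIT) == 0);
    return index | MEM_UID_DEVICE_BIT;
}

static inline bool mem_uid_is_device(mem_uid_t uid)
{
    return (uid & MEM_UID_DEVICE_BIT) != 0;
}

/* Strip the type bit to recover the node id or device-memory index. */
static inline uint32_t mem_uid_index(mem_uid_t uid)
{
    return uid & ~MEM_UID_DEVICE_BIT;
}
```

With this layout a caller would name one memory by a single integer rather
than by a bit position in a mask, so the identifier stays the same size no
matter how many devices exist.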


> > > I would be especially interested about a possibility of the memory
> > > migration idea during a memory pressure and relying on numa balancing to
> > > resort the locality on demand rather than hiding certain NUMA nodes or
> > > zones from the allocator and expose them only to the userspace.
> > 
> > For device memory we have more things to think of, like:
> >     - memory not accessible by the CPU
> >     - non-cache-coherent memory (yet still useful in some cases if the
> >       application explicitly asks for it)
> >     - device drivers want to keep full control over memory, as older
> >       applications, like graphics on GPUs, do need contiguous physical
> >       memory and other tight control over physical memory placement
> 
> Again, I believe that HMM is to target those non-coherent or
> non-accessible memory and I do not think it is helpful to put them into
> the mix here.

HMM is the kernel plumbing; it does not expose anything to userspace.
While right now for nouveau the plan is to expose an API through nouveau
ioctls, this does not scale/work for multiple devices or when you mix
and match different devices. A single API that can handle both device
memory and regular memory would be much more useful. Long term at least,
that's what I would like to see.


> > So if we are talking about something to replace NUMA i would really
> > like for that to be inclusive of device memory (which can itself be
> > a hierarchy of different memory with different characteristics).
> 
> I think we should build on the existing NUMA infrastructure we have.
> Developing something completely new is not going to happen anytime soon
> and I am not convinced the result would be that much better either.

The issue with NUMA is that I do not see a way to add device memory as a
node, as the memory needs to be fully managed by the device driver. Also
the number of nodes might get out of hand (think 32 devices per CPU,
so with 1024 CPUs that's 2^15 max nodes ...), which leads to a node mask
taking a full page.
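
The arithmetic behind that estimate can be sketched as follows. The helper
names are made up for illustration, and "a full page" is read here as a
4 KiB page, which is an assumption about the architecture: 1024 CPUs times
32 devices gives 32768 (2^15) nodes, and one bit per node makes a mask of
32768 / 8 = 4096 bytes.

```c
#include <stddef.h>

#define BITS_PER_BYTE 8

/* Upper bound on node count if every device on every CPU becomes a node. */
static inline size_t max_nodes_for(size_t cpus, size_t devices_per_cpu)
{
    return cpus * devices_per_cpu;
}

/* Bytes needed for a bitmask with one bit per node, rounded up. */
static inline size_t nodemask_bytes(size_t max_nodes)
{
    return (max_nodes + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
}
```

Here nodemask_bytes(max_nodes_for(1024, 32)) evaluates to 4096, and a mask
of that size would be needed wherever a node bitmask is passed or stored.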

Also the whole NUMA access tracking does not work with devices (it can
be added but right now it is nonexistent). Forcing page faults to track
access is highly disruptive for GPUs, while the hw can provide much better
information without faulting, and CPU counters might also be something we
want to use rather than faulting.

I am not saying something new will solve all the issues we have today
with NUMA; actually I don't believe we can solve all of them. But it
could at least be more flexible in terms of what memory a program can
bind to.

Cheers,
Jérôme
