From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Bob Liu <liubo95@huawei.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, John Hubbard <jhubbard@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	David Nellans <dnellans@nvidia.com>,
	Balbir Singh <bsingharora@gmail.com>,
	majiuyue <majiuyue@huawei.com>,
	"xieyisheng (A)" <xieyisheng1@huawei.com>,
	ross.zwisler@linux.intel.com, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@kernel.org>
Subject: Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
Date: Tue, 5 Sep 2017 13:00:13 -0600
Message-ID: <20170905190013.GC24073@linux.intel.com>
In-Reply-To: <20170905135017.GA19397@redhat.com>

On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
> > >>>>> Unlike unaddressable memory, coherent device memory has a real
> > >>>>> resource associated with it on the system (as CPU can address
> > >>>>> it). Add a new helper to hotplug such memory within the HMM
> > >>>>> framework.
> > >>>>>
> > >>>>
> > >>>> Got a new question: coherent device (e.g. CCIX) memory is likely reported to the OS
> > >>>> through ACPI and recognized as a NUMA memory node.
> > >>>> Then how can that memory be captured and managed by the HMM framework?
> > >>>>
> > >>>
> > >>> The only platform that has such memory today is powerpc, and it is not reported
> > >>> as regular memory by the firmware, which is why they need this helper.
> > >>>
> > >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> > >>
> > >> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> > >> Table (HMAT) defined in ACPI 6.2.
> > >> The HMAT can cover CPU-addressable memory types (though not non-cache
> > >> coherent on-device memory).
> > >>
> > >> Ross from Intel has already done some work on this, see:
> > >> https://lwn.net/Articles/724562/
> > >>
> > >> arm64 supports ACPI as well, and there will likely be more devices of this kind when CCIX
> > >> is out (should be very soon if on schedule).
> > > 
> > > HMAT is not for the same thing. AFAIK HMAT is for a deep memory "hierarchy", i.e.
> > > when you have several kinds of memory, each with different characteristics:
> > >   - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
> > >     small (i.e. a few gigabytes)
> > >   - Persistent memory: slower (both latency and bandwidth), big (terabytes)
> > >   - DDR (good old memory): characteristics are between HBM and persistent memory
> > > 
> > 
> > Okay, then how does the kernel handle the situation of "kind of memory each with different characteristics"?
> > Does anyone have any suggestions?  I thought HMM could do this.
> > NUMA policy/node distance is good but perhaps requires some extending, e.g. an HBM node can't be
> > swapped and can't accept DDR fallback allocation.
> 
> I don't think there is any consensus on this. I put forward the idea that NUMA
> needs to be extended, since with a deep hierarchy it is not only the distance between
> two nodes that matters but also other factors like persistence, bandwidth, latency ...
> 
> 
> > > So AFAICT this has nothing to do with what HMM is for, i.e. device memory. Note
> > > that device memory can have a memory hierarchy of its own (HBM, GDDR and
> > > maybe even persistent memory).
> > > 
> > 
> > This looks like a subset of HMAT when the CPU can address device memory directly in a cache-coherent way.
> 
> It is not, it is much more complex than that. The Linux kernel has no idea what is
> going on inside a device and thus does not have any useful information to make proper
> decisions regarding device memory. Here a device is a real device, i.e. something with
> processing capability, not something like HBM or persistent memory, even if the
> latter is associated with a struct device inside the Linux kernel.
> 
> > 
> > 
> > >>> memory on a PCIE-like interface, I don't expect it to be reported as a NUMA
> > >>> memory node but as an I/O range like any regular PCIE resource. The device driver,
> > >>> through capability flags, would then figure out whether the link between the
> > >>> device and CPU is CCIX capable; if so, it can use this helper to hotplug it
> > >>> as device memory.
> > >>>
> > >>
> > >> From my point of view, cache coherent device memory will be popular soon and
> > >> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> > >> to me.
> > > 
> > > Cache coherent devices will be reported through the standard mechanisms defined by
> > > the bus standard they are using. To my knowledge all the standards are either
> > > on top of PCIE or are similar to PCIE.
> > > 
> > > It is true that on many platforms PCIE resources are managed/initialized by the
> > > BIOS (UEFI), but that is platform specific. In some cases we reprogram what the
> > > BIOS picked.
> > > 
> > > So like i was saying i don't expect the BIOS/UEFI to report device memory as
> > 
> > But it's happening.
> > In my understanding, that's why HMAT was introduced.
> > For reporting device memory as regular memory (with different characteristics).
> 
> That is not my understanding, but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory, which I do not consider device
> memory. Sure, persistent memory is assigned a device struct because it is easier
> for integration with the block system, I assume. But that does not make it a
> device in my view. For me a device is a piece of hardware that has some
> processing capabilities (network adapter, sound card, GPU, ...)
> 
> But we can argue about semantics and what a device is. For all intents and purposes,
> a device in the HMM context is some piece of hardware with processing capabilities and
> local device memory.

I personally don't see a reason why we couldn't use the HMAT to describe
device memory.  The idea of having memory-only NUMA nodes is already a reality
post-HMAT, and the HMAT is just there to give you information on the memory
ranges in the system.  I realize that you may need a different device driver
to set the memory up, but once you do set it up and it's cache coherent,
doesn't it just look like any other memory range where you can say things
like:

My memory starts at X
My memory has size Y
My memory's performance from CPU Z is XXX (latency, bandwidth, read & write)
etc?
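
A rough sketch in C of the kind of per-range description listed above.
The struct, field and function names are hypothetical and chosen only for
illustration: this is not the HMAT binary layout from the ACPI spec, nor
any existing kernel interface, and the units are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one memory range (illustration only). */
struct mem_range_attrs {
	uint64_t start;             /* "My memory starts at X" */
	uint64_t size;              /* "My memory has size Y" */
	unsigned int initiator;     /* CPU Z the figures below are seen from */
	uint32_t read_latency_ns;   /* assumed unit: nanoseconds */
	uint32_t write_latency_ns;
	uint32_t read_bw_mbps;      /* assumed unit: MB/s */
	uint32_t write_bw_mbps;
};

/* Return 1 if addr lies inside the range, 0 otherwise. */
static int mem_range_contains(const struct mem_range_attrs *r, uint64_t addr)
{
	assert(r != NULL);
	return addr >= r->start && addr - r->start < r->size;
}

/*
 * Return the index of the range with the lowest read latency as seen
 * from the given initiator, or -1 if no range describes that initiator.
 */
static int mem_range_fastest_read(const struct mem_range_attrs *r, int n,
				  unsigned int initiator)
{
	int best = -1;

	assert(r != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		if (r[i].initiator != initiator)
			continue;
		if (best < 0 || r[i].read_latency_ns < r[best].read_latency_ns)
			best = i;
	}
	return best;
}
```

Nothing in such a descriptor depends on whether the range is backed by DDR,
HBM, persistent memory or a cache coherent device, which is the point of
the question above.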

- Ross
