From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Bob Liu <liubo95@huawei.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, John Hubbard <jhubbard@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	David Nellans <dnellans@nvidia.com>,
	Balbir Singh <bsingharora@gmail.com>,
	majiuyue <majiuyue@huawei.com>,
	"xieyisheng (A)" <xieyisheng1@huawei.com>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Michal Hocko <mhocko@kernel.org>
Subject: Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
Date: Fri, 8 Sep 2017 13:43:44 -0600	[thread overview]
Message-ID: <20170908194344.GA1956@linux.intel.com> (raw)
In-Reply-To: <20170905192050.GC19397@redhat.com>

On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
<>
> Does HMAT support device hotplug? I am unfamiliar with the inner workings
> of ACPI versus PCIe. Anyway, I don't see any issue with device memory also
> showing up through HMAT, but like I said, the device driver will want to be
> in total control of that memory.

Yep, the HMAT will support device hotplug via the _HMA method (section 6.2.18
of ACPI 6.2).  This basically supplies an entirely new HMAT that the system
will use to replace the current one.

I don't yet have support for _HMA in my enabling, but I do intend to add
support for it once we settle on a sysfs API for the regular boot-time case.
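For anyone following along, the key property of _HMA is that it supplies a complete replacement table rather than an incremental patch. A rough sketch of those semantics, using a toy dict-of-dicts as a stand-in for the real ACPI structures (the field names here are made up for illustration):

```python
# Sketch of _HMA-style table replacement: a hotplug event delivers a
# complete new HMAT, and the old one is discarded wholesale rather than
# merged. The dict "table" is a toy stand-in for the real ACPI structures.

class HmatState:
    def __init__(self, table):
        # e.g. {range_id: {"latency_ns": ..., "bw_mbps": ...}}
        self.table = table

    def handle_hma(self, new_table):
        """_HMA supplies an entirely new HMAT; replace, don't merge."""
        self.table = dict(new_table)

state = HmatState({0: {"latency_ns": 80, "bw_mbps": 20000}})
# Hotplug: the new table describes a different set of ranges.
state.handle_hma({1: {"latency_ns": 300, "bw_mbps": 90000}})
assert 0 not in state.table              # old entry is gone entirely
assert state.table[1]["bw_mbps"] == 90000
```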

> Like I said, the issue here is that the core kernel is unaware of the device's
> activity, i.e. which parts of memory the device is actively working on. So the
> core mm cannot make informed decisions about what should be migrated to device
> memory. Also, we do not want regular memory allocations to end up in device
> memory unless explicitly asked for. A few reasons for that. First, this memory
> might be used not only for compute tasks but also for graphics, and in that
> case there are hard constraints on physically contiguous memory allocation
> that require the GPU to move things around to make room for graphics objects
> (can't allow GUP).
> 
> Second, the device memory is inherently unreliable. If there is a bug in the
> device driver, or the user manages to trigger a faulty condition on the GPU,
> the device might need a hard reset (i.e. cut PCIe power to the device), which
> leads to loss of memory content. While GPUs are becoming more and more
> resilient, they are still prone to lockups.
> 
> Finally, for GPUs there is a common pattern of memory over-commit. You pretend
> to each application that it is the only one, and allow each of them to
> allocate all of the device memory, or more than it could get with strict
> sharing. As GPUs have long timeslices between switching to different
> contexts/applications, they can easily move large chunks of process memory out
> and in at context/application switch. This has proven to be a key aspect of
> achieving maximum performance across several concurrent applications/contexts.
> 
> To implement this, the easiest solution is for the device to lie about how
> much memory it has and use system memory as an overflow.

I don't think any of this precludes the HMAT being involved.  This is all very
similar to what I think we need to do for high bandwidth memory, for example.
We don't want the OS to use it for anything, and we want all of it to be
available for applications to allocate and use for their specific workload.
We don't want to make any assumptions about how it can or should be used.

The HMAT is just there to give us a few things:

1) It provides us with an explicit way of telling the OS not to use the
memory, in the form of the "Reservation hint" flag in the Memory Subsystem
Address Range Structure (ACPI 6.2 section 5.2.27.3).  I expect that this will
be set for persistent memory and HBM, and it sounds like you'd expect it to be
set for your device memory as well.
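As a concrete illustration, checking that flag while walking the table might look like the sketch below. The bit positions are my reading of ACPI 6.2 section 5.2.27.3 (reservation hint in Flags bit 2), so verify against the spec before relying on them:

```python
import struct

# Flags field of the HMAT Memory Subsystem Address Range Structure.
# Bit positions below are my reading of ACPI 6.2 section 5.2.27.3 --
# double-check against the spec before relying on them.
FLAG_PROC_PD_VALID = 1 << 0     # processor proximity domain valid
FLAG_MEM_PD_VALID = 1 << 1      # memory proximity domain valid
FLAG_RESERVATION_HINT = 1 << 2  # "OS should not use this memory"

def is_reserved(flags_bytes):
    """Decode the 2-byte little-endian Flags field and test the hint."""
    (flags,) = struct.unpack("<H", flags_bytes)
    return bool(flags & FLAG_RESERVATION_HINT)

# A range with both proximity domains valid and the reservation hint set:
assert is_reserved(struct.pack("<H", 0b111))
# A range the OS is free to use:
assert not is_reserved(struct.pack("<H", 0b011))
```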

2) It provides us with a way of telling userspace "hey, I know about some
memory, and I can tell you its performance characteristics".  All control of
how this memory is allocated and used is still left to userspace.
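By way of example, a userspace consumer of such an API would just read and parse single-value attribute files. The attribute names and units below are hypothetical placeholders (with StringIO standing in for open() on real sysfs paths), not the actual proposed layout:

```python
import io

# Hypothetical sysfs-style attributes a userspace allocator might read.
# The names and units are placeholders, not the real proposed API; the
# StringIO objects stand in for open("/sys/.../<attr>") file handles.
def read_attr(f):
    """sysfs attributes are single values terminated by a newline."""
    return f.read().strip()

attrs = {
    "read_bandwidth_mbps": io.StringIO("90000\n"),
    "read_latency_ns": io.StringIO("300\n"),
}
perf = {name: int(read_attr(f)) for name, f in attrs.items()}

# Userspace, not the kernel, decides what to do with the numbers:
assert perf["read_bandwidth_mbps"] == 90000
assert perf["read_latency_ns"] == 300
```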

> I am not saying that NUMA is not the way forward; I am saying that, as it is
> today, it is not suited for this. It is lacking metrics, it is lacking logic,
> it is lacking features. We could add all of this, but it is a lot of work, and
> I don't feel that we have enough real-world experience to do so now. I would
> rather have each device grow proper infrastructure in its driver through a
> device-specific API.

To be clear, I'm not proposing that we teach the NUMA code how to
automatically allocate from a given NUMA node, balance, etc. the memory
described by the HMAT.  All I want is an API that says "here is some memory,
I'll tell you all I can about it and let you do with it what you will", and
perhaps a way to manually allocate what you want.

And yes, this is very hand-wavy at this point. :)  After I get the sysfs
portion sussed out the next step is to work on enabling something like
libnuma to allow the memory to be manually allocated.

I think this works for both my use case and yours, correct?

> Then identify common patterns, and from there try to build a sane API (if any
> such thing exists :)), rather than trying today to build the whole house from
> the ground up with just a foggy idea of how it should look in the end.

Yeah, I do see your point.  My worry is that if I define an API and you define
an API, we'll end up in two different places, with people using our different
APIs, then:

https://xkcd.com/927/

:)

The HMAT enabling I'm trying to do is very passive - it doesn't actively do
*anything* with the memory; its entire purpose is to give userspace more
information about the memory so userspace can make informed decisions.

Would you be willing to look at the sysfs API I have defined, and see if it
would work for you?

https://lkml.org/lkml/2017/7/6/749

I'll look harder at your enabling and see if we can figure out some common
ground.


