* [RFC summary] Enable Coherent Device Memory
@ 2017-05-12  6:18 Balbir Singh
  2017-05-12 10:26 ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2017-05-12  6:18 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, khandual, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, arbab, vbabka, Christoph Lameter, Rik van Riel,
	Benjamin Herrenschmidt

Here is a summary of the RFC I posted for coherent device memory
(see https://lwn.net/Articles/720380/)

I did an FAQ in one of the emails; I am extending that into summary form
so that we can move ahead towards making a decision.

What is coherent device memory?
 - Please see the RFC (https://lwn.net/Articles/720380/) and
   https://lwn.net/Articles/717601/
Why do we need to isolate memory?
 - CDM memory is not meant for normal usage; applications can request it
   explicitly and offload their compute to the device where the memory is
   (the offload is via a user space API like CUDA/OpenCL/...)
How do we isolate the memory - NUMA or HMM-CDM?
 - Since the memory is coherent, NUMA provides a mechanism to isolate it to
   a large extent via mempolicy (see the mbind sketch after this list).
   With NUMA we also get autonuma/kswapd/etc. running, something we would
   like to avoid. NUMA gives the application a transparent view of memory,
   in the sense that all mm features work, like direct page cache
   allocation in coherent device memory, limiting memory via cgroups if
   required, etc. With cpusets, it's possible for us to isolate allocation.
   One challenge is that the admin on the system may use them differently,
   and applications need to be aware of running in the right cpuset to
   allocate memory from the CDM node. Putting all applications in the
   cpuset with the CDM node is not the right thing to do, which means the
   application needs to move itself to the right cpuset before requesting
   CDM memory. It's not impossible to use cpusets, just hard to configure
   correctly.
 - With HMM, we would need an HMM variant, HMM-CDM, so that the pages are
   not marked as unavailable; the page cache cannot go directly to coherent
   memory. An audit of mm paths is required. Most other things should work.
   User access to HMM-CDM memory behind ZONE_DEVICE is via a device driver.
Do we need to isolate node attributes independent of coherent device memory?
 - Christoph Lameter thought it would be useful to isolate node attributes,
   specifically ksm/autonuma, for low latency stuff.
Why do we need migration?
 - Depending on where the memory is being accessed from, we would like to
   migrate pages between system and coherent device memory. HMM provides
   DMA offload capability that is useful in both cases.
What is the larger picture - end to end?
 - Applications can allocate memory on the device or in system memory and
   offload the compute via a user space API. Migration can be used for performance
   if required since it helps to keep the memory local to the compute.
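
For illustration, here is a minimal userspace sketch of the mempolicy-based
isolation mentioned above. The CDM node id and buffer size are placeholders,
not part of the RFC; build against libnuma (-lnuma).

/*
 * Minimal sketch (not from the RFC): bind an anonymous mapping to a single
 * NUMA node with mbind(). CDM_NODE is an assumed node id; a real
 * application would discover it from the device driver or sysfs.
 * Build with: gcc cdm_bind.c -o cdm_bind -lnuma
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

#define CDM_NODE 1              /* assumed node id of the CDM memory */
#define LEN      (64UL << 20)   /* 64MB working buffer */

int main(void)
{
	unsigned long nodemask = 1UL << CDM_NODE;
	void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Restrict this VMA to the CDM node; MPOL_MF_MOVE migrates any
	 * pages that were already faulted in elsewhere. */
	if (mbind(buf, LEN, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
		  MPOL_MF_MOVE)) {
		perror("mbind");
		return 1;
	}

	((char *)buf)[0] = 1;	/* touch it so it is allocated on CDM_NODE */
	return 0;
}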

Comments from the thread

1. If we go down the NUMA path, we need to live with the limitations of
   what comes with a CPU-less NUMA node
2. The changes made to cpusets and mempolicies make the code more complex
3. We need a good end-to-end story

The comments from the thread were responded to

How do we go about implementing CDM then?

The recommendation from John Hubbard/Mel Gorman and Michal Hocko is to
use HMM-CDM to solve the problem. Jerome/Balbir and Ben H prefer NUMA-CDM.
There were suggestions that NUMA might not be ready or might not be the
best approach in the long term, but we have yet to identify what changes
to NUMA would enable it to support NUMA-CDM.

The trade-offs and limitations/advantages of both approaches are in the
RFC thread and in the summary above. It seems from the discussions with
Michal/Mel/John that the direction is to use HMM-CDM for now (both from the
thread and from mm-summit). Can we build consensus on this and move forward?
Are there any objections? Did I miss or misrepresent anything from the threads?
It would be good to get feedback from Andrew Morton and Rik van Riel as well.

Balbir Singh.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-12  6:18 [RFC summary] Enable Coherent Device Memory Balbir Singh
@ 2017-05-12 10:26 ` Mel Gorman
  2017-05-15 23:45   ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2017-05-12 10:26 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, arbab, vbabka, Christoph Lameter, Rik van Riel,
	Benjamin Herrenschmidt

On Fri, May 12, 2017 at 04:18:02PM +1000, Balbir Singh wrote:
> Why do we need to isolate memory?
>  - CDM memory is not meant for normal usage; applications can request it
>    explicitly and offload their compute to the device where the memory is
>    (the offload is via a user space API like CUDA/OpenCL/...)

It still remains unanswered to a large extent why this cannot be
isolated after the fact via a standard mechanism. It may be easier if
the onlining of CDM memory can be deferred at boot until userspace
helpers can trigger the onlining and isolation.
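
As a rough sketch of the userspace helper being suggested (the memory block
number is a placeholder; a real helper would walk /sys/devices/system/memory/
and match blocks by physical address or node):

/*
 * Minimal sketch: online one memory block of the (already hot-added,
 * still offline) CDM range as movable via the standard memory-hotplug
 * sysfs interface.
 */
#include <stdio.h>

int main(void)
{
	const char *state = "/sys/devices/system/memory/memory42/state";
	FILE *f = fopen(state, "w");

	if (!f) {
		perror(state);
		return 1;
	}
	/* "online_movable" keeps kernel allocations off the block. */
	if (fputs("online_movable", f) == EOF || fclose(f) == EOF) {
		perror(state);
		return 1;
	}
	return 0;
}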

> How do we isolate the memory - NUMA or HMM-CDM?
>  - Since the memory is coherent, NUMA provides the mechanism to isolate to
>    a large extent via mempolicy. With NUMA we also get autonuma/kswapd/etc
>    running.

This has come up before with respect to autonuma and there appears to be
confusion. autonuma doesn't run on nodes as such. The page table hinting
happens in per-task context but should skip VMAs that are controlled by
a policy. While some care is needed from the application, it's managable
and would perform better than special casing the marking of pages placed
on a CDM-controlled node.

As for kswapd, there isn't a user-controllable method for controlling
this. However, if a device onlining the memory sets the watermarks to 0,
it would allow the full CDM memory to be used by the application and kswapd
would never be woken.

KSM is potentially more problematic and initially may have to be disabled
entirely to determine if it actually matters for CDM-aware applications or
not. KSM normally comes into play when virtual machines are involved so it
would have to be decided if CDM is being exposed to guests with pass-thru
or some other mechanism. Initially, just disable it unless the use cases
are known.

>    Something we would like to avoid. NUMA gives the application
>    a transparent view of memory, in the sense that all mm features work,
>    like direct page cache allocation in coherent device memory, limiting
>    memory via cgroups if required, etc. With CPUSets, it's
>    possible for us to isolate allocation. One challenge is that the
>    admin on the system may use them differently and applications need to
>    be aware of running in the right cpuset to allocate memory from the
>    CDM node.

An admin and an application have to deal with this complexity regardless.
Particular care would be needed for file-backed data as an application
would have to ensure the data was not already cache resident. For
example, creating a data file and then doing computation on it may be
problematic. Unconditionally, the application is going to have to deal
with migration.

Identifying issues like this is why an end-to-end application that
takes advantage of the feature is important. Otherwise, there is a risk
that APIs are exposed to userspace that are Linux-specific,
device-specific and unusable.

>    Putting all applications in the cpuset with the CDM node is
>    not the right thing to do, which means the application needs to move itself
>    to the right cpuset before requesting for CDM memory. It's not impossible
>    to use CPUsets, just hard to configure correctly.

They optionally could also use move_pages.

>   - With HMM, we would need an HMM variant, HMM-CDM, so that we are not marking
>    the pages as unavailable; page cache cannot go directly to coherent memory.
>    Audit of mm paths is required. Most of the other things should work.
>    User access to HMM-CDM memory behind ZONE_DEVICE is via a device driver.

The main reason why I would prefer HMM-CDM is two-fold. The first is
that using these accelerators still has use cases that are not very well
defined but if an application could use either CDM or HMM transparently
then it may be better overall.

The second reason is because there are technologies like near-memory coming
in the future and there is no infrastructure in place to take advantage of
them. I haven't even heard of plans from developers working with vendors of
such devices on how they intend to support it. Hence, the desired policies
are unknown, such as whether the near memory should be isolated or if there
should be policies that promote/demote data between NUMA nodes instead of
reclaim. While I'm not involved in enabling such technology, I worry that
there will be collisions in the policies required for CDM and those required
for near-memory but once the API is exposed to userspace, it becomes fixed.

> Do we need to isolate node attributes independent of coherent device memory?
>  - Christoph Lameter thought it would be useful to isolate node attributes,
>    specifically ksm/autonuma for low latency stuff.

Whatever about KSM, I would have suggested that autonuma have a prctl
flag to disable autonuma on a per-task basis. It would be sufficient for
anonymous memory at least. It would have some hazards if a
latency-sensitive application shared file-backed data with a normal
application but latency-sensitive applications generally have to take
care to isolate themselves properly.
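
For illustration only, such a per-task knob could look roughly like the
sketch below from the application side. No such prctl exists in mainline at
the time of this thread; PR_SET_NUMA_BALANCING and its value are assumed
names used purely to make the suggestion concrete.

/*
 * Hypothetical sketch only: a per-task prctl to disable automatic NUMA
 * balancing, as suggested above. PR_SET_NUMA_BALANCING and its argument
 * are assumed names, not an interface that exists in mainline at the
 * time of this thread.
 */
#include <sys/prctl.h>
#include <stdio.h>

#ifndef PR_SET_NUMA_BALANCING
#define PR_SET_NUMA_BALANCING 0x4e42	/* placeholder value */
#endif

int main(void)
{
	/* Ask the kernel to stop NUMA-balancing this task's pages; on a
	 * kernel without the feature this simply fails with EINVAL. */
	if (prctl(PR_SET_NUMA_BALANCING, 0, 0, 0, 0))
		perror("prctl(PR_SET_NUMA_BALANCING)");
	return 0;
}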

> Why do we need migration?
>  - Depending on where the memory is being accessed from, we would like to
>    migrate pages between system and coherent device memory. HMM provides
>    DMA offload capability that is useful in both cases.

That suggests that HMM would be a better idea.

> What is the larger picture - end to end?
>  - Applications can allocate memory on the device or in system memory,
>    offload the compute via user space API. Migration can be used for performance
>    if required since it helps to keep the memory local to the compute.
> 

The end-to-end is what matters because there is an expectation that
applications will have to use libraries to control the actual acceleration
and collection of results. The same libraries should be responsible for
doing the migration if necessary. While I accept that bringing up the
library would be inconvenient as supporting tools will be needed for the
application, it's better than quickly exposing CDM devices as NUMA as this
suggests, applying the policies and then finding the same supporting tools
and libraries were needed anyway and the proposed policies did not help.

> Comments from the thread
> 
> 1. If we go down the NUMA path, we need to live with the limitations of
>    what comes with the cpuless NUMA node
> 2. The changes made to cpusets and mempolicies, make the code more complex
> 3. We need a good end to end story
> 
> The comments from the thread were responded to
> 
> How do we go about implementing CDM then?
> 
> The recommendation from John Hubbard/Mel Gorman and Michal Hocko is to
> use HMM-CDM to solve the problem. Jerome/Balbir and Ben H prefer NUMA-CDM.
> There were suggestions that NUMA might not be ready or is the best approach
> in the long term, but we are yet to identify what changes to NUMA would
> enable it to support NUMA-CDM.
> 

Primarily, I would suggest that HMM-CDM be taken as far as possible on the
hope/expectation that an application could transparently use either CDM
(memory visible to both CPU and device) or HMM (special care required)
with a common library API. This may be unworkable ultimately but it's
impossible to know unless someone is fully up to date with exactly how
these devices are to be used by applications.

If NUMA nodes are still required then the initial path appears to
be controlling the onlining of memory from the device, isolating from
userspace with existing mechanisms and using library awareness to control
the migration. If DMA offloading is required then the device would also
need to control that which may or may not push it towards HMM again.
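
As one small example of that library awareness, the following sketch uses
the query mode of move_pages() (nodes == NULL) to report which node currently
backs each page of a buffer, which is the information a library would need
before deciding whether to migrate; the buffer here is just a stand-in.

/*
 * Minimal sketch: report which node currently backs each page of a buffer
 * using move_pages() in query mode (nodes == NULL). A CDM-aware library
 * could use this to decide whether pages need migrating before an offload.
 * Build with: gcc where_is.c -o where_is -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long i, npages = 8;
	char *buf = malloc(npages * page_size);
	void **pages = malloc(npages * sizeof(*pages));
	int *status = malloc(npages * sizeof(*status));

	for (i = 0; i < npages; i++) {
		buf[i * page_size] = 1;		/* fault the page in */
		pages[i] = buf + i * page_size;
	}

	/* nodes == NULL means "tell me the current node of each page". */
	if (move_pages(0, npages, pages, NULL, status, 0)) {
		perror("move_pages");
		return 1;
	}
	for (i = 0; i < npages; i++)
		printf("page %lu is on node %d\n", i, status[i]);
	return 0;
}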

-- 
Mel Gorman
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-12 10:26 ` Mel Gorman
@ 2017-05-15 23:45   ` Balbir Singh
  2017-05-16  8:43     ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2017-05-15 23:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel, Benjamin Herrenschmidt

Hi, Mel

On Fri, May 12, 2017 at 8:26 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Fri, May 12, 2017 at 04:18:02PM +1000, Balbir Singh wrote:
>> Why do we need to isolate memory?
>>  - CDM memory is not meant for normal usage; applications can request it
>>    explicitly and offload their compute to the device where the memory is
>>    (the offload is via a user space API like CUDA/OpenCL/...)
>
> It still remains unanswered to a large extent why this cannot be
> isolated after the fact via a standard mechanism. It may be easier if
> the onlining of CDM memory can be deferred at boot until userspace
> helpers can trigger the onlining and isolation.
>

Sure, yes! I also see the need to have tasks migrate between
cpusets at runtime, depending on a trigger mechanism, perhaps the
allocation request?

>> How do we isolate the memory - NUMA or HMM-CDM?
>>  - Since the memory is coherent, NUMA provides the mechanism to isolate to
>>    a large extent via mempolicy. With NUMA we also get autonuma/kswapd/etc
>>    running.
>
> This has come up before with respect to autonuma and there appears to be
> confusion. autonuma doesn't run on nodes as such. The page table hinting
> happens in per-task context but should skip VMAs that are controlled by
> a policy. While some care is needed from the application, it's manageable
> and would perform better than special casing the marking of pages placed
> on a CDM-controlled node.
>

I presume you're referring to the vma_is_migratable() bits, but that means
the application does malloc() followed by madvise() or something else to
mark the VMA. The mm could do some of this automatically depending on the
node from which a fault/allocation occurs. But a VMA could contain pages
from different nodes. In my current branch the checks are in
numa_migrate_prep() to check if the page belongs to CDM memory.

> As for kswapd, there isn't a user-controllable method for controlling
> this. However, if a device onlining the memory sets the watermarks to 0,
> it would allow the full CDM memory to be used by the application and kswapd
> would never be woken.

Fair point, I presume you are suggesting we set the low/min/high to 0.

>
> KSM is potentially more problematic and initially may have to be disabled
> entirely to determine if it actually matters for CDM-aware applications or
> not. KSM normally comes into play when virtual machines are involved so it
> would have to be decided if CDM is being exposed to guests with pass-thru
> or some other mechanism. Initially, just disable it unless the use cases
> are known.

OK. With mixed workloads we may selectively enable it and ensure that none
of the MERGEABLE pages end up on CDM.

>
>>    Something we would like to avoid. NUMA gives the application
>>    a transparent view of memory, in the sense that all mm features work,
>>    like direct page cache allocation in coherent device memory, limiting
>>    memory via cgroups if required, etc. With CPUSets, it's
>>    possible for us to isolate allocation. One challenge is that the
>>    admin on the system may use them differently and applications need to
>>    be aware of running in the right cpuset to allocate memory from the
>>    CDM node.
>
> An admin and an application have to deal with this complexity regardless.

I was thinking along the lines of cpusets working orthogonally to CDM
and not managing CDM memory; that way the concerns are different.
A policy set on cpusets does not impact CDM memory. It also means
that CDM memory is not used for total memory computation and related
statistics.

> Particular care would be needed for file-backed data as an application
> would have to ensure the data was not already cache resident. For
> example, creating a data file and then doing computation on it may be
> problematic. Unconditionally, the application is going to have to deal
> with migration.
>

Isn't migration transparent to the application? It may affect performance, though.

> Identifying issues like this is why an end-to-end application that
> takes advantage of the feature is important. Otherwise, there is a risk
> that APIs are exposed to userspace that are Linux-specific,
> device-specific and unusable.
>
>>    Putting all applications in the cpuset with the CDM node is
>>    not the right thing to do, which means the application needs to move itself
>>    to the right cpuset before requesting for CDM memory. It's not impossible
>>    to use CPUsets, just hard to configure correctly.
>
> They optionally could also use move_pages.

move_pages() to move the memory to the right node after the allocation?

>
>>   - With HMM, we would need an HMM variant, HMM-CDM, so that we are not marking
>>    the pages as unavailable; page cache cannot go directly to coherent memory.
>>    Audit of mm paths is required. Most of the other things should work.
>>    User access to HMM-CDM memory behind ZONE_DEVICE is via a device driver.
>
> The main reason why I would prefer HMM-CDM is two-fold. The first is
> that using these accelerators still has use cases that are not very well
> defined but if an application could use either CDM or HMM transparently
> then it may be better overall.
>
> The second reason is because there are technologies like near-memory coming
> in the future and there is no infrastructure in place to take advantage of
> them. I haven't even heard of plans from developers working with vendors of
> such devices on how they intend to support it. Hence, the desired policies
> are unknown such as whether the near memory should be isolated or if there
> should be policies that promote/demote data between NUMA nodes instead of
> reclaim. While I'm not involved in enabling such technology, I worry that
> there will be collisions in the policies required for CDM and those required
> for near-memory but once the API is exposed to userspace, it becomes fixed.
>

OK, I see your concern, it is definitely valid. We do have a use case,
but I wonder how long we should wait?

>> Do we need to isolate node attributes independent of coherent device memory?
>>  - Christoph Lameter thought it would be useful to isolate node attributes,
>>    specifically ksm/autonuma for low latency stuff.
>
> Whatever about KSM, I would have suggested that autonuma have a prctl
> flag to disable autonuma on a per-task basis. It would be sufficient for
> anonymous memory at least. It would have some hazards if a
> latency-sensitive application shared file-backed data with a normal
> application but latency-sensitive applications generally have to take
> care to isolate themselves properly.
>

OK, I was planning on doing an isolated feature set. But I am still trying
to think through what it would mean in terms of complexity for the mm. Not
having all of N_MEMORY participating in a particular feature/algorithm is
something most admins will not want to enable.

>> Why do we need migration?
>>  - Depending on where the memory is being accessed from, we would like to
>>    migrate pages between system and coherent device memory. HMM provides
>>    DMA offload capability that is useful in both cases.
>
> That suggests that HMM would be a better idea.

Yes, the total end-to-end did include HMM to begin with; we need the migration
capabilities from HMM even with NUMA-CDM.

>
>> What is the larger picture - end to end?
>>  - Applications can allocate memory on the device or in system memory,
>>    offload the compute via user space API. Migration can be used for performance
>>    if required since it helps to keep the memory local to the compute.
>>
>
> The end-to-end is what matters because there is an expectation that
> applications will have to use libraries to control the actual acceleration
> and collection of results. The same libraries should be responsible for
> doing the migration if necessary. While I accept that bringing up the
> library would be inconvenient as supporting tools will be needed for the
> application, it's better than quickly exposing CDM devices as NUMA as this
> suggests, applying the policies and then finding the same supporting tools
> and libraries were needed anyway and the proposed policies did not help.
>
>> Comments from the thread
>>
>> 1. If we go down the NUMA path, we need to live with the limitations of
>>    what comes with the cpuless NUMA node
>> 2. The changes made to cpusets and mempolicies, make the code more complex
>> 3. We need a good end to end story
>>
>> The comments from the thread were responded to
>>
>> How do we go about implementing CDM then?
>>
>> The recommendation from John Hubbard/Mel Gorman and Michal Hocko is to
>> use HMM-CDM to solve the problem. Jerome/Balbir and Ben H prefer NUMA-CDM.
>> There were suggestions that NUMA might not be ready or is the best approach
>> in the long term, but we are yet to identify what changes to NUMA would
>> enable it to support NUMA-CDM.
>>
>
> Primarily, I would suggest that HMM-CDM be taken as far as possible on the
> hope/expectation that an application could transparently use either CDM
> (memory visible to both CPU and device) or HMM (special care required)
> with a common library API. This may be unworkable ultimately but it's
> impossible to know unless someone is fully up to date with exactly how
> these devices are to be used by applications.
>
> If NUMA nodes are still required then the initial path appears to
> be controlling the onlining of memory from the device, isolating from
> userspace with existing mechanisms and using library awareness to control
> the migration. If DMA offloading is required then the device would also
> need to control that which may or may not push it towards HMM again.
>

Agreed, but I think both NUMA and DMA offloading are possible together.
The user space uses NUMA APIs and the driver can use DMA offloading
for migration of pages depending on any heuristics or user-provided
hints that a page may soon be needed on the device. Some application
details depend on whether the memory is fully driver managed (HMM-CDM)
or NUMA. We've been seriously looking at HMM-CDM as an alternative
to NUMA. We'll push in that direction and see, beyond our auditing, what
else we run into.

Thanks for the detailed feedback,
Balbir Singh.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-15 23:45   ` Balbir Singh
@ 2017-05-16  8:43     ` Mel Gorman
  2017-05-16 22:26       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2017-05-16  8:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel, Benjamin Herrenschmidt

On Tue, May 16, 2017 at 09:45:43AM +1000, Balbir Singh wrote:
> Hi, Mel
> 
> On Fri, May 12, 2017 at 8:26 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > On Fri, May 12, 2017 at 04:18:02PM +1000, Balbir Singh wrote:
> >> Why do we need to isolate memory?
> >>  - CDM memory is not meant for normal usage; applications can request it
> >>    explicitly and offload their compute to the device where the memory is
> >>    (the offload is via a user space API like CUDA/OpenCL/...)
> >
> > It still remains unanswered to a large extent why this cannot be
> > isolated after the fact via a standard mechanism. It may be easier if
> > the onlining of CDM memory can be deferred at boot until userspace
> > helpers can trigger the onlining and isolation.
> >
> 
> Sure, yes! I also see the need to have tasks migrate between
> cpusets at runtime, depending on a trigger mechanism, the allocation
> request maybe?
> 

That would be a userspace decision and does not have to be a kernel
decision. It would likely be controlled by whatever moves tasks between
cpusets but if fine-grained control is needed then the application would
need to link to a library that can handle that via a callback mechanism.
The kernel is not going to automagically know what the application requires.

> >> How do we isolate the memory - NUMA or HMM-CDM?
> >>  - Since the memory is coherent, NUMA provides the mechanism to isolate to
> >>    a large extent via mempolicy. With NUMA we also get autonuma/kswapd/etc
> >>    running.
> >
> > This has come up before with respect to autonuma and there appears to be
> > confusion. autonuma doesn't run on nodes as such. The page table hinting
> > happens in per-task context but should skip VMAs that are controlled by
> > a policy. While some care is needed from the application, it's manageable
> > and would perform better than special casing the marking of pages placed
> > on a CDM-controlled node.
> >
> 
> I presume you're referring to the vma_is_migratable() bits, but that means
> the application does malloc() followed by madvise() or something else to
> mark the VMA.

More likely set_mempolicy, but as with other places, some degree of
application awareness is involved because, at the very least, something
needs to know how to trigger the CDM device to do computation and
co-ordinate to pick up the result.
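
For reference, a task-wide variant of that is a minimal sketch like the one
below, which binds all future allocations of the calling task to an assumed
CDM node with set_mempolicy(); the node id is a placeholder.

/*
 * Minimal sketch: restrict the calling task's future allocations to one
 * node with set_mempolicy(). CDM_NODE is an assumed node id.
 * Build with: gcc cdm_policy.c -o cdm_policy -lnuma
 */
#include <numaif.h>
#include <stdio.h>

#define CDM_NODE 1

int main(void)
{
	unsigned long nodemask = 1UL << CDM_NODE;

	/* Every page this task faults in from now on comes from CDM_NODE. */
	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
		perror("set_mempolicy");
		return 1;
	}

	/* ... allocate buffers and hand them to the offload library ... */
	return 0;
}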

> The mm
> could do some of this automatically depending on the node from which a fault/
> allocation occurs.

That would require wiring policy into the kernel unnecessarily and would
not necessarily gain you anything. If control is handled at fault time, it
means that the VMA in question would also need to have CDM as the first
fallback as it's CPUless and therefore CDM cannot be local. Even with
that, it'd have to handle the case where the CDM node was full and a
fallback occurred and the kernel does not normally automatically "fix"
that without wiring a lot of policy in.

It's also unnecessary considering that an application can use policies
to bind a VMA to the CDM node, handle failures if desired or use
migration if fallbacks are allowed.

> But a VMA could contain pages from different nodes. In my
> current branch the checks are in numa_migrate_prep() to check if the page
> belongs to CDM memory.
> 

If the policies allow VMAs to contain pages from different nodes, then the
application needs to call move_pages. Wiring this into the kernel doesn't
really help anything as the application would need to handle any in-kernel
failures such as the CDM being full.

> > As for kswapd, there isn't a user-controllable method for controlling
> > this. However, if a device onlining the memory sets the watermarks to 0,
> > it would allow the full CDM memory to be used by the application and kswapd
> > would never be woken.
> 
> Fair point, I presume you are suggesting we set the low/min/high to 0.
> 

Yes. If that is not doable for some reason then the initial userspace
support would have to take care to never allocate CDM below the high
watermark to avoid kswapd waking up.

> >
> > KSM is potentially more problematic and initially may have to be disabled
> > entirely to determine if it actually matters for CDM-aware applications or
> > not. KSM normally comes into play when virtual machines are involved so it
> > would have to be decided if CDM is being exposed to guests with pass-thru
> > or some other mechanism. Initially, just disable it unless the use cases
> > are known.
> 
> OK.. With mixed workloads we may selectively enable and ensure that none
> of the MERGABLE pages end up on CDM
> 

Yes, alternatively look into KSM settings or patches that prevent KSM
merging pages across nodes and get behind that. I'm struggling to see
why KSM in a CDM environment is even desirable so would suggest just
disabling it.
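
If the admin does want KSM running for the rest of the system, one concrete
knob along those lines is merge_across_nodes; the sketch below just turns
off cross-node merging via sysfs (writing 0 to /sys/kernel/mm/ksm/run would
disable KSM entirely instead).

/*
 * Minimal sketch: stop KSM from merging pages across NUMA nodes so that
 * system-memory pages never get merged with pages on the CDM node. The
 * kernel only allows changing this while there are no merged pages, so
 * it has to be done before KSM starts merging.
 */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/kernel/mm/ksm/merge_across_nodes";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	if (fputs("0", f) == EOF || fclose(f) == EOF) {
		perror(knob);
		return 1;
	}
	return 0;
}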

> >
> >>    Something we would like to avoid. NUMA gives the application
> >>    a transparent view of memory, in the sense that all mm features work,
> >>    like direct page cache allocation in coherent device memory, limiting
> >>    memory via cgroups if required, etc. With CPUSets, it's
> >>    possible for us to isolate allocation. One challenge is that the
> >>    admin on the system may use them differently and applications need to
> >>    be aware of running in the right cpuset to allocate memory from the
> >>    CDM node.
> >
> > An admin and an application have to deal with this complexity regardless.
> 
> I was thinking along the lines of cpusets working orthogonal to CDM
> and not managing CDM memory, that way the concerns are different.
> A policy set on cpusets does not impact CDM memory. It also means
> that CDM memory is not used for total memory computation and related
> statistics.
> 

So far, the desire to avoid CDM being used in total memory consumption
appears to be the only core kernel thing that may need support. Whether it's
worth creating a pgdat->flag to special case that or not is debatable as
the worst impact is slightly confusing sysrq+m, oom-kill and free/top/etc
messages. That might be annoying but not a functional blocker.

> > Particular care would be needed for file-backed data as an application
> > would have to ensure the data was not already cache resident. For
> > example, creating a data file and then doing computation on it may be
> > problematic. Unconditionally, the application is going to have to deal
> > with migration.
> >
> 
> Ins't migration transparent to the application, it may affect performance.
> 

I'm not sure what you're asking here. migration is only partially
transparent but a move_pages call will be necessary to force pages onto
CDM if binding policies are not used so the cost of migration will be
invisible. Even if you made it "transparent", the migration cost would
be incurred at fault time. If anything, using move_pages would be more
predictable as you control when the cost is incurred.

> > Identifying issues like this is why an end-to-end application that
> > takes advantage of the feature is important. Otherwise, there is a risk
> > that APIs are exposed to userspace that are Linux-specific,
> > device-specific and unusable.
> >
> >>    Putting all applications in the cpuset with the CDM node is
> >>    not the right thing to do, which means the application needs to move itself
> >>    to the right cpuset before requesting for CDM memory. It's not impossible
> >>    to use CPUsets, just hard to configure correctly.
> >
> > They optionally could also use move_pages.
> 
> move_pages() to move the memory to the right node after the allocation?
> 

More specifically, move_pages before the offloaded computation begins
and optionally move it back to main memory after the computation
completes.
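
A minimal sketch of that pattern with move_pages() follows; the buffer size
and the CDM node id are assumptions, and a real library would batch the
pages backing the actual working set.

/*
 * Minimal sketch: push the pages of a buffer onto the CDM node before
 * kicking off the offloaded computation, then pull them back to node 0
 * afterwards. CDM_NODE, the buffer and its size are assumptions.
 * Build with: gcc cdm_move.c -o cdm_move -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CDM_NODE 1	/* assumed node id of the CDM memory */

static long move_buffer(void *buf, size_t len, int node)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long i, npages = (len + page_size - 1) / page_size;
	void **pages = malloc(npages * sizeof(*pages));
	int *nodes = malloc(npages * sizeof(*nodes));
	int *status = malloc(npages * sizeof(*status));
	long rc;

	for (i = 0; i < npages; i++) {
		pages[i] = (char *)buf + i * page_size;
		nodes[i] = node;
	}
	/* pid 0 == the calling process; MPOL_MF_MOVE moves only pages
	 * exclusively owned by this process. */
	rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
	free(pages);
	free(nodes);
	free(status);
	return rc;
}

int main(void)
{
	size_t i, len = 16UL << 20;		/* 16MB working buffer */
	char *buf = malloc(len);

	for (i = 0; i < len; i += 4096)
		buf[i] = 1;			/* fault the pages in */

	if (move_buffer(buf, len, CDM_NODE))	/* before the offload */
		perror("move_pages to CDM");

	/* ... trigger the device computation and collect results ... */

	if (move_buffer(buf, len, 0))		/* back to system memory */
		perror("move_pages to node 0");
	return 0;
}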

> >
> >>   - With HMM, we would need an HMM variant, HMM-CDM, so that we are not marking
> >>    the pages as unavailable; page cache cannot go directly to coherent memory.
> >>    Audit of mm paths is required. Most of the other things should work.
> >>    User access to HMM-CDM memory behind ZONE_DEVICE is via a device driver.
> >
> > The main reason why I would prefer HMM-CDM is two-fold. The first is
> > that using these accelerators still has use cases that are not very well
> > defined but if an application could use either CDM or HMM transparently
> > then it may be better overall.
> >
> > The second reason is because there are technologies like near-memory coming
> > in the future and there is no infrastructure in place to take advantage of
> > them. I haven't even heard of plans from developers working with vendors of
> > such devices on how they intend to support it. Hence, the desired policies
> > are unknown such as whether the near memory should be isolated or if there
> > should be policies that promote/demote data between NUMA nodes instead of
> > reclaim. While I'm not involved in enabling such technology, I worry that
> > there will be collisions in the policies required for CDM and those required
> > for near-memory but once the API is exposed to userspace, it becomes fixed.
> >
> 
> OK, I see your concern, it is definitely valid. We do have a use case,
> but I wonder
> how long we wait?
> 

As before, from a core kernel perspective, all the use cases described
so far can be handled with existing mechanisms *if* the driver controls
the hotplug of memory at a time chosen by userspace so it can control the
isolation, allocation and usage. Of course, the driver still needs to
exist and will have some additional complexity that other drivers do not
need but for the pure NUMA-approach to CDM, it can be handled entirely
within a driver and then controlled from userspace without requiring
additional wiring into the core vm.

The same is not quite as true for near-memory (although it could be forced
to be that way initially albeit sub-optimally due to page age inversion
problems unless extreme care was taken).

> >> Do we need to isolate node attributes independent of coherent device memory?
> >>  - Christoph Lameter thought it would be useful to isolate node attributes,
> >>    specifically ksm/autonuma for low latency stuff.
> >
> > Whatever about KSM, I would have suggested that autonuma have a prctl
> > flag to disable autonuma on a per-task basis. It would be sufficient for
> > anonymous memory at least. It would have some hazards if a
> > latency-sensitive application shared file-backed data with a normal
> > application but latency-sensitive applications generally have to take
> > care to isolate themselves properly.
> >
> 
> OK, I was planning on doing an isolated feature set. But I am still trying
> to think what it would mean in terms of complexity to the mm. Not having
> all of N_MEMORY participating in a particular feature/algorithm is something
> most admins will not want to enable.
> 

prctl disabling on a per-task basis is fairly straightforward.
Alternatively, always assign policies to VMAs being used for CDM and it'll
be left alone.

> > Primarily, I would suggest that HMM-CDM be taken as far as possible on the
> > hope/expectation that an application could transparently use either CDM
> > (memory visible to both CPU and device) or HMM (special care required)
> > with a common library API. This may be unworkable ultimately but it's
> > impossible to know unless someone is fully up to date with exactly how
> > these devices are to be used by applications.
> >
> > If NUMA nodes are still required then the initial path appears to
> > be controlling the onlining of memory from the device, isolating from
> > userspace with existing mechanisms and using library awareness to control
> > the migration. If DMA offloading is required then the device would also
> > need to control that which may or may not push it towards HMM again.
> >
> 
> Agreed, but I think both NUMA and DMA offloading are possible together.
> The user space uses NUMA APIs and the driver can use DMA offloading
> for migration of pages depending on any heuristics or user-provided
> hints that a page may soon be needed on the device. Some application
> details depend on whether the memory is fully driver managed (HMM-CDM)
> or NUMA. We've been seriously looking at HMM-CDM as an alternative
> to NUMA. We'll push in that direction and see, beyond our auditing, what
> else we run into.
> 

It's possible you'll end up with a hybrid of NUMA and HMM but right now,
it appears the NUMA part can be handled by existing mechanisms if the
driver is handling the hot-add of memory and triggered from userspace.
That actual hot-add might be a little tricky as it has to handle watermark
setting and keep the node out of default zonelists. That might require a
check in the core VM for a pgdat->flag but it would be one branch in the
zonelist building and optionally a check in the watermark configuration
which is fairly minimal.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-16  8:43     ` Mel Gorman
@ 2017-05-16 22:26       ` Benjamin Herrenschmidt
  2017-05-17  8:28         ` Mel Gorman
  2017-05-17 13:54         ` Christoph Lameter
  0 siblings, 2 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-16 22:26 UTC (permalink / raw)
  To: Mel Gorman, Balbir Singh
  Cc: linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Tue, 2017-05-16 at 09:43 +0100, Mel Gorman wrote:
> I'm not sure what you're asking here. migration is only partially
> transparent but a move_pages call will be necessary to force pages onto
> CDM if binding policies are not used so the cost of migration will be
> invisible. Even if you made it "transparent", the migration cost would
> be incurred at fault time. If anything, using move_pages would be more
> predictable as you control when the cost is incurred.

One of the main points of this whole exercise is for applications to not
have to bother with any of this, and now you are bringing it all back into
their lap.

The base idea behind the counters we have on the link is for the HW to
know when memory is accessed "remotely", so that the device driver can
make decisions about migrating pages into or away from the device,
especially so that applications don't have to concern themselves with
memory placement.

This is also to a certain extent the programming model provided by HMM
for non-coherent devices.

While some customers want the last % of performance and will explicitly
place their memory, the general case out there is to have "plug in"
libraries using the GPU to accelerate common computational problems behind
the scenes with no awareness of memory placement. Explicit memory
placement becomes unmanageable in heavily shared environments too.

Thus we want to rely on the GPU driver moving the pages around to where
most appropriate (where they are being accessed, either core memory or
GPU memory) based on inputs from the HW counters monitoring the link.

Cheers,
Ben.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-16 22:26       ` Benjamin Herrenschmidt
@ 2017-05-17  8:28         ` Mel Gorman
  2017-05-17  9:03           ` Benjamin Herrenschmidt
  2017-05-17 13:54         ` Christoph Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2017-05-17  8:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, May 17, 2017 at 08:26:47AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-05-16 at 09:43 +0100, Mel Gorman wrote:
> > I'm not sure what you're asking here. migration is only partially
> > transparent but a move_pages call will be necessary to force pages onto
> > CDM if binding policies are not used so the cost of migration will be
> > invisible. Even if you made it "transparent", the migration cost would
> > be incurred at fault time. If anything, using move_pages would be more
> > predictable as you control when the cost is incurred.
> 
> One of the main points of this whole exercise is for applications to not
> have to bother with any of this, and now you are bringing it all back into
> their lap.
> 
> The base idea behind the counters we have on the link is for the HW to
> know when memory is accessed "remotely", so that the device driver can
> make decisions about migrating pages into or away from the device,
> especially so that applications don't have to concern themselves with
> memory placement.
> 

There is only so much magic that can be applied and if the manual case
cannot be handled then the automatic case is problematic. You say that you
want kswapd disabled, but have nothing to handle overcommit sanely. You
want to disable automatic NUMA balancing yet also be able to automatically
detect when data should move from CDM (automatic NUMA balancing by design
couldn't move data to CDM without driver support tracking GPU accesses).

To handle it transparently, either the driver needs to do the work in which
case no special core-kernel support is needed beyond what already exists or
there is a userspace daemon like numad running in userspace that decides
when to trigger migrations on a separate process that is using CDM which
would need to gather information from the driver.

In either case, the existing isolation mechanisms are still sufficient as
long as the driver hot-adds the CDM memory from a userspace trigger that
is then responsible for setting up the isolation.

All that aside, this series has nothing to do with the type of magic
you describe and the feedback as given was "at this point, what you are
looking for does not require special kernel support or heavy wiring into
the core vm".

> Thus we want to rely on the GPU driver moving the pages around to where
> most appropriate (where they are being accessed, either core memory or
> GPU memory) based on inputs from the HW counters monitoring the link.
> 

And if the driver is polling all the accesses, there are still no changes
required to the core vm as long as the driver does the hotplug and allows
userspace to isolate if that is what the applications desire.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17  8:28         ` Mel Gorman
@ 2017-05-17  9:03           ` Benjamin Herrenschmidt
  2017-05-17  9:15             ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-17  9:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, 2017-05-17 at 09:28 +0100, Mel Gorman wrote:
> On Wed, May 17, 2017 at 08:26:47AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-05-16 at 09:43 +0100, Mel Gorman wrote:
> > > I'm not sure what you're asking here. migration is only partially
> > > transparent but a move_pages call will be necessary to force pages onto
> > > CDM if binding policies are not used so the cost of migration will be
> > > invisible. Even if you made it "transparent", the migration cost would
> > > be incurred at fault time. If anything, using move_pages would be more
> > > predictable as you control when the cost is incurred.
> > 
> > One of the main points of this whole exercise is for applications to not
> > have to bother with any of this, and now you are bringing it all back into
> > their lap.
> > 
> > The base idea behind the counters we have on the link is for the HW to
> > know when memory is accessed "remotely", so that the device driver can
> > make decisions about migrating pages into or away from the device,
> > especially so that applications don't have to concern themselves with
> > memory placement.
> > 
> 
> There is only so much magic that can be applied and if the manual case
> cannot be handled then the automatic case is problematic. You say that you
> want kswapd disabled, but have nothing to handle overcommit sanely.

I am not certain we want kswapd disabled, that is definitely more of a
userspace policy, I agree. It could be that in this case it should
prioritize different pages but still be able to push them out. We *do* have
age counting etc... just less efficient / higher cost. 

>  You
> want to disable automatic NUMA balancing yet also be able to automatically
> detect when data should move from CDM (automatic NUMA balancing by design
> couldn't move data to CDM without driver support tracking GPU accesses).

We can, via a driver specific hook, since we have specific counters on
the link, so we don't want the autonuma based approach which makes PTEs
inaccessible.

> To handle it transparently, either the driver needs to do the work in which
> case no special core-kernel support is needed beyond what already exists or
> there is a userspace daemon like numad running in userspace that decides
> when to trigger migrations on a separate process that is using CDM which
> would need to gather information from the driver.

The driver can handle it, we just need autonuma off the CDM memory (it
can continue operating normally on system memory).

> In either case, the existing isolation mechanisms are still sufficient as
> long as the driver hot-adds the CDM memory from a userspace trigger that
> is then responsible for setting up the isolation.

Yes, I think the NUMA node based approach works fine using a lot of
existing stuff. There are a couple of gaps which we need to look at
fixing one way or another, such as the above, but overall I don't see
the need for some major overhaul, nor do I see the need to go down
the path of ZONE_DEVICE.

> All that aside, this series has nothing to do with the type of magic
> > you describe and the feedback as given was "at this point, what you are
> looking for does not require special kernel support or heavy wiring into
> the core vm".
> 
> > Thus we want to rely on the GPU driver moving the pages around where
> > most appropriate (where they are being accessed, either core memory or
> > GPU memory) based on inputs from the HW counters monitoring the link.
> > 
> 
> And if the driver is polling all the accesses, there are still no changes
> required to the core vm as long as the driver does the hotplug and allows
> userspace to isolate if that is what the applications desire.

With one main exception ... 

We also do want normal allocations to avoid going to the GPU memory.

IE, things should go to the GPU memory if and only if they are either
explicitly put there by the application/driver (the case where
applications do care about manual placement), or the migration case.

The latter is triggered by the driver, so it's also a case of the
driver allocating the GPU pages and doing a migration to them.

This is the key thing. Now creating a CMA or using ZONE_MOVABLE can
handle at least keeping kernel allocations off the GPU. However we
would also like to keep random unrelated user memory & page cache off
as well.

There are various reasons for that, some related to the fact that the
performance characteristics of that memory (ie latency) could cause
nasty surprises for normal applications, some related to the fact that
this memory is rather unreliable compared to system memory...

Cheers,
Ben.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17  9:03           ` Benjamin Herrenschmidt
@ 2017-05-17  9:15             ` Mel Gorman
  2017-05-17  9:56               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2017-05-17  9:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, May 17, 2017 at 07:03:46PM +1000, Benjamin Herrenschmidt wrote:
> > There is only so much magic that can be applied and if the manual case
> > cannot be handled then the automatic case is problematic. You say that you
> > want kswapd disabled, but have nothing to handle overcommit sanely.
> 
> I am not certain we want kswapd disabled, that is definitely more of a
> userspace policy, I agree. It could be in this case that it should
> prioritize different pages but still be able to push out. We *do* have
> age counting etc... just less efficient / higher cost. 
> 

If you don't want kswapd disabled, then the existing support is
sufficient unless different reclaim policies are required. If so, it
becomes a general problem of NUMA hierarchies where policies for nodes
may differ.

> >  You
> > want to disable automatic NUMA balancing yet also be able to automatically
> > detect when data should move from CDM (automatic NUMA balancing by design
> > couldn't move data to CDM without driver support tracking GPU accesses).
> 
> We can, via a driver specific hook, since we have specific counters on
> the link, so we don't want the autonuma based approach which makes PTEs
> inaccessible.
> 

Then poll the driver from a userspace daemon and make placement
decisions if automatic NUMA balancing's reference-based decisions are
unsuitable.

> > To handle it transparently, either the driver needs to do the work in which
> > case no special core-kernel support is needed beyond what already exists or
> > there is a userspace daemon like numad running in userspace that decides
> > when to trigger migrations on a separate process that is using CDM which
> > would need to gather information from the driver.
> 
> The driver can handle it, we just need autonuma off the CDM memory (it
> can continue operating normally on system memory).
> 

Already suggested that prctl be used to disable automatic numa balancing
on a per-task basis. Alternatively, setting a memory policy will be
enough and as the applications are going to need policies anyway, you
should be able to get that by default.

> > In either case, the existing isolation mechanisms are still sufficient as
> > long as the driver hot-adds the CDM memory from a userspace trigger that
> > is then responsible for setting up the isolation.
> 
> Yes, I think the NUMA node based approach works fine using a lot of
> existing stuff. There are a couple of gaps, which we need to look at
> fixing one way or another such as the above, but overall I don't see
> the need of some major overhaul, nor do I see the need of going down
> the path of ZONE_DEVICE.
> 

Your choice, but it also doesn't take away from the fact that special
casing in the core does not appear to be required at this point.

> > All that aside, this series has nothing to do with the type of magic
> > you describe and the feedback as given was "at this point, what you are
> > looking for does not require special kernel support or heavy wiring into
> > the core vm".
> > 
> > > Thus we want to rely on the GPU driver moving the pages around where
> > > most appropriate (where they are being accessed, either core memory or
> > > GPU memory) based on inputs from the HW counters monitoring the link.
> > > 
> > 
> > And if the driver is polling all the accesses, there are still no changes
> > required to the core vm as long as the driver does the hotplug and allows
> > userspace to isolate if that is what the applications desire.
> 
> With one main exception ... 
> 
> We also do want normal allocations to avoid going to the GPU memory.
> 

Use policies. If the NUMA distance for CDM is set high then even applications
that have access to CDM will use every other node before going to CDM. As
you insist on no application awareness, the migration to CDM will have to
be controlled by a separate daemon.

> IE, things should go to the GPU memory if and only if they are either
> explicitly put there by the application/driver (the case where
> applications do care about manual placement), or the migration case. 
> 
> The latter is triggered by the driver, so it's also a case of the
> driver allocating the GPU pages and doing a migration to them.
> 
> This is the key thing. Now creating a CMA or using ZONE_MOVABLE can
> handle at least keeping kernel allocations off the GPU. However we
> would also like to keep random unrelated user memory & page cache off
> as well.
> 

Fine -- hot add the memory from the device via a userspace trigger and
have the userspace trigger then setup the policies to isolate CDM from
general usage.
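
As a rough illustration of what that userspace trigger could do once the CDM
node is online, the sketch below confines an existing cpuset to the
system-memory nodes so tasks placed in it never allocate from the CDM node;
the cgroup v1 mount point, cpuset name and node list are assumptions about a
particular setup.

/*
 * Minimal sketch: after the CDM node has been onlined, restrict a cpuset
 * to the ordinary system-memory nodes so that tasks placed in it cannot
 * allocate from the CDM node. The mount point, cpuset name and node list
 * ("0") are assumptions about a particular setup.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	if (fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* Allow only node 0 (system memory) for the "general" cpuset. */
	return write_str("/sys/fs/cgroup/cpuset/general/cpuset.mems", "0") ? 1 : 0;
}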

-- 
Mel Gorman
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17  9:15             ` Mel Gorman
@ 2017-05-17  9:56               ` Benjamin Herrenschmidt
  2017-05-17 10:58                 ` Mel Gorman
  2017-05-17 12:41                 ` Michal Hocko
  0 siblings, 2 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-17  9:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, 2017-05-17 at 10:15 +0100, Mel Gorman wrote:
> > We can, via a driver specific hook, since we have specific counters on
> > the link, so we don't want the autonuma based approach which makes PTEs
> > inaccessible.
> > 
> 
> Then poll the driver from a userspace daemon and make placement
> decisions if automatic NUMA balancing's reference-based decisions are
> unsuitable.

Why a userspace daemon ? I don't get this... the driver will get
interrupts from the GPU with page lists, it can trigger migrations
without needing a userspace daemon...

> > > To handle it transparently, either the driver needs to do the work in which
> > > case no special core-kernel support is needed beyond what already exists or
> > > there is a userspace daemon like numad running in userspace that decides
> > > when to trigger migrations on a separate process that is using CDM which
> > > would need to gather information from the driver.
> > 
> > The driver can handle it, we just need autonuma off the CDM memory (it
> > can continue operating normally on system memory).
> > 
> 
> Already suggested that prctl be used to disable automatic numa balancing
> on a per-task basis. Alternatively, setting a memory policy will be
> enough and as the applications are going to need policies anyway, you
> should be able to get that by default.

I'm not sure we want to disable it for the whole application vs. disabling
it for pages that reside on that node; however, that could be tricky, so
disabling it for the application first might be a way to get started.

> > > In either case, the existing isolation mechanisms are still sufficient as
> > > long as the driver hot-adds the CDM memory from a userspace trigger that
> > > is then responsible for setting up the isolation.
> > 
> > Yes, I think the NUMA node based approach works fine using a lot of
> > existing stuff. There are a couple of gaps, which we need to look at
> > fixing one way or another such as the above, but overall I don't see
> > the need of some major overhaul, nor do I see the need of going down
> > the path of ZONE_DEVICE.
> > 
> Your choice, but it also doesn't take away from the fact that special
> casing in the core does not appear to be required at this point.

Well, yes and no.

If we use the NUMA based approach, then no special casing up to this
point, the only thing is below, the idea of avoiding "normal"
allocations for that type of memory.

If we use ZONE_DEVICE and the bulk of the HMM infrastructure, then we
get the above, but at the expense of a pile of special casing all over
the place for the "special" kind of struct page created for ZONE_DEVICE
(lacking LRU).

> > > All that aside, this series has nothing to do with the type of magic
> > > you describe and the feedback as given was "at this point, what you are
> > > looking for does not require special kernel support or heavy wiring into
> > > the core vm".
> > > 
> > > > Thus we want to rely on the GPU driver moving the pages around where
> > > > most appropriate (where they are being accessed, either core memory or
> > > > GPU memory) based on inputs from the HW counters monitoring the link.
> > > > 
> > > 
> > > And if the driver is polling all the accesses, there are still no changes
> > > required to the core vm as long as the driver does the hotplug and allows
> > > userspace to isolate if that is what the applications desire.
> > 
> > With one main exception ... 
> > 
> > We also do want normal allocations to avoid going to the GPU memory.
> > 
> 
> Use policies. If the NUMA distance for CDM is set high then even applications
> that have access to CDM will use every other node before going to CDM.

Yes. That was the original idea. Along with ZONE_MOVABLE to avoid
kernel allocations completely.

I think Balbir and Anshuman wanted to play with a more fully exclusive
approach where those allocations are simply not permitted.

>  As
> you insist on no application awareness, the migration to CDM will have to
> be controlled by a separate daemon.

Or by the driver itself, I don't think we need a daemon, but that's a
detail in the grand scheme of things.

> > IE, things should go to the GPU memory if and only if they are either
> > explicitly put there by the application/driver (the case where
> > applications do care about manual placement), or the migration case.
> > 
> > The latter is triggered by the driver, so it's also a case of the
> > driver allocating the GPU pages and doing a migration to them.
> > 
> > This is the key thing. Now creating a CMA or using ZONE_MOVABLE can
> > handle at least keeping kernel allocations off the GPU. However we
> > would also like to keep random unrelated user memory & page cache off
> > as well.
> > 
> 
> Fine -- hot add the memory from the device via a userspace trigger and
> have the userspace trigger then set up the policies to isolate CDM from
> general usage.

This is racy though. The memory is hot added, but things can get
allocated all over it before it has time to adjust the policies. Same
issue we had with creating a CMA I believe.

I think that's what Balbir was trying to do with the changes to the
core, to be able to create that "don't touch me" NUMA node straight
up.

Unless we have a way to create a node without actually making it
available for allocations, so we get a chance to establish policies for
it, then "online" it ?

Doing these from userspace is a bit nasty since it's expected to all be
under the control of the GPU driver, but it could be done via a
combination of GPU driver & udev helpers or a special daemon.

Cheers,
Ben.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17  9:56               ` Benjamin Herrenschmidt
@ 2017-05-17 10:58                 ` Mel Gorman
  2017-05-17 19:35                   ` Benjamin Herrenschmidt
  2017-05-17 19:37                   ` Benjamin Herrenschmidt
  2017-05-17 12:41                 ` Michal Hocko
  1 sibling, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2017-05-17 10:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, May 17, 2017 at 07:56:35PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2017-05-17 at 10:15 +0100, Mel Gorman wrote:
> > > We can, via a driver specific hook, since we have specific counters on
> > > the link, so we don't want the autonuma based approach which makes PTEs
> > > inaccessible.
> > > 
> > 
> > Then poll the driver from a userspace daemon and make placement
> > decisions if automatic NUMA balancing's reference-based decisions are
> > unsuitable.
> 
> Why a userspace daemon ? I don't get this... the driver will get
> interrupts from the GPU with page lists, it can trigger migrations
> without needing a userspace daemon...
> 

Then handle it within the driver. The point is that it still doesn't
need hooks into the core VM at this point.

> > > > To handle it transparently, either the driver needs to do the work in which
> > > > case no special core-kernel support is needed beyond what already exists or
> > > > there is a userspace daemon like numad running in userspace that decides
> > > > when to trigger migrations on a separate process that is using CDM which
> > > > would need to gather information from the driver.
> > > 
> > > The driver can handle it, we just need autonuma off the CDM memory (it
> > > can continue operating normally on system memory).
> > > 
> > 
> > Already suggested that prctl be used to disable automatic numa balancing
> > on a per-task basis. Alternatively, setting a memory policy will be
> > enough and as the applications are going to need policies anyway, you
> > should be able to get that by default.
> 
> I'm not sure we want to disable it for the whole application rather than
> just for pages that reside on that node,

Then use a memory policy to control which VMAs are exempt. If you do not
want it at all for particular nodes then that would need core VM support
but you'll lose transparency. If you want to flag particular pgdats,
then it'll be adding a check to the task scanner but it would need to be
clearly shown that there is a lot of value in teaching automatic NUMA
balancing this.
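
A rough sketch of the per-VMA policy case (the node id and the anonymous
mapping are made-up examples; mbind() is the existing syscall declared in
<numaif.h>, nothing new):

#include <numaif.h>		/* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <sys/mman.h>
#include <stddef.h>

#define CDM_NODE 1		/* example node id for the device memory */

/* Bind one anonymous mapping to the CDM node; pages of this VMA are then
 * governed by the explicit policy rather than left to default placement. */
static void *alloc_on_cdm(size_t len)
{
	unsigned long nodemask = 1UL << CDM_NODE;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}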

> > > > long as the driver hot-adds the CDM memory from a userspace trigger that
> > > > is then responsible for setting up the isolation.
> > > 
> > > Yes, I think the NUMA node based approach works fine using a lot of
> > > existing stuff. There are a couple of gaps, which we need to look at
> > > fixing one way or another such as the above, but overall I don't see
> > > the need of some major overhaul, nor do I see the need of going down
> > > the path of ZONE_DEVICE.
> > > 
> > Your choice, but it also doesn't take away from the fact that special
> > casing in the core does not appear to be required at this point.
> 
> Well, yes and no.
> 
> If we use the NUMA based approach, then no special casing up to this
> point, the only thing is below, the idea of avoiding "normal"
> allocations for that type of memory.
> 

Use cpusets from userspace, and control carefully how and when the memory
is hot-added and what zone it gets added to. We've been through this.

> > Use policies. If the NUMA distance for CDM is set high then even applications
> > that have access to CDM will use every other node before going to CDM.
> 
> Yes. That was the original idea. Along with ZONE_MOVABLE to avoid
> kernel allocations completely.
> 

Remember that this will include the page table pages which may or may
not be what you want.

> I think Balbir and Anshuman wanted to play with a more fully exclusive
> approach where those allocations are simply not permitted.
> 

Use cpusets and control carefully how and when the memory is hot-added
and what zone it gets added to.

> >  As
> > you insist on no application awareness, the migration to CDM will have to
> > be controlled by a separate daemon.
> 
> Or by the driver itself, I don't think we need a daemon, but that's a
> detail in the grand scheme of things.
> 

It also doesn't need core VM hooks or special support.

> > > IE, things should go to the GPU memory if and only if they are either
> > > explicitly put there by the application/driver (the case where
> > > applications do care about manual placement), or the migration case. 
> > > 
> > > The latter is triggered by the driver, so it's also a case of the
> > > driver allocating the GPU pages and doing a migration to them.
> > > 
> > > This is the key thing. Now creating a CMA or using ZONE_MOVABLE can
> > > handle at least keeping kernel allocations off the GPU. However we
> > > would also like to keep random unrelated user memory & page cache off
> > > as well.
> > > 
> > 
> > Fine -- hot add the memory from the device via a userspace trigger and
> > have the userspace trigger then set up the policies to isolate CDM from
> > general usage.
> 
> This is racy though. The memory is hot added, but things can get
> allocated all over it before it has time to adjust the policies. Same
> issue we had with creating a CMA I believe.
> 

The race is a non-issue unless for some reason you decide to hot-add the node
when the machine is already heavily loaded and under memory pressure. Do it
near boot time and no CPU-local allocation is going to hit it. In itself,
special casing the core VM is overkill.

If you decide to use ZONE_MOVABLE and take the remote hit penalty of page
tables, then you can also migrate all the pages away after the onlining
and isolation is complete if it's a serious concern in practice.
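
A sketch of that drain step, done per process via the existing
migrate_pages() syscall (the node ids are examples only):

#include <numaif.h>		/* migrate_pages() */
#include <stdio.h>

#define SYSTEM_NODE 0		/* example node ids */
#define CDM_NODE    1

/* Move whatever a given process already has on the CDM node back to
 * system memory once onlining and isolation are done; pid 0 means the
 * calling process. */
static int drain_cdm(int pid)
{
	unsigned long old_nodes = 1UL << CDM_NODE;
	unsigned long new_nodes = 1UL << SYSTEM_NODE;

	if (migrate_pages(pid, sizeof(old_nodes) * 8,
			  &old_nodes, &new_nodes) < 0) {
		perror("migrate_pages");
		return -1;
	}
	return 0;
}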

> Unless we have a way to create a node without actually making it
> available for allocations, so we get a chance to establish policies for
> it, then "online" it ?
> 

Conceivably, that could be done although again it's somewhat overkill
as the race only applies if hot-adding CDM under heavy memory pressure
sufficient to overflow to a very remote node.

> Doing these from userspace is a bit nasty since it's expected to all be
> under the control of the GPU driver, but it could be done via a
> combination of GPU driver & udev helpers or a special daemon.
> 

Special casing the core VM in multiple places is also nasty as it shoves
all the maintenance overhead into places where most people will not be
able to verify it's still working due to a lack of hardware.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17  9:56               ` Benjamin Herrenschmidt
  2017-05-17 10:58                 ` Mel Gorman
@ 2017-05-17 12:41                 ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-05-17 12:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mel Gorman, Balbir Singh, linux-mm, akpm, Anshuman Khandual,
	Aneesh Kumar KV, Paul E. McKenney, Srikar Dronamraju,
	Haren Myneni, Jérôme Glisse, Reza Arbab,
	Vlastimil Babka, Christoph Lameter, Rik van Riel

On Wed 17-05-17 19:56:35, Benjamin Herrenschmidt wrote:
> On Wed, 2017-05-17 at 10:15 +0100, Mel Gorman wrote:
[...]
> > Fine -- hot add the memory from the device via a userspace trigger and
> > have the userspace trigger then set up the policies to isolate CDM from
> > general usage.
> 
> This is racy though. The memory is hot added, but things can get
> allocated all over it before it has time to adjust the policies. Same
> issue we had with creating a CMA I believe.

Memory hotplug is by definition a two-stage process: the physical hot-add,
which just prepares memory blocks and allocates struct pages, and the
memory online phase. You can handle the policy part from userspace before
onlining the first memblock of your CDM NUMA node.
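
As a rough illustration of that ordering (the memory block numbers are
placeholders and the policy step is only a comment; it assumes the CDM
node's blocks are already hot-added but still offline):

#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/*
	 * 1. Set up the isolation (cpusets/policies) while the node's
	 *    memory is still offline and cannot satisfy any allocation.
	 */

	/*
	 * 2. Only then online the node's memory blocks, as movable, so
	 *    kernel allocations stay off the device memory.
	 */
	write_str("/sys/devices/system/memory/memory32/state",
		  "online_movable");
	write_str("/sys/devices/system/memory/memory33/state",
		  "online_movable");
	return 0;
}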
-- 
Michal Hocko
SUSE Labs


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-16 22:26       ` Benjamin Herrenschmidt
  2017-05-17  8:28         ` Mel Gorman
@ 2017-05-17 13:54         ` Christoph Lameter
  2017-05-17 19:39           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2017-05-17 13:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mel Gorman, Balbir Singh, linux-mm, akpm, Anshuman Khandual,
	Aneesh Kumar KV, Paul E. McKenney, Srikar Dronamraju,
	Haren Myneni, Jérôme Glisse, Reza Arbab,
	Vlastimil Babka, Rik van Riel

On Wed, 17 May 2017, Benjamin Herrenschmidt wrote:

> On Tue, 2017-05-16 at 09:43 +0100, Mel Gorman wrote:
> > I'm not sure what you're asking here. migration is only partially
> > transparent but a move_pages call will be necessary to force pages onto
> > CDM if binding policies are not used so the cost of migration will be
> > invisible. Even if you made it "transparent", the migration cost would
> > be incurred at fault time. If anything, using move_pages would be more
> > predictable as you control when the cost is incurred.
>
> One of the main point of this whole exercise is for applications to not
> have to bother with any of this and now you are bringing all back into
> their lap.

You can provide a library that does it?

> The base idea behind the counters we have on the link is for the HW to
> know when memory is accessed "remotely", so that the device driver can
> make decisions about migrating pages into or away from the device,
> especially so that applications don't have to concern themselves with
> memory placement.

The library can enquire about the current placement of the pages and move
them if necessary?
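
Roughly, with the existing move_pages() syscall (the node id is an
example; a NULL node array only queries current placement):

#include <numaif.h>		/* move_pages(), MPOL_MF_MOVE */

#define CDM_NODE 1		/* example node id */

/* Check where one of the application's pages currently lives and, if it
 * is not on the CDM node, ask the kernel to move it there; pid 0 means
 * the calling process. */
static int place_on_cdm(void *addr)
{
	void *pages[1] = { addr };
	int nodes[1] = { CDM_NODE };
	int status[1];

	/* nodes == NULL: only report the node each page is on. */
	if (move_pages(0, 1, pages, NULL, status, 0) < 0)
		return -1;
	if (status[0] == CDM_NODE)
		return 0;
	return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0 ?
		-1 : 0;
}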



* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17 10:58                 ` Mel Gorman
@ 2017-05-17 19:35                   ` Benjamin Herrenschmidt
  2017-05-17 19:37                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-17 19:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, 2017-05-17 at 11:58 +0100, Mel Gorman wrote:
> Remember that this will include the page table pages which may or may
> not be what you want.

It is fine. The GPU does translation using ATS, so the page tables are
effectively accessed by the nest MMU in the corresponding P9 chip, not
by the GPU itself. Thus we do want them to reside in system memory.

Cheers,
Ben.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17 10:58                 ` Mel Gorman
  2017-05-17 19:35                   ` Benjamin Herrenschmidt
@ 2017-05-17 19:37                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-17 19:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, linux-mm, akpm, Anshuman Khandual, Aneesh Kumar KV,
	Paul E. McKenney, Srikar Dronamraju, Haren Myneni,
	Jérôme Glisse, Reza Arbab, Vlastimil Babka,
	Christoph Lameter, Rik van Riel

On Wed, 2017-05-17 at 11:58 +0100, Mel Gorman wrote:
> The race is a non-issue unless for some reason you decide to hot-add the node
> when the machine is already heavily loaded and under memory pressure. Do it
> near boot time and no CPU-local allocation is going to hit it. In itself,
> special casing the core VM is overkill.
> 
> If you decide to use ZONE_MOVABLE and take the remote hit penalty of page
> tables, then you can also migrate all the pages away after the onlining
> and isolation is complete if it's a serious concern in practice.
> 
> > Unless we have a way to create a node without actually making it
> > available for allocations, so we get a chance to establish policies for
> > it, then "online" it ?
> > 
> 
> Conceivably, that could be done although again it's somewhat overkill
> as the race only applies if hot-adding CDM under heavy memory pressure
> sufficient to overflow to a very remote node.

I wouldn't dismiss the problem that readily. It might be ok for our
initial customer needs, but in the long run there's a lot of demand for
SR-IOV GPUs and pass-through.

It's not far-fetched to have GPUs dynamically added to and removed from
partitions based on usage, which means possibly under significant memory
pressure.

That said, this can be solved later if needed.

Cheers,
Ben.


* Re: [RFC summary] Enable Coherent Device Memory
  2017-05-17 13:54         ` Christoph Lameter
@ 2017-05-17 19:39           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-17 19:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Balbir Singh, linux-mm, akpm, Anshuman Khandual,
	Aneesh Kumar KV, Paul E. McKenney, Srikar Dronamraju,
	Haren Myneni, Jérôme Glisse, Reza Arbab,
	Vlastimil Babka, Rik van Riel

On Wed, 2017-05-17 at 08:54 -0500, Christoph Lameter wrote:
> You can provide a library that does it?
> 
> > The base idea behind the counters we have on the link is for the HW to
> > know when memory is accessed "remotely", so that the device driver can
> > make decisions about migrating pages into or away from the device,
> > especially so that applications don't have to concern themselves with
> > memory placement.
> 
> Library can enquire about the current placement of the pages and move them
> if necessary?

No, doing that from a library would not work. It should be done by the
driver, but that's not a problem in the proposed scheme and doesn't
require new MM hooks afaik so I don't think there's a debate here.

From my understanding, the main discussion revolves around isolation,
i.e., whether to change the NUMA core to add nodes on which no allocation
will take place by default or not.

Ben.


end of thread, other threads:[~2017-05-17 19:39 UTC | newest]

Thread overview: 15+ messages
2017-05-12  6:18 [RFC summary] Enable Coherent Device Memory Balbir Singh
2017-05-12 10:26 ` Mel Gorman
2017-05-15 23:45   ` Balbir Singh
2017-05-16  8:43     ` Mel Gorman
2017-05-16 22:26       ` Benjamin Herrenschmidt
2017-05-17  8:28         ` Mel Gorman
2017-05-17  9:03           ` Benjamin Herrenschmidt
2017-05-17  9:15             ` Mel Gorman
2017-05-17  9:56               ` Benjamin Herrenschmidt
2017-05-17 10:58                 ` Mel Gorman
2017-05-17 19:35                   ` Benjamin Herrenschmidt
2017-05-17 19:37                   ` Benjamin Herrenschmidt
2017-05-17 12:41                 ` Michal Hocko
2017-05-17 13:54         ` Christoph Lameter
2017-05-17 19:39           ` Benjamin Herrenschmidt
