linux-kernel.vger.kernel.org archive mirror
* [RFC] mmap(MAP_CONTIG)
@ 2017-10-03 23:56 Mike Kravetz
  2017-10-04 11:54 ` Michal Nazarewicz
                   ` (4 more replies)
  0 siblings, 5 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-03 23:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Mike Kravetz

At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
titled 'User space contiguous memory allocation for DMA' [1].  The slides
point out the performance benefits of devices that can take advantage of
larger physically contiguous areas.

When such physically contiguous allocations are done today, they are done
within drivers themselves in an ad-hoc manner.  In addition to allocations
for DMA, allocations of this type are also performed for buffers used by
coprocessors and other acceleration engines.

As mentioned in the presentation, posix specifies an interface to obtain
physically contiguous memory.  This is via typed memory objects as described
in the posix_typed_mem_open() man page.  Since Linux today does not follow
the posix typed memory object model, adding infrastructure for contiguous
memory allocations seems to be overkill.  Instead, a proposal was suggested
to add support via a mmap flag: MAP_CONTIG.

mmap(MAP_CONTIG) would have the following semantics:
- The entire mapping (length size) would be backed by physically contiguous
  pages.
- If 'length' physically contiguous pages can not be allocated, then mmap
  will fail.
- MAP_CONTIG only works with MAP_ANONYMOUS mappings.
- MAP_CONTIG will lock the associated pages in memory.  As such, the same
  privileges and limits that apply to mlock will also apply to MAP_CONTIG.
- A MAP_CONTIG mapping can not be expanded.
- At fork time, private MAP_CONTIG mappings will be converted to regular
  (non-MAP_CONTIG) mappings in the child.  As such, a COW fault in the child
  will not require a contiguous allocation.

Some implementation considerations:
- alloc_contig_range() or similar will be used for allocations larger
  than MAX_ORDER.
- MAP_CONTIG should imply MAP_POPULATE.  At mmap time, all pages for the
  mapping must be 'pre-allocated', and they can only be used for the mapping,
  so it makes sense to 'fault in' all pages.
- Using 'pre-allocated' pages in the fault paths may be intrusive.
- We need to keep track of those pre-allocated pages until the vma is
  torn down, especially if free_contig_range() must be called.

Thoughts?
- Is such an interface useful?
- Any other ideas on how to achieve the same functionality?
- Any thoughts on implementation?

I have started down the path of pre-allocating contiguous pages at mmap
time and hanging those off the vma(vm_private_data) with some kludges to
use the pages at fault time.  It is really ugly, which is why I am not
sharing the code.  Hoping for some comments/suggestions.

[1] https://www.linuxplumbersconf.org/2017/ocw/proposals/4669
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
@ 2017-10-04 11:54 ` Michal Nazarewicz
  2017-10-04 17:08   ` Mike Kravetz
  2017-10-04 13:49 ` Anshuman Khandual
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 63+ messages in thread
From: Michal Nazarewicz @ 2017-10-04 11:54 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Aneesh Kumar K.V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter, Mike Kravetz

On Tue, Oct 03 2017, Mike Kravetz wrote:
> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
> titled 'User space contiguous memory allocation for DMA' [1].  The slides
> point out the performance benefits of devices that can take advantage of
> larger physically contiguous areas.

The issue I have is that the kind of memory needed may depend on the
device.  Some may require contiguous blocks.  Some may support
scatter-gather.  Some may be behind an IO-MMU and not care either way.

Furthermore, I feel déjà vu.  Wasn’t dmabuf supposed to address this
issue?

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
  2017-10-04 11:54 ` Michal Nazarewicz
@ 2017-10-04 13:49 ` Anshuman Khandual
  2017-10-04 16:05   ` Christopher Lameter
  2017-10-04 17:35   ` Mike Kravetz
  2017-10-05  7:06 ` Vlastimil Babka
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 63+ messages in thread
From: Anshuman Khandual @ 2017-10-04 13:49 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter

On 10/04/2017 05:26 AM, Mike Kravetz wrote:
> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
> titled 'User space contiguous memory allocation for DMA' [1].  The slides
> point out the performance benefits of devices that can take advantage of
> larger physically contiguous areas.
> 
> When such physically contiguous allocations are done today, they are done
> within drivers themselves in an ad-hoc manner.  In addition to allocations
> for DMA, allocations of this type are also performed for buffers used by
> coprocessors and other acceleration engines.

Right.

> 
> As mentioned in the presentation, posix specifies an interface to obtain
> physically contiguous memory.  This is via typed memory objects as described
> in the posix_typed_mem_open() man page.  Since Linux today does not follow
> the posix typed memory object model, adding infrastructure for contiguous
> memory allocations seems to be overkill.  Instead, a proposal was suggested
> to add support via a mmap flag: MAP_CONTIG.

Right.

> 
> mmap(MAP_CONTIG) would have the following semantics:
> - The entire mapping (length size) would be backed by physically contiguous
>   pages.
> - If 'length' physically contiguous pages can not be allocated, then mmap
>   will fail.
> - MAP_CONTIG only works with MAP_ANONYMOUS mappings.
> - MAP_CONTIG will lock the associated pages in memory.  As such, the same
>   privileges and limits that apply to mlock will also apply to MAP_CONTIG.
> - A MAP_CONTIG mapping can not be expanded.

Why?  Maybe we have memory around the edge of the existing mapping.  Why
give up before trying?

> - At fork time, private MAP_CONTIG mappings will be converted to regular
>   (non-MAP_CONTIG) mapping in the child.  As such a COW fault in the child
>   will not require a contiguous allocation.

Makes sense, but this needs to be documented, as the child still believes
the buffer came from a mmap(MAP_CONTIG) call in the parent.

> 
> Some implementation considerations:
> - alloc_contig_range() or similar will be used for allocations larger
>   than MAX_ORDER.

As I had also mentioned during the presentation at Plumbers, there should be
a fallback approach while attempting to allocate the contiguous memory.

- If order < MAX_ORDER -> alloc_pages()
- If order > MAX_ORDER -> alloc_contig_range()
- If alloc_contig_range() fails attempt a CMA based allocation scheme
  The CMA area should have been initialized at the boot exclusively for
  this purpose (may be with a CONFIG option if some one wants to go for
  this fallback at all) and use cma_alloc() on that area when we need
  to service MAP_CONTIG requests.

> - MAP_CONTIG should imply MAP_POPULATE.  At mmap time, all pages for the
>   mapping must be 'pre-allocated', and they can only be used for the mapping,
>   so it makes sense to 'fault in' all pages.


> - Using 'pre-allocated' pages in the fault paths may be intrusive.

But we have already faulted in all of them for the mapping and they
are also locked. Hence there should not be any page faults any more
for the VMA. Am I missing something here?

> - We need to keep track of those pre-allocated pages until the vma is
>   torn down, especially if free_contig_range() must be called

Right, probably tracking them as part of the vm_area_struct itself. 

> 
> Thoughts?
> - Is such an interface useful?
> - Any other ideas on how to achieve the same functionality?
> - Any thoughts on implementation?
> 
> I have started down the path of pre-allocating contiguous pages at mmap
> time and hanging those off the vma(vm_private_data) with some kludges to
> use the pages at fault time.  It is really ugly, which is why I am not
> sharing the code.  Hoping for some comments/suggestions.

I am still wondering why we wait till fault time instead of pre-faulting
all of them and populating the page tables.
> 
> [1] https://www.linuxplumbersconf.org/2017/ocw/proposals/4669
> 

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-04 13:49 ` Anshuman Khandual
@ 2017-10-04 16:05   ` Christopher Lameter
  2017-10-04 17:38     ` Mike Kravetz
  2017-10-04 17:35   ` Mike Kravetz
  1 sibling, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-04 16:05 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah

On Wed, 4 Oct 2017, Anshuman Khandual wrote:

> > - Using 'pre-allocated' pages in the fault paths may be intrusive.
>
> But we have already faulted in all of them for the mapping and they
> are also locked. Hence there should not be any page faults any more
> for the VMA. Am I missing something here ?

The PTEs may be torn down and have to be re-established through page
faults.  These page faults would not allocate memory.

> I am still wondering why we wait till fault time instead of pre-faulting
> all of them and populating the page tables.

They are populated, but some operations (swap and migration) may tear them
down.

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-04 11:54 ` Michal Nazarewicz
@ 2017-10-04 17:08   ` Mike Kravetz
  2017-10-04 21:29     ` Laura Abbott
  0 siblings, 1 reply; 63+ messages in thread
From: Mike Kravetz @ 2017-10-04 17:08 UTC (permalink / raw)
  To: Michal Nazarewicz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Aneesh Kumar K.V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter

On 10/04/2017 04:54 AM, Michal Nazarewicz wrote:
> On Tue, Oct 03 2017, Mike Kravetz wrote:
>> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
>> titled 'User space contiguous memory allocation for DMA' [1].  The slides
>> point out the performance benefits of devices that can take advantage of
>> larger physically contiguous areas.
> 
> Issue I have is that kind of memory needed may depend on a device.  Some
> may require contiguous blocks.  Some may support scatter-gather.  Some
> may be behind IO-MMU and not care either way.
> 
> Furthermore, I feel déjà vu.  Wasn’t dmabuf supposed to address this
> issue?

Thanks Michal,

I was unaware of dmabuf and am just now looking at capabilities.  The
question is whether or not the IB driver writers requesting mmap(MAP_CONTIG)
functionality could make use of dmabuf.  That is out of my area of expertise,
so I will let them reply.

-- 
Mike Kravetz

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-04 13:49 ` Anshuman Khandual
  2017-10-04 16:05   ` Christopher Lameter
@ 2017-10-04 17:35   ` Mike Kravetz
  1 sibling, 0 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-04 17:35 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter

On 10/04/2017 06:49 AM, Anshuman Khandual wrote:
> On 10/04/2017 05:26 AM, Mike Kravetz wrote:
>> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
>> titled 'User space contiguous memory allocation for DMA' [1].  The slides
>> point out the performance benefits of devices that can take advantage of
>> larger physically contiguous areas.
>>
>> When such physically contiguous allocations are done today, they are done
>> within drivers themselves in an ad-hoc manner.  In addition to allocations
>> for DMA, allocations of this type are also performed for buffers used by
>> coprocessors and other acceleration engines.
> 
> Right.
> 
>>
>> As mentioned in the presentation, posix specifies an interface to obtain
>> physically contiguous memory.  This is via typed memory objects as described
>> in the posix_typed_mem_open() man page.  Since Linux today does not follow
>> the posix typed memory object model, adding infrastructure for contiguous
>> memory allocations seems to be overkill.  Instead, a proposal was suggested
>> to add support via a mmap flag: MAP_CONTIG.
> 
> Right.
> 
>>
>> mmap(MAP_CONTIG) would have the following semantics:
>> - The entire mapping (length size) would be backed by physically contiguous
>>   pages.
>> - If 'length' physically contiguous pages can not be allocated, then mmap
>>   will fail.
>> - MAP_CONTIG only works with MAP_ANONYMOUS mappings.
>> - MAP_CONTIG will lock the associated pages in memory.  As such, the same
>>   privileges and limits that apply to mlock will also apply to MAP_CONTIG.
>> - A MAP_CONTIG mapping can not be expanded.
> 
> Why?  Maybe we have memory around the edge of the existing mapping.  Why
> give up before trying?

Just a simplification.  If not too complicated, we could add support for
expansion.  But, it may not be worth the cost and I do not know if there
would be any real use cases.

>> - At fork time, private MAP_CONTIG mappings will be converted to regular
>>   (non-MAP_CONTIG) mapping in the child.  As such a COW fault in the child
>>   will not require a contiguous allocation.
> 
> Makes sense, but this needs to be documented, as the child still believes
> the buffer came from a mmap(MAP_CONTIG) call in the parent.
> 
>>
>> Some implementation considerations:
>> - alloc_contig_range() or similar will be used for allocations larger
>>   than MAX_ORDER.
> 
> As I had also mentioned during the presentation at Plumbers, there should be
> a fallback approach while attempting to allocate the contiguous memory.
> 
> - If order < MAX_ORDER -> alloc_pages()
> - If order > MAX_ORDER -> alloc_contig_range()
> - If alloc_contig_range() fails attempt a CMA based allocation scheme
>   The CMA area should have been initialized at the boot exclusively for
>   this purpose (may be with a CONFIG option if some one wants to go for
>   this fallback at all) and use cma_alloc() on that area when we need
>   to service MAP_CONTIG requests.

I am not sure about the use of CMA and requiring admin setup.  It is
something that can be considered.  However, I suspect people would want
to avoid admin interaction/requirements if possible.

>> - MAP_CONTIG should imply MAP_POPULATE.  At mmap time, all pages for the
>>   mapping must be 'pre-allocated', and they can only be used for the mapping,
>>   so it makes sense to 'fault in' all pages.
> 
> 
>> - Using 'pre-allocated' pages in the fault paths may be intrusive.
> 
> But we have already faulted in all of them for the mapping and they
> are also locked. Hence there should not be any page faults any more
> for the VMA. Am I missing something here ?

I was referring to the action of pre-populating the mapping.  Today that
is done via the normal fault paths.  So, if we use this same scheme for
MAP_CONTIG, the fault paths would need to know about pre-allocated pages.

Sorry for not being more clear as that may have been a source of confusion.

>> - We need to keep track of those pre-allocated pages until the vma is
>>   torn down, especially if free_contig_range() must be called
> 
> Right, probably tracking them as part of the vm_area_struct itself. 
> 
>>
>> Thoughts?
>> - Is such an interface useful?
>> - Any other ideas on how to achieve the same functionality?
>> - Any thoughts on implementation?
>>
>> I have started down the path of pre-allocating contiguous pages at mmap
>> time and hanging those off the vma(vm_private_data) with some kludges to
>> use the pages at fault time.  It is really ugly, which is why I am not
>> sharing the code.  Hoping for some comments/suggestions.
> 
> I am still wondering why we wait till fault time instead of pre-faulting
> all of them and populating the page tables.

Yes, that is the idea.  I just did not state clearly above.

-- 
Mike Kravetz

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-04 16:05   ` Christopher Lameter
@ 2017-10-04 17:38     ` Mike Kravetz
  0 siblings, 0 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-04 17:38 UTC (permalink / raw)
  To: Christopher Lameter, Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-api, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K.V, Joonsoo Kim, Guy Shattah

On 10/04/2017 09:05 AM, Christopher Lameter wrote:
> On Wed, 4 Oct 2017, Anshuman Khandual wrote:
> 
>>> - Using 'pre-allocated' pages in the fault paths may be intrusive.
>>
>> But we have already faulted in all of them for the mapping and they
>> are also locked. Hence there should not be any page faults any more
>> for the VMA. Am I missing something here ?
> 
> The PTEs may be torn down and have to be re-established through page
> faults.  These page faults would not allocate memory.
> 
>> I am still wondering why we wait till fault time instead of pre-faulting
>> all of them and populating the page tables.
> 
> They are populated but some processes (swap and migration) may tear them
> down.

As mentioned in my reply to Anshuman, the mention of fault paths here
may be a source of confusion.  I would expect the entire mapping to be
populated at mmap time, and the pages locked.  Therefore, there should
be no swap or migration.

-- 
Mike Kravetz

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-04 17:08   ` Mike Kravetz
@ 2017-10-04 21:29     ` Laura Abbott
  0 siblings, 0 replies; 63+ messages in thread
From: Laura Abbott @ 2017-10-04 21:29 UTC (permalink / raw)
  To: Mike Kravetz, Michal Nazarewicz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Aneesh Kumar K.V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter

On 10/04/2017 10:08 AM, Mike Kravetz wrote:
> On 10/04/2017 04:54 AM, Michal Nazarewicz wrote:
>> On Tue, Oct 03 2017, Mike Kravetz wrote:
>>> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
>>> titled 'User space contiguous memory allocation for DMA' [1].  The slides
>>> point out the performance benefits of devices that can take advantage of
>>> larger physically contiguous areas.
>>
>> Issue I have is that kind of memory needed may depend on a device.  Some
>> may require contiguous blocks.  Some may support scatter-gather.  Some
>> may be behind IO-MMU and not care either way.
>>
>> Furthermore, I feel déjà vu.  Wasn’t dmabuf supposed to address this
>> issue?
> 
> Thanks Michal,
> 
> I was unaware of dmabuf and am just now looking at capabilities.  The
> question is whether or not the IB driver writers requesting mmap(MAP_CONTIG)
> functionality could make use of dmabuf.  That is out of my area of expertise,
> so I will let them reply.
> 

I don't think dmabuf as it exists today would help anything here.
It's designed to share buffers via fd but you still need some
place/driver to actually get the allocation and then export it
since there isn't a single interface for allocations. You could
convert drivers to take a dma_buf fd if there were appropriate
buffers available though.

Thanks,
Laura

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
  2017-10-04 11:54 ` Michal Nazarewicz
  2017-10-04 13:49 ` Anshuman Khandual
@ 2017-10-05  7:06 ` Vlastimil Babka
  2017-10-05 14:30   ` Christopher Lameter
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-23 22:10 ` [RFC] mmap(MAP_CONTIG) Dave Hansen
  4 siblings, 1 reply; 63+ messages in thread
From: Vlastimil Babka @ 2017-10-05  7:06 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter

On 10/04/2017 01:56 AM, Mike Kravetz wrote:

Hi,

> At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
> titled 'User space contiguous memory allocation for DMA' [1].  The slides

Hm I didn't find slides on that link, are they available?

> point out the performance benefits of devices that can take advantage of
> larger physically contiguous areas.
> 
> When such physically contiguous allocations are done today, they are done
> within drivers themselves in an ad-hoc manner.

As Michal N. noted, the drivers might have different requirements. Is
contiguity (without extra requirements) so common that it would benefit
from a userspace API change?
Also how are the driver-specific allocations done today? mmap() on the
driver's device? Maybe we could provide some in-kernel API/library to
make them less "ad-hoc". Conversion to MAP_ANONYMOUS would at first seem
like an improvement in that userspace would be able to use a generic
allocation API and all the generic treatment of anonymous pages (LRU
aging, reclaim, migration etc), but the restrictions you listed below
eliminate most of that?
(It's likely that I just don't have enough info about how it works today
so it's difficult to judge)

> In addition to allocations
> for DMA, allocations of this type are also performed for buffers used by
> coprocessors and other acceleration engines.
> 
> As mentioned in the presentation, posix specifies an interface to obtain
> physically contiguous memory.  This is via typed memory objects as described
> in the posix_typed_mem_open() man page.  Since Linux today does not follow
> the posix typed memory object model, adding infrastructure for contiguous
> memory allocations seems to be overkill.  Instead, a proposal was suggested
> to add support via a mmap flag: MAP_CONTIG.
> 
> mmap(MAP_CONTIG) would have the following semantics:
> - The entire mapping (length size) would be backed by physically contiguous
>   pages.
> - If 'length' physically contiguous pages can not be allocated, then mmap
>   will fail.
> - MAP_CONTIG only works with MAP_ANONYMOUS mappings.
> - MAP_CONTIG will lock the associated pages in memory.  As such, the same
>   privileges and limits that apply to mlock will also apply to MAP_CONTIG.
> - A MAP_CONTIG mapping can not be expanded.
> - At fork time, private MAP_CONTIG mappings will be converted to regular
>   (non-MAP_CONTIG) mapping in the child.  As such a COW fault in the child
>   will not require a contiguous allocation.
> 
> Some implementation considerations:
> - alloc_contig_range() or similar will be used for allocations larger
>   than MAX_ORDER.
> - MAP_CONTIG should imply MAP_POPULATE.  At mmap time, all pages for the
>   mapping must be 'pre-allocated', and they can only be used for the mapping,
>   so it makes sense to 'fault in' all pages.
> - Using 'pre-allocated' pages in the fault paths may be intrusive.
> - We need to keep track of those pre-allocated pages until the vma is
>   torn down, especially if free_contig_range() must be called.
> 
> Thoughts?
> - Is such an interface useful?
> - Any other ideas on how to achieve the same functionality?
> - Any thoughts on implementation?
> 
> I have started down the path of pre-allocating contiguous pages at mmap
> time and hanging those off the vma(vm_private_data) with some kludges to
> use the pages at fault time.  It is really ugly, which is why I am not
> sharing the code.  Hoping for some comments/suggestions.
> 
> [1] https://www.linuxplumbersconf.org/2017/ocw/proposals/4669
> 

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-05  7:06 ` Vlastimil Babka
@ 2017-10-05 14:30   ` Christopher Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christopher Lameter @ 2017-10-05 14:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah

On Thu, 5 Oct 2017, Vlastimil Babka wrote:

> On 10/04/2017 01:56 AM, Mike Kravetz wrote:
> > At Plumbers this year, Guy Shattah and Christoph Lameter gave a presentation
> > titled 'User space contiguous memory allocation for DMA' [1].  The slides
> Hm I didn't find slides on that link, are they available?

I just added Guy's slides to the entry.

> As Michal N. noted, the drivers might have different requirements. Is
> contiguity (without extra requirements) so common that it would benefit
> from a userspace API change?

Yes.

> Also how are the driver-specific allocations done today? mmap() on the
> driver's device? Maybe we could provide some in-kernel API/library to
> make them less "ad-hoc". Conversion to MAP_ANONYMOUS would at first seem
> like an improvement in that userspace would be able to use a generic
> allocation API and all the generic treatment of anonymous pages (LRU
> aging, reclaim, migration etc), but the restrictions you listed below
> eliminate most of that?
> (It's likely that I just don't have enough info about how it works today
> so it's difficult to judge)

Contemporary devices typically can address all of memory.  Moreover, the
devices used can actually trigger faults to page in 4k pages if they are
not present (ODP in the RDMA layer).  There is no need for driver-specific
allocation in those drivers.

* [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support
  2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
                   ` (2 preceding siblings ...)
  2017-10-05  7:06 ` Vlastimil Babka
@ 2017-10-12  1:46 ` Mike Kravetz
  2017-10-12  1:46   ` [RFC PATCH 1/3] mm/map_contig: Add VM_CONTIG flag to vma struct Mike Kravetz
                     ` (3 more replies)
  2017-10-23 22:10 ` [RFC] mmap(MAP_CONTIG) Dave Hansen
  4 siblings, 4 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-12  1:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka, Mike Kravetz

The following is a 'possible' way to add such functionality.  I just
did what was easy and pre-allocated contiguous pages which are used
to populate the mapping.  I did not use any of the higher order
allocators such as alloc_contig_range.  Therefore, it is limited to
allocations of MAX_ORDER size.  Also, the allocations should probably
be done outside mmap_sem but that was the easiest place to do it in
this quick and easy POC.

I just wanted to throw out some code to get further ideas.  It is far
from complete.

Mike Kravetz (3):
  mm/map_contig: Add VM_CONTIG flag to vma struct
  mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings
  mm/map_contig: Add mmap(MAP_CONTIG) support

 include/linux/mm.h              |  1 +
 include/uapi/asm-generic/mman.h |  1 +
 kernel/fork.c                   |  2 +-
 mm/memory.c                     | 13 +++++-
 mm/mmap.c                       | 94 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 109 insertions(+), 2 deletions(-)

-- 
2.13.6

* [RFC PATCH 1/3] mm/map_contig: Add VM_CONTIG flag to vma struct
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
@ 2017-10-12  1:46   ` Mike Kravetz
  2017-10-12  1:46   ` [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings Mike Kravetz
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-12  1:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka, Mike Kravetz

Add the flag VM_CONTIG to the vma structure to identify vmas which are
backed by contiguous memory allocations.  This flag is not propagated
to child processes, so it must be cleared at fork time.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h | 1 +
 kernel/fork.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..db82f172fbd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,6 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
+#define VM_CONTIG	0x00800000	/* Contiguous page backing */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
diff --git a/kernel/fork.c b/kernel/fork.c
index e702cb9ffbd8..d93b022e4909 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -665,7 +665,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 				goto fail_nomem_anon_vma_fork;
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_CONTIG);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
-- 
2.13.6

* [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-12  1:46   ` [RFC PATCH 1/3] mm/map_contig: Add VM_CONTIG flag to vma struct Mike Kravetz
@ 2017-10-12  1:46   ` Mike Kravetz
  2017-10-12 11:04     ` Anshuman Khandual
  2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-12 10:36   ` [RFC PATCH 0/3] " Anshuman Khandual
  3 siblings, 1 reply; 63+ messages in thread
From: Mike Kravetz @ 2017-10-12  1:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka, Mike Kravetz

When populating mappings backed by contiguous memory allocations
(VM_CONTIG), use the preallocated pages instead of allocating new.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..fbef78d07cf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3100,7 +3100,18 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+
+	/*
+	 * In the special VM_CONTIG case, pages have been pre-allocated. So,
+	 * simply grab the appropriate pre-allocated page.
+	 */
+	if (unlikely(vma->vm_flags & VM_CONTIG)) {
+		VM_BUG_ON(!vma->vm_private_data);
+		page = ((struct page *)vma->vm_private_data) +
+			((vmf->address - vma->vm_start) / PAGE_SIZE);
+	} else {
+		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+	}
 	if (!page)
 		goto oom;
 
-- 
2.13.6

* [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-12  1:46   ` [RFC PATCH 1/3] mm/map_contig: Add VM_CONTIG flag to vma struct Mike Kravetz
  2017-10-12  1:46   ` [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings Mike Kravetz
@ 2017-10-12  1:46   ` Mike Kravetz
  2017-10-12 11:22     ` Anshuman Khandual
                       ` (2 more replies)
  2017-10-12 10:36   ` [RFC PATCH 0/3] " Anshuman Khandual
  3 siblings, 3 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-12  1:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka, Mike Kravetz

Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
mmap flag processing.  If present, pre-allocate a contiguous set of
pages to back the mapping.  These pages will be used at fault time, and
the MAP_CONTIG flag implies populating the mapping at mmap time.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/uapi/asm-generic/mman.h |  1 +
 mm/mmap.c                       | 94 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 7162cd4cca73..e8046b4c4ac4 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_CONTIG	0x80000		/* back with contiguous pages */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..aee7917ee073 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
 
+	if (vma->vm_flags & VM_CONTIG) {
+		/*
+		 * Do any necessary clean up when freeing a vma backed
+		 * by a contiguous allocation.
+		 *
+	 * Not very useful in its present form.
+		 */
+		VM_BUG_ON(!vma->vm_private_data);
+		vma->vm_private_data = NULL;
+	}
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
@@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
+	/*
+	 * MAP_CONTIG has some restrictions,
+	 * and also implies additional mmap and vma flags.
+	 */
+	if (flags & MAP_CONTIG) {
+		if (!(flags & MAP_ANONYMOUS))
+			return -EINVAL;
+
+		flags |= MAP_POPULATE | MAP_LOCKED;
+		vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
+	}
+
 	if (flags & MAP_LOCKED)
 		if (!can_do_mlock())
 			return -EPERM;
@@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */
 
 /*
+ * Attempt to allocate a contiguous range of pages to back the
+ * specified vma.  vm_private_data is used as a 'pointer' to the
+ * allocated pages.  Larger requests and more fragmented memory
+ * make the allocation more likely to fail.  So, caller must deal
+ * with this situation.
+ */
+static long __alloc_vma_contig_range(struct vm_area_struct *vma)
+{
+	gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;
+	unsigned long order;
+
+	VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
+	order = get_order(vma->vm_end - vma->vm_start);
+
+	/*
+	 * FIXME - Incomplete implementation.  For now, just handle
+	 * allocations < MAX_ORDER in size.  However, this should really
+	 * handle arbitrary size allocations.
+	 */
+	if (order >= MAX_ORDER)
+		return -ENOMEM;
+
+	vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
+						numa_node_id(), false);
+	if (!vma->vm_private_data)
+		return -ENOMEM;
+
+	/*
+	 * split large allocation so it can be treated as individual
+	 * pages when populating the mapping and at unmap time.
+	 */
+	if (order) {
+		unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
+								PAGE_SIZE;
+		unsigned long order_pages = 1 << order;
+		unsigned long i;
+		struct page *page = vma->vm_private_data;
+
+		split_page((struct page *)vma->vm_private_data, order);
+
+		/*
+		 * 'order' rounds up size of vma to next power of 2.  We
+		 * will not need/use the extra pages so free them now.
+		 */
+		for (i = vma_pages; i < order_pages; i++)
+			put_page(page + i);
+	}
+
+	return 0;
+}
+
+static void __free_vma_contig_range(struct vm_area_struct *vma)
+{
+	struct page *page = vma->vm_private_data;
+	unsigned long n_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
+	unsigned long i;
+
+	if (!page)
+		return;
+
+	for (i = 0; i < n_pages; i++)
+		put_page(page + i);
+}
+
+/*
  * Some shared mappigns will want the pages marked read-only
  * to track write events. If so, we'll downgrade vm_page_prot
  * to the private version (using protection_map[] without the
@@ -1669,6 +1756,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_pgoff = pgoff;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 
+	if (vm_flags & VM_CONTIG) {
+		error = __alloc_vma_contig_range(vma);
+		if (error)
+			goto free_vma;
+	}
+
 	if (file) {
 		if (vm_flags & VM_DENYWRITE) {
 			error = deny_write_access(file);
@@ -1758,6 +1851,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (vm_flags & VM_DENYWRITE)
 		allow_write_access(file);
 free_vma:
+	__free_vma_contig_range(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
                     ` (2 preceding siblings ...)
  2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
@ 2017-10-12 10:36   ` Anshuman Khandual
  2017-10-12 14:25     ` Anshuman Khandual
  3 siblings, 1 reply; 63+ messages in thread
From: Anshuman Khandual @ 2017-10-12 10:36 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

On 10/12/2017 07:16 AM, Mike Kravetz wrote:
> The following is a 'possible' way to add such functionality.  I just
> did what was easy and pre-allocated contiguous pages which are used
> to populate the mapping.  I did not use any of the higher order
> allocators such as alloc_contig_range.  Therefore, it is limited to

I just tried a small prototype with an implementation similar to that of
alloc_gigantic_page(), where we scan the zones (the applicable zonelist)
for a contiguous valid PFN range and try allocating with alloc_contig_range().
Will share it soon.

> allocations of MAX_ORDER size.  Also, the allocations should probably

I just did a quick test and it worked up to 1UL << (MAX_ORDER - 1) pages
on a POWER system with the current RFC patches. As the pages are allocated
at VMA creation time, a comparison to normal page-fault speed while
accessing the buffer won't be fair.

> be done outside mmap_sem but that was the easiest place to do it in
> this quick and easy POC.

Why should it be done outside mmap_sem? Because it can take some time?
But then the VMA could just go away while we are allocating the big
chunks of pages (if we don't hold mmap_sem).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings
  2017-10-12  1:46   ` [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings Mike Kravetz
@ 2017-10-12 11:04     ` Anshuman Khandual
  0 siblings, 0 replies; 63+ messages in thread
From: Anshuman Khandual @ 2017-10-12 11:04 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

On 10/12/2017 07:16 AM, Mike Kravetz wrote:
> When populating mappings backed by contiguous memory allocations
> (VM_CONTIG), use the pre-allocated pages instead of allocating new ones.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/memory.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed16c20..fbef78d07cf3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3100,7 +3100,18 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	/* Allocate our own private page. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> +
> +	/*
> +	 * In the special VM_CONTIG case, pages have been pre-allocated. So,
> +	 * simply grab the appropriate pre-allocated page.
> +	 */
> +	if (unlikely(vma->vm_flags & VM_CONTIG)) {
> +		VM_BUG_ON(!vma->vm_private_data);
> +		page = ((struct page *)vma->vm_private_data) +
> +			((vmf->address - vma->vm_start) / PAGE_SIZE);
> +	} else {
> +		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);

vm_private_data should be fine. It seems it is already being used for HugeTLB,
special mappings, and shared memory as well. As long as we don't cross
paths with those (say, when enabling this for file mappings etc. with
MAP_CONTIG), we can keep using vm_private_data.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
@ 2017-10-12 11:22     ` Anshuman Khandual
  2017-10-13 15:14       ` Christopher Lameter
  2017-10-12 14:37     ` Michal Hocko
  2017-10-15  8:07     ` Guy Shattah
  2 siblings, 1 reply; 63+ messages in thread
From: Anshuman Khandual @ 2017-10-12 11:22 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

On 10/12/2017 07:16 AM, Mike Kravetz wrote:
> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> mmap flag processing.  If present, pre-allocate a contiguous set of
> pages to back the mapping.  These pages will be used at fault time, and
> the MAP_CONTIG flag implies populating the mapping at mmap time.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/uapi/asm-generic/mman.h |  1 +
>  mm/mmap.c                       | 94 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 95 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 7162cd4cca73..e8046b4c4ac4 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_CONTIG	0x80000		/* back with contiguous pages */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..aee7917ee073 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  {
>  	struct vm_area_struct *next = vma->vm_next;
>  
> +	if (vma->vm_flags & VM_CONTIG) {
> +		/*
> +		 * Do any necessary clean up when freeing a vma backed
> +		 * by a contiguous allocation.
> +		 *
> +	 * Not very useful in its present form.
> +		 */
> +		VM_BUG_ON(!vma->vm_private_data);
> +		vma->vm_private_data = NULL;
> +	}
>  	might_sleep();
>  	if (vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
> @@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> +	/*
> +	 * MAP_CONTIG has some restrictions,
> +	 * and also implies additional mmap and vma flags.
> +	 */
> +	if (flags & MAP_CONTIG) {
> +		if (!(flags & MAP_ANONYMOUS))
> +			return -EINVAL;
> +
> +		flags |= MAP_POPULATE | MAP_LOCKED;
> +		vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
> +	}
> +
>  	if (flags & MAP_LOCKED)
>  		if (!can_do_mlock())
>  			return -EPERM;
> @@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>  
>  /*
> + * Attempt to allocate a contiguous range of pages to back the
> + * specified vma.  vm_private_data is used as a 'pointer' to the
> + * allocated pages.  Larger requests and more fragmented memory
> + * make the allocation more likely to fail.  So, caller must deal
> + * with this situation.
> + */
> +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> +{
> +	gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;

Should it be GFP_HIGHUSER_MOVABLE instead? And why __GFP_ZERO? If it is
coming from the buddy allocator, everything should have already been
zeroed out there. Am I missing something?

> +	unsigned long order;
> +
> +	VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
> +	order = get_order(vma->vm_end - vma->vm_start);
> +
> +	/*
> +	 * FIXME - Incomplete implementation.  For now, just handle
> +	 * allocations < MAX_ORDER in size.  However, this should really
> +	 * handle arbitrary size allocations.
> +	 */
> +	if (order >= MAX_ORDER)
> +		return -ENOMEM;
> +
> +	vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
> +						numa_node_id(), false);

This is where I was experimenting with alloc_contig_range() for
requests beyond MAX_ORDER.

> +	if (!vma->vm_private_data)
> +		return -ENOMEM;
> +
> +	/*
> +	 * split large allocation so it can be treated as individual
> +	 * pages when populating the mapping and at unmap time.
> +	 */
> +	if (order) {
> +		unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
> +								PAGE_SIZE;
> +		unsigned long order_pages = 1 << order;
> +		unsigned long i;
> +		struct page *page = vma->vm_private_data;
> +
> +		split_page((struct page *)vma->vm_private_data, order);
> +
> +		/*
> +		 * 'order' rounds up size of vma to next power of 2.  We
> +		 * will not need/use the extra pages so free them now.
> +		 */
> +		for (i = vma_pages; i < order_pages; i++)
> +			put_page(page + i);

Interesting. These extra pages should be kept around a little longer if
we would like to support expanding this VMA up to the next power of 2,
which we originally allocated for anyway.

> +	}
> +
> +	return 0;
> +}
> +
> +static void __free_vma_contig_range(struct vm_area_struct *vma)
> +{
> +	struct page *page = vma->vm_private_data;
> +	unsigned long n_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
> +	unsigned long i;
> +
> +	if (!page)
> +		return;
> +
> +	for (i = 0; i < n_pages; i++)
> +		put_page(page + i);
> +}
> +
> +/*
>   * Some shared mappigns will want the pages marked read-only
>   * to track write events. If so, we'll downgrade vm_page_prot
>   * to the private version (using protection_map[] without the
> @@ -1669,6 +1756,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vma->vm_pgoff = pgoff;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
>  
> +	if (vm_flags & VM_CONTIG) {
> +		error = __alloc_vma_contig_range(vma);
> +		if (error)
> +			goto free_vma;
> +	}
> +

You wanted to have this outside of the mmap_sem lock, right?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support
  2017-10-12 10:36   ` [RFC PATCH 0/3] " Anshuman Khandual
@ 2017-10-12 14:25     ` Anshuman Khandual
  0 siblings, 0 replies; 63+ messages in thread
From: Anshuman Khandual @ 2017-10-12 14:25 UTC (permalink / raw)
  To: Anshuman Khandual, Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter, Laura Abbott,
	Vlastimil Babka

On 10/12/2017 04:06 PM, Anshuman Khandual wrote:
> On 10/12/2017 07:16 AM, Mike Kravetz wrote:
>> The following is a 'possible' way to add such functionality.  I just
>> did what was easy and pre-allocated contiguous pages which are used
>> to populate the mapping.  I did not use any of the higher order
>> allocators such as alloc_contig_range.  Therefore, it is limited to
> Just tried with a small prototype with an implementation similar to that
> of alloc_gigantic_page() where we scan the zones (applicable zonelist)
> for contiguous valid PFN range and try allocating with alloc_contig_range.
> Will share it soon.
> 

With this patch on top of the series, we can allocate a little more than
twice 1UL << (MAX_ORDER - 1) pages on POWER. But the problem is that the
amount keeps reducing on every attempt until it reaches
1UL << (MAX_ORDER - 1). Will look into it.

diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 03c06ba..ce13b36 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -28,5 +28,6 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_CONTIG	0x80000		/* back with contiguous pages */
 
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index aee7917..4e6588d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1568,6 +1568,60 @@ struct mmap_arg_struct {
 }
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */
 
+static bool is_pfn_range_valid(struct zone *z,
+	unsigned long start_pfn, unsigned long nr_pages)
+{
+	unsigned long i, end_pfn = start_pfn + nr_pages;
+	struct page *page;
+
+	for (i = start_pfn; i < end_pfn; i++) {
+		if (!pfn_valid(i))
+			return false;
+
+		page = pfn_to_page(i);
+		if (page_zone(page) != z)
+			return false;
+
+		if (PageReserved(page))
+			return false;
+
+		if (page_count(page) > 0)
+			return false;
+
+		if (PageHuge(page))
+			return false;
+	}
+	return true;
+}
+
+struct page *
+alloc_pages_vma_contig(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr, int node, bool hugepage)
+{
+	struct zonelist *zonelist = node_zonelist(node, gfp);
+	struct zoneref *z;
+	struct zone *zone;
+	unsigned long pfn, nr_pages, flags, ret;
+
+	nr_pages = 1 << order;
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp), NULL) {
+		spin_lock_irqsave(&zone->lock, flags);
+		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+		while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
+			if (is_pfn_range_valid(zone, pfn, nr_pages)) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				ret = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE, gfp);
+				if (!ret)
+					return pfn_to_page(pfn);
+				spin_lock_irqsave(&zone->lock, flags);
+			}
+			pfn += nr_pages;
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+	return NULL;
+}
+
 /*
  * Attempt to allocate a contiguous range of pages to back the
  * specified vma.  vm_private_data is used as a 'pointer' to the
@@ -1588,11 +1642,19 @@ static long __alloc_vma_contig_range(struct vm_area_struct *vma)
 	 * allocations < MAX_ORDER in size.  However, this should really
 	 * handle arbitrary size allocations.
 	 */
+
+	/*
 	if (order >= MAX_ORDER)
 		return -ENOMEM;
 
-	vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
-						numa_node_id(), false);
+	*/
+
+	if (order >= MAX_ORDER)
+		vma->vm_private_data = alloc_pages_vma_contig(gfp, order, vma,
+					vma->vm_start, numa_node_id(), false);
+	else
+		vma->vm_private_data = alloc_pages_vma(gfp, order, vma,
+					vma->vm_start, numa_node_id(), false);
 	if (!vma->vm_private_data)
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-12 11:22     ` Anshuman Khandual
@ 2017-10-12 14:37     ` Michal Hocko
  2017-10-12 17:19       ` Mike Kravetz
  2017-10-15  8:07     ` Guy Shattah
  2 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-12 14:37 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, linux-api, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> mmap flag processing.  If present, pre-allocate a contiguous set of
> pages to back the mapping.  These pages will be used at fault time, and
> the MAP_CONTIG flag implies populating the mapping at mmap time.

I have only briefly read through the previous discussion and it is still
not clear to me _why_ we want such an interface. I didn't give it much
time yet, but I do not think this is a good idea at all. Why? Do we want
any user to simply consume larger-order memory blocks? What would
prevent that? Also, why should even userspace care about larger
memory blocks? We have huge pages (be it preallocated or transparent)
for that purpose already. Why should we add yet another type
of physically contiguous memory? What is the guarantee of such a mapping?
Does the memory always stay contiguous? How contiguous will it be?
Who is going to use such an interface? And probably many other
questions...

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/uapi/asm-generic/mman.h |  1 +
>  mm/mmap.c                       | 94 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 95 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 7162cd4cca73..e8046b4c4ac4 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_CONTIG	0x80000		/* back with contiguous pages */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..aee7917ee073 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  {
>  	struct vm_area_struct *next = vma->vm_next;
>  
> +	if (vma->vm_flags & VM_CONTIG) {
> +		/*
> +		 * Do any necessary clean up when freeing a vma backed
> +		 * by a contiguous allocation.
> +		 *
> +	 * Not very useful in its present form.
> +		 */
> +		VM_BUG_ON(!vma->vm_private_data);
> +		vma->vm_private_data = NULL;
> +	}
>  	might_sleep();
>  	if (vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
> @@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> +	/*
> +	 * MAP_CONTIG has some restrictions,
> +	 * and also implies additional mmap and vma flags.
> +	 */
> +	if (flags & MAP_CONTIG) {
> +		if (!(flags & MAP_ANONYMOUS))
> +			return -EINVAL;
> +
> +		flags |= MAP_POPULATE | MAP_LOCKED;
> +		vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
> +	}
> +
>  	if (flags & MAP_LOCKED)
>  		if (!can_do_mlock())
>  			return -EPERM;
> @@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>  
>  /*
> + * Attempt to allocate a contiguous range of pages to back the
> + * specified vma.  vm_private_data is used as a 'pointer' to the
> + * allocated pages.  Larger requests and more fragmented memory
> + * make the allocation more likely to fail.  So, caller must deal
> + * with this situation.
> + */
> +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> +{
> +	gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;
> +	unsigned long order;
> +
> +	VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
> +	order = get_order(vma->vm_end - vma->vm_start);
> +
> +	/*
> +	 * FIXME - Incomplete implementation.  For now, just handle
> +	 * allocations < MAX_ORDER in size.  However, this should really
> +	 * handle arbitrary size allocations.
> +	 */
> +	if (order >= MAX_ORDER)
> +		return -ENOMEM;
> +
> +	vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
> +						numa_node_id(), false);
> +	if (!vma->vm_private_data)
> +		return -ENOMEM;
> +
> +	/*
> +	 * split large allocation so it can be treated as individual
> +	 * pages when populating the mapping and at unmap time.
> +	 */
> +	if (order) {
> +		unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
> +								PAGE_SIZE;
> +		unsigned long order_pages = 1 << order;
> +		unsigned long i;
> +		struct page *page = vma->vm_private_data;
> +
> +		split_page((struct page *)vma->vm_private_data, order);
> +
> +		/*
> +		 * 'order' rounds up size of vma to next power of 2.  We
> +		 * will not need/use the extra pages so free them now.
> +		 */
> +		for (i = vma_pages; i < order_pages; i++)
> +			put_page(page + i);
> +	}
> +
> +	return 0;
> +}
> +
> +static void __free_vma_contig_range(struct vm_area_struct *vma)
> +{
> +	struct page *page = vma->vm_private_data;
> +	unsigned long n_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
> +	unsigned long i;
> +
> +	if (!page)
> +		return;
> +
> +	for (i = 0; i < n_pages; i++)
> +		put_page(page + i);
> +}
> +
> +/*
>   * Some shared mappigns will want the pages marked read-only
>   * to track write events. If so, we'll downgrade vm_page_prot
>   * to the private version (using protection_map[] without the
> @@ -1669,6 +1756,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vma->vm_pgoff = pgoff;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
>  
> +	if (vm_flags & VM_CONTIG) {
> +		error = __alloc_vma_contig_range(vma);
> +		if (error)
> +			goto free_vma;
> +	}
> +
>  	if (file) {
>  		if (vm_flags & VM_DENYWRITE) {
>  			error = deny_write_access(file);
> @@ -1758,6 +1851,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	if (vm_flags & VM_DENYWRITE)
>  		allow_write_access(file);
>  free_vma:
> +	__free_vma_contig_range(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  unacct_error:
>  	if (charged)
> -- 
> 2.13.6
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12 14:37     ` Michal Hocko
@ 2017-10-12 17:19       ` Mike Kravetz
  2017-10-13  8:40         ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Mike Kravetz @ 2017-10-12 17:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, linux-api, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On 10/12/2017 07:37 AM, Michal Hocko wrote:
> On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
>> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
>> mmap flag processing.  If present, pre-allocate a contiguous set of
>> pages to back the mapping.  These pages will be used at fault time, and
>> the MAP_CONTIG flag implies populating the mapping at mmap time.
> 
> I have only briefly read through the previous discussion and it is still
> not clear to me _why_ we want such an interface. I didn't give it much
> time yet but I do not think this is a good idea at all.

Thanks for looking Michal.  The primary use case comes from devices that can
realize performance benefits if operating on physically contiguous memory.
What sparked this effort was Christoph and Guy's plumbers presentation
where they showed RDMA performance benefits that could be realized with
contiguous memory.  I also remember sitting in a presentation about
Intel's QuickAssist technology at Vault last year.  The presenter mentioned
that their compression engine needed to be passed a physically contiguous
buffer.  I asked how a user could obtain such a buffer.  They said they
had a special driver/ioctl for that.  Yuck!  I'm guessing there are other
specific use cases.  That is why I wanted to start the discussion as to
whether there should be an interface to provide this functionality.

>                                                         Why? Do we want
> any user to simply consume larger order memory blocks? What would
> prevent that?

We certainly would want to put restrictions in place for contiguous
memory allocations.  Since it makes sense to pre-populate and lock
contiguous allocations, using the same restrictions as mlock is a start.
However, I can see the possible need for more restrictions.

>                    Also why should even userspace care about larger
> memory blocks? We have huge pages (be it preallocated or transparent)
> for that purpose already. Why should we add yet another type

The 'sweet spot' for the Mellanox RDMA example is 2GB.  We cannot
achieve that with huge pages (on x86) today.

>                                  What is the guarantee of such a mapping?

There is no guarantee.  My suggestion is that mmap(MAP_CONTIG) would fail
with ENOMEM if a sufficiently sized contiguous area could not be found.
The caller would need to deal with failure.

> Does the memory always stay contiguous? How contiguous will it be?

Yes, it remains contiguous.  It is locked in memory.

> Who is going to use such an interface? And probably many other
> questions...

Thanks for asking.  I am just throwing out the idea of providing an interface
for doing contiguous memory allocations from user space.  There are at least
two (and possibly more) devices that could benefit from such an interface.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12 17:19       ` Mike Kravetz
@ 2017-10-13  8:40         ` Michal Hocko
  2017-10-13 15:20           ` Christopher Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-13  8:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, linux-api, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah,
	Christoph Lameter, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Thu 12-10-17 10:19:16, Mike Kravetz wrote:
> On 10/12/2017 07:37 AM, Michal Hocko wrote:
> > On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
> >> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> >> mmap flag processing.  If present, pre-allocate a contiguous set of
> >> pages to back the mapping.  These pages will be used at fault time, and
> >> the MAP_CONTIG flag implies populating the mapping at mmap time.
> > 
> > I have only briefly read through the previous discussion and it is still
> > not clear to me _why_ we want such an interface. I didn't give it much
> > time yet but I do not think this is a good idea at all.
> 
> Thanks for looking Michal.  The primary use case comes from devices that can
> realize performance benefits if operating on physically contiguous memory.
> What sparked this effort was Christoph and Guy's plumbers presentation
> where they showed RDMA performance benefits that could be realized with
> contiguous memory.  I also remember sitting in a presentation about
Intel's QuickAssist technology at Vault last year.  The presenter mentioned
> that their compression engine needed to be passed a physically contiguous
> buffer.  I asked how a user could obtain such a buffer.  They said they
> had a special driver/ioctl for that.  Yuck!  I'm guessing there are other
> specific use cases.  That is why I wanted to start the discussion as to
> whether there should be an interface to provide this functionality.

I would, quite to the contrary, suggest a device-specific mmap implementation
which would guarantee both the best memory wrt. the physical contiguity
aspect as well as the placement - what if the device has a restriction
on that as well?
 
> > any user to simply consume larger order memory blocks? What would
> > prevent from that?
> 
> We certainly would want to put restrictions in place for contiguous
> memory allocations.  Since it makes sense to pre-populate and lock
> contiguous allocations, using the same restrictions as mlock is a start.
> However, I can see the possible need for more restrictions.
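The mlock-style restriction mentioned above can be inspected from user space; a caller could check RLIMIT_MEMLOCK before attempting a locked contiguous mapping. A minimal sketch (not part of the proposal itself, just the existing rlimit API):

```c
#include <sys/resource.h>

/* Returns the number of bytes this process may lock with mlock()
 * (and, under the proposal, with mmap(MAP_CONTIG)).
 * Returns -1 on error and -2 for "unlimited", for simplicity. */
long long memlock_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -2;
	return (long long)rl.rlim_cur;
}
```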

Absolutely. The mlock limit is per process (resp. per mm), so a single user
could simply deplete large blocks. No good...
 
> > Does the memory always stay contiguous? How contiguous will it be?
> 
> Yes, it remains contiguous.  It is locked in memory.

Hmm, so hugetlb on steroids...

> > Who is going to use such an interface? And probably many other
> > questions...
> 
> Thanks for asking.  I am just throwing out the idea of providing an interface
> for doing contiguous memory allocations from user space.  There are at least
> two (and possibly more) devices that could benefit from such an interface.

I am not really convinced this is a good interface. You are basically
trying to bypass virtual memory abstraction and that is quite
contradicting the mmap API to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12 11:22     ` Anshuman Khandual
@ 2017-10-13 15:14       ` Christopher Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christopher Lameter @ 2017-10-13 15:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Laura Abbott, Vlastimil Babka

On Thu, 12 Oct 2017, Anshuman Khandual wrote:

> > +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> > +{
> > +	gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;
>
> Would it be GFP_HIGHUSER_MOVABLE instead? Why __GFP_ZERO? If it's
> coming from the buddy allocator, everything should have already been
> zeroed out in there. Am I missing something?

Contiguous pages cannot and should not be moved. They will no longer be
contiguous then. Also the page migration code cannot handle this case.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13  8:40         ` Michal Hocko
@ 2017-10-13 15:20           ` Christopher Lameter
  2017-10-13 15:28             ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-13 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri, 13 Oct 2017, Michal Hocko wrote:

> I would, quite to the contrary, suggest a device-specific mmap implementation
> which would guarantee both the best memory wrt. the physical contiguity
> aspect as well as the placement - what if the device has a restriction
> on that as well?

Contemporary high-end devices can handle all of memory. If someone does
not need everything the hardware can give in terms of speed, then they
also won't need contiguous memory.

> > Yes, it remains contiguous.  It is locked in memory.
>
> Hmm, so hugetlb on steroids...

It's actually better because there is no requirement to allocate in
exactly 2M chunks. The remainder can be used for regular 4k page
allocations.

> > > Who is going to use such an interface? And probably many other
> > > questions...
> >
> > Thanks for asking.  I am just throwing out the idea of providing an interface
> > for doing contiguous memory allocations from user space.  There are at least
> > two (and possibly more) devices that could benefit from such an interface.
>
> I am not really convinced this is a good interface. You are basically
> trying to bypass virtual memory abstraction and that is quite
> contradicting the mmap API to me.

This is a standardized posix interface as described in our presentation at
the plumbers conference. See the presentation on contiguous allocations.

The contiguous allocations are particularly useful for the RDMA API which
allows registering user space memory with devices.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:20           ` Christopher Lameter
@ 2017-10-13 15:28             ` Michal Hocko
  2017-10-13 15:42               ` Christopher Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-13 15:28 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
[...]
> > I am not really convinced this is a good interface. You are basically
> > trying to bypass virtual memory abstraction and that is quite
> > contradicting the mmap API to me.
> 
> This is a standardized posix interface as described in our presentation at
> the plumbers conference. See the presentation on contiguous allocations.

Are you trying to design a generic interface with a very specific and
HW-dependent use case in mind?
 
> The contiguous allocations are particularly useful for the RDMA API which
> allows registering user space memory with devices.

then make those devices expose an mmap implementation which does
that. You would get proper access control (via the fd), accounting,
and more.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:28             ` Michal Hocko
@ 2017-10-13 15:42               ` Christopher Lameter
  2017-10-13 15:47                 ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-13 15:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri, 13 Oct 2017, Michal Hocko wrote:

> On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> > On Fri, 13 Oct 2017, Michal Hocko wrote:
> [...]
> > > I am not really convinced this is a good interface. You are basically
> > > trying to bypass virtual memory abstraction and that is quite
> > > contradicting the mmap API to me.
> >
> > This is a standardized posix interface as described in our presentation at
> > the plumbers conference. See the presentation on contiguous allocations.
>
> Are you trying to design a generic interface with a very specific and
> HW-dependent use case in mind?

There is a generic posix interface that could be used for a variety of
specific hardware-dependent use cases.

> > The contiguous allocations are particularly useful for the RDMA API which
> > allows registering user space memory with devices.
>
> then make those devices expose an implementation of an mmap which does
> that. You would get both a proper access control (via fd), accounting
> and others.

There are numerous RDMA devices that would all need the mmap
implementation. And this covers only the needs of one subsystem. There are
other use cases.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:42               ` Christopher Lameter
@ 2017-10-13 15:47                 ` Michal Hocko
  2017-10-13 15:56                   ` Christopher Lameter
  2017-10-15  6:58                   ` Pavel Machek
  0 siblings, 2 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-13 15:47 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri 13-10-17 10:42:37, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
> 
> > On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > [...]
> > > > I am not really convinced this is a good interface. You are basically
> > > > trying to bypass virtual memory abstraction and that is quite
> > > > contradicting the mmap API to me.
> > >
> > > This is a standardized posix interface as described in our presentation at
> > > the plumbers conference. See the presentation on contiguous allocations.
> >
> > Are you trying to design a generic interface with a very specific and
> > HW-dependent use case in mind?
> 
> There is a generic posix interface that could be used for a variety of
> specific hardware-dependent use cases.

Yes you wrote that already and my counter argument was that this generic
posix interface shouldn't bypass virtual memory abstraction.

> > > The contiguous allocations are particularly useful for the RDMA API which
> > > allows registering user space memory with devices.
> >
> > then make those devices expose an implementation of an mmap which does
> > that. You would get both a proper access control (via fd), accounting
> > and others.
> 
> There are numerous RDMA devices that would all need the mmap
> implementation. And this covers only the needs of one subsystem. There are
> other use cases.

That doesn't prevent providing a library function which could be reused
by all those drivers. Nothing really too much different from
remap_pfn_range.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:47                 ` Michal Hocko
@ 2017-10-13 15:56                   ` Christopher Lameter
  2017-10-13 16:17                     ` Michal Hocko
  2017-10-15  6:58                   ` Pavel Machek
  1 sibling, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-13 15:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri, 13 Oct 2017, Michal Hocko wrote:

> > There is a generic posix interface that could be used for a variety of
> > specific hardware-dependent use cases.
>
> Yes you wrote that already and my counter argument was that this generic
> posix interface shouldn't bypass virtual memory abstraction.

It does do that? In what way?

> > There are numerous RDMA devices that would all need the mmap
> > implementation. And this covers only the needs of one subsystem. There are
> > other use cases.
>
> That doesn't prevent providing a library function which could be reused
> by all those drivers. Nothing really too much different from
> remap_pfn_range.

And then in all the other use cases as well. It would be much easier if
mmap could give you the memory you need instead of having numerous drivers
improvise on their own. This is in particular also useful
for numerous embedded use cases where you need contiguous memory.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:56                   ` Christopher Lameter
@ 2017-10-13 16:17                     ` Michal Hocko
  2017-10-15  7:50                       ` Guy Shattah
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-13 16:17 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
> 
> > > There is a generic posix interface that could be used for a variety of
> > > specific hardware-dependent use cases.
> >
> > Yes you wrote that already and my counter argument was that this generic
> > posix interface shouldn't bypass virtual memory abstraction.
> 
> It does do that? In what way?

The availability of the virtual address space depends on the availability
of a same-sized contiguous physical memory range. That sounds to me like
the abstraction is largely gone.

> > > There are numerous RDMA devices that would all need the mmap
> > > implementation. And this covers only the needs of one subsystem. There are
> > > other use cases.
> >
> > That doesn't prevent providing a library function which could be reused
> > by all those drivers. Nothing really too much different from
> > remap_pfn_range.
> 
> And then in all the other use cases as well. It would be much easier if
> > mmap could give you the memory you need instead of having numerous drivers
> improvise on their own. This is in particular also useful
> for numerous embedded use cases where you need contiguous memory.

But a generic implementation would have to deal with many issues as
already mentioned. If you make this driver specific you can have access
control based on fd etc... I really fail to see how this is any
different from remap_pfn_range.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 15:47                 ` Michal Hocko
  2017-10-13 15:56                   ` Christopher Lameter
@ 2017-10-15  6:58                   ` Pavel Machek
  2017-10-16  8:18                     ` Michal Hocko
  1 sibling, 1 reply; 63+ messages in thread
From: Pavel Machek @ 2017-10-15  6:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

Hi!

> Yes you wrote that already and my counter argument was that this generic
> posix interface shouldn't bypass virtual memory abstraction.
> 
> > > > The contiguous allocations are particularly useful for the RDMA API which
> > > > allows registering user space memory with devices.
> > >
> > > then make those devices expose an implementation of an mmap which does
> > > that. You would get both a proper access control (via fd), accounting
> > > and others.
> > 
> > There are numerous RDMA devices that would all need the mmap
> > implementation. And this covers only the needs of one subsystem. There are
> > other use cases.
> 
> That doesn't prevent providing a library function which could be reused
> by all those drivers. Nothing really too much different from
> remap_pfn_range.

So you'd suggest using ioctl() for allocating memory?

That sounds quite ugly to me... mmap(MAP_CONTIG) is not nice, either, but better than
each driver inventing a custom interface...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-13 16:17                     ` Michal Hocko
@ 2017-10-15  7:50                       ` Guy Shattah
  2017-10-16  8:24                         ` Michal Hocko
                                           ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Guy Shattah @ 2017-10-15  7:50 UTC (permalink / raw)
  To: Michal Hocko, Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka



On 13/10/2017 19:17, Michal Hocko wrote:
> On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>
>>>> There is a generic posix interface that could be used for a variety of
>>>> specific hardware-dependent use cases.
>>> Yes you wrote that already and my counter argument was that this generic
>>> posix interface shouldn't bypass virtual memory abstraction.
>> It does do that? In what way?
> availability of the virtual address space depends on the availability of
> the same sized contiguous physical memory range. That sounds like the
> abstraction is gone to large part to me.
In what way? Userspace users will still be working with virtual memory.

>
>>>> There are numerous RDMA devices that would all need the mmap
>>>> implementation. And this covers only the needs of one subsystem. There are
>>>> other use cases.
>>> That doesn't prevent providing a library function which could be reused
>>> by all those drivers. Nothing really too much different from
>>> remap_pfn_range.
>> And then in all the other use cases as well. It would be much easier if
>> mmap could give you the memory you need instead of having numerous drivers
>> improvise on their own. This is in particular also useful
>> for numerous embedded use cases where you need contiguous memory.
> But a generic implementation would have to deal with many issues as
> already mentioned. If you make this driver specific you can have access
> control based on fd etc... I really fail to see how this is any
> different from remap_pfn_range.
Why have several driver-specific implementations if you can generalize
the idea and implement an already existing POSIX standard?
--
Guy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
  2017-10-12 11:22     ` Anshuman Khandual
  2017-10-12 14:37     ` Michal Hocko
@ 2017-10-15  8:07     ` Guy Shattah
  2 siblings, 0 replies; 63+ messages in thread
From: Guy Shattah @ 2017-10-15  8:07 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Christoph Lameter, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka



On 13/10/2017 19:17, Michal Hocko wrote:
> On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>
>>>> There is a generic posix interface that could be used for a variety
>>>> of specific hardware-dependent use cases.
>>> Yes you wrote that already and my counter argument was that this 
>>> generic posix interface shouldn't bypass virtual memory abstraction.
>> It does do that? In what way?
> availability of the virtual address space depends on the availability 
> of the same sized contiguous physical memory range. That sounds like 
> the abstraction is gone to large part to me.

In what way? Userspace users will still be working with virtual memory.

>
>>>> There are numerous RDMA devices that would all need the mmap 
>>>> implementation. And this covers only the needs of one subsystem. 
>>>> There are other use cases.
>>> That doesn't prevent providing a library function which could be 
>>> reused by all those drivers. Nothing really too much different from 
>>> remap_pfn_range.
>> And then in all the other use cases as well. It would be much easier 
>> if mmap could give you the memory you need instead of having numerous
>> drivers improvise on their own. This is in particular also useful for 
>> numerous embedded use cases where you need contiguous memory.
> But a generic implementation would have to deal with many issues as 
> already mentioned. If you make this driver specific you can have 
> access control based on fd etc... I really fail to see how this is any 
> different from remap_pfn_range.

Why have several driver-specific implementations if you can generalize the idea and implement an already existing POSIX standard?
--
Guy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-15  6:58                   ` Pavel Machek
@ 2017-10-16  8:18                     ` Michal Hocko
  2017-10-16  9:54                       ` Pavel Machek
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-16  8:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

On Sun 15-10-17 08:58:56, Pavel Machek wrote:
> Hi!
> 
> > Yes you wrote that already and my counter argument was that this generic
> > posix interface shouldn't bypass virtual memory abstraction.
> > 
> > > > > The contiguous allocations are particularly useful for the RDMA API which
> > > > > allows registering user space memory with devices.
> > > >
> > > > then make those devices expose an implementation of an mmap which does
> > > > that. You would get both a proper access control (via fd), accounting
> > > > and others.
> > > 
> > > There are numerous RDMA devices that would all need the mmap
> > > implementation. And this covers only the needs of one subsystem. There are
> > > other use cases.
> > 
> > That doesn't prevent providing a library function which could be reused
> > by all those drivers. Nothing really too much different from
> > remap_pfn_range.
> 
> So you'd suggest using ioctl() for allocating memory?

Why not use standard mmap on the device fd?
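From user space, the device-fd pattern being suggested looks like the sketch below: access control happens at open() time, and the driver's mmap file operation decides how the backing memory is allocated. /dev/zero stands in here for a hypothetical driver that would hand out contiguous memory.

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes through a device fd.  A real contiguous-memory
 * driver would do its allocation in its mmap file operation;
 * /dev/zero is only a stand-in.  Returns NULL on failure. */
void *map_device(const char *node, size_t len)
{
	int fd = open(node, O_RDWR);	/* permission check happens here */
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping outlives the fd */
	return p == MAP_FAILED ? NULL : p;
}
```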
 
> That sounds quite ugly to me... mmap(MAP_CONTIG) is not nice, either, but better than
> each driver inventing custom interface...

As already pointed out elsewhere, I do not really see a difference to
remap_pfn_range from the API point of view. A driver has some
requirements on the memory, so those can be reflected in the mmap
implementation for the driver. I really do not see how that would be a
general interface without a lot of headache in the future. Contiguous
memory is hard to guarantee or give out without risks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-15  7:50                       ` Guy Shattah
@ 2017-10-16  8:24                         ` Michal Hocko
  2017-10-16  9:11                           ` Guy Shattah
  2017-10-16 10:33                         ` Michal Nazarewicz
  2017-10-16 17:43                         ` Mike Kravetz
  2 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-16  8:24 UTC (permalink / raw)
  To: Guy Shattah
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> 
> 
> On 13/10/2017 19:17, Michal Hocko wrote:
> > On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
> > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > 
> > > > > There is a generic posix interface that could be used for a variety of
> > > > > specific hardware-dependent use cases.
> > > > Yes you wrote that already and my counter argument was that this generic
> > > > posix interface shouldn't bypass virtual memory abstraction.
> > > It does do that? In what way?
> > availability of the virtual address space depends on the availability of
> > the same sized contiguous physical memory range. That sounds like the
> > abstraction is gone to large part to me.
>
> In what way? userspace users will still be working with virtual memory.

So you are saying that providing an API which fails randomly because of
physically fragmented memory is OK? Users shouldn't really care
about the state of the physical memory. That is what we have virtual
memory for.
 
> > > > > There are numerous RDMA devices that would all need the mmap
> > > > > implementation. And this covers only the needs of one subsystem. There are
> > > > > other use cases.
> > > > That doesn't prevent providing a library function which could be reused
> > > > by all those drivers. Nothing really too much different from
> > > > remap_pfn_range.
> > > And then in all the other use cases as well. It would be much easier if
> > > mmap could give you the memory you need instead of having numerous drivers
> > > improvise on their own. This is in particular also useful
> > > for numerous embedded use cases where you need contiguous memory.
> > But a generic implementation would have to deal with many issues as
> > already mentioned. If you make this driver specific you can have access
> > control based on fd etc... I really fail to see how this is any
> > different from remap_pfn_range.
> Why have several driver-specific implementations if you can generalize
> the idea and implement an already existing POSIX standard?

Because users shouldn't really care, really. We do have means to get
large memory, and having guaranteed large memory is a PITA. Just look
at hugetlb and all the issues it exposes. And that one is preallocated,
and it requires an admin to make a conscious decision about the amount
of memory. You would like to establish something similar except without
bounds on the size and with no pre-allowed amount set by an admin. This
sounds just crazy to me.

On the other hand if you make this per-device mmap implementation you
can have both admin defined policy on who is allowed this memory and
moreover drivers can implement their fallback strategies which best suit
their needs. I really fail to see how this is any different from using
specialized mmap implementations.

I might be really wrong but I consider such a general purpose flag quite
dangerous and future maintenance burden. At least from the hugetlb/THP
history I do not see why this should be any different.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16  8:24                         ` Michal Hocko
@ 2017-10-16  9:11                           ` Guy Shattah
  2017-10-16 12:32                             ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Guy Shattah @ 2017-10-16  9:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka



On 16/10/2017 11:24, Michal Hocko wrote:
> On Sun 15-10-17 10:50:29, Guy Shattah wrote:
>>
>> On 13/10/2017 19:17, Michal Hocko wrote:
>>> On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
>>>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>>>
>>>>>> There is a generic posix interface that could be used for a variety of
>>>>>> specific hardware-dependent use cases.
>>>>> Yes you wrote that already and my counter argument was that this generic
>>>>> posix interface shouldn't bypass virtual memory abstraction.
>>>> It does do that? In what way?
>>> availability of the virtual address space depends on the availability of
>>> the same sized contiguous physical memory range. That sounds like the
>>> abstraction is gone to large part to me.
>> In what way? userspace users will still be working with virtual memory.
> So you are saying that providing an API which fails randomly because of
> the physically fragmented memory is OK? Users shouldn't really care
> about the state of the physical memory. That is what we have the virtual
> memory for.

Users still see and work with virtual addresses, just as before.
Users of the suggested API are aware that it might fail, since the
result depends on the current state of system memory. This won't be
the first system call, or the last, to fail for reasons beyond user
control. For example, any user application might fail due to the number
of open files, disk space, memory availability, or network availability,
all beyond user control. A smart user always has ways to handle such
exceptions. A typical user failing to allocate contiguous memory may
fall back to allocating non-contiguous memory. And by the way, even if
each vendor implements their own method to allocate contiguous memory,
that vendor-specific API might fail too, for the same reasons.




>   
>>>>>> There are numerous RDMA devices that would all need the mmap
>>>>>> implementation. And this covers only the needs of one subsystem. There are
>>>>>> other use cases.
>>>>> That doesn't prevent providing a library function which could be reused
>>>>> by all those drivers. Nothing really too much different from
>>>>> remap_pfn_range.
>>>> And then in all the other use cases as well. It would be much easier if
>>>> mmap could give you the memory you need instead of havig numerous drivers
>>>> improvise on their own. This is in particular also useful
>>>> for numerous embedded use cases where you need contiguous memory.
>>> But a generic implementation would have to deal with many issues as
>>> already mentioned. If you make this driver specific you can have access
>>> control based on fd etc... I really fail to see how this is any
>>> different from remap_pfn_range.
>> Why have several driver specific implementation if you can generalize the
>> idea and implement
>> an already existing POSIX standard?
> Because users shouldn't really care, really. We do have means to get
> large memory and having a guaranteed large memory is a PITA. Just look
> at hugetlb and all the issues it exposes. And that one is preallocated
> and it requires admin to do a conscious decision about the amount of the
> memory. You would like to establish something similar except without
> bounds to the size and no pre-allowed amount by an admin. This sounds
> just crazy to me.

Users do care about the performance they get from devices which benefit
from contiguous memory allocation.
Assume a user requires 700MB of contiguous memory. Why allocate a
giant (1GB) page when you can allocate 700MB out of the 1GB and put the
remaining 300MB back in the huge-pages/small-pages pool?


>
> On the other hand if you make this per-device mmap implementation you
> can have both admin defined policy on who is allowed this memory and
> moreover drivers can implement their fallback strategies which best suit
> their needs. I really fail to see how this is any different from using
> specialized mmap implementations.
We tried doing it in the past, but the maintainer gave us a very good
argument:
"If you want to support anonymous mmaps to allocate large contiguous
pages work with the MM folks on providing that in a generic fashion."

After discussing it with people who have the same requirements as we do,
I totally agree with him:

http://comments.gmane.org/gmane.linux.drivers.rdma/31467

> I might be really wrong but I consider such a general purpose flag quite
> dangerous and future maintenance burden. At least from the hugetlb/THP
> history I do not see why this should be any different.
Could you please elaborate on why it is dangerous and a future
maintenance burden?

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16  8:18                     ` Michal Hocko
@ 2017-10-16  9:54                       ` Pavel Machek
  2017-10-16 12:18                         ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Pavel Machek @ 2017-10-16  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka


On Mon 2017-10-16 10:18:04, Michal Hocko wrote:
> On Sun 15-10-17 08:58:56, Pavel Machek wrote:
> > Hi!
> > 
> > > Yes you wrote that already and my counter argument was that this generic
> > > posix interface shouldn't bypass virtual memory abstraction.
> > > 
> > > > > > The contiguous allocations are particularly useful for the RDMA API which
> > > > > > allows registering user space memory with devices.
> > > > >
> > > > > then make those devices expose an implementation of an mmap which does
> > > > > that. You would get both a proper access control (via fd), accounting
> > > > > and others.
> > > > 
> > > > There are numerous RDMA devices that would all need the mmap
> > > > implementation. And this covers only the needs of one subsystem. There are
> > > > other use cases.
> > > 
> > > That doesn't prevent providing a library function which could be reused
> > > by all those drivers. Nothing really too much different from
> > > remap_pfn_range.
> > 
> > So you'd suggest using ioctl() for allocating memory?
> 
> Why not using standard mmap on the device fd?

No, sorry, that's something very different, right? Let's say I
have a disk, and I'd like to write to it, using contiguous memory for
performance.

So I mmap(MAP_CONTIG) 1GB of working memory, prepare some data
structures there, maybe receive from the network, then decide to write
some and not write some other.

mmap(sda) does something very different... Everything you write to
that mmap will eventually go to the disk, and you don't have complete
control over when.

Also, you can do mmap(MAP_CONTIG) and use that for both disk and
network. That would not work with mmap(sda) and mmap(eth0)...

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-15  7:50                       ` Guy Shattah
  2017-10-16  8:24                         ` Michal Hocko
@ 2017-10-16 10:33                         ` Michal Nazarewicz
  2017-10-16 11:09                           ` Guy Shattah
  2017-10-16 17:43                         ` Mike Kravetz
  2 siblings, 1 reply; 63+ messages in thread
From: Michal Nazarewicz @ 2017-10-16 10:33 UTC (permalink / raw)
  To: Guy Shattah, Michal Hocko, Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Sun, Oct 15 2017, Guy Shattah wrote:
> Why have several driver-specific implementations if you can generalize
> the idea and implement an already existing POSIX standard?

Why is there a need for contiguous allocation?

The CPU cares only up to the point of huge pages, and there’s already an
effort in the kernel to allocate huge pages transparently without user
space being aware of it.

If not the CPU, then various devices, all of which may have very different
needs.  Some may be behind an IO MMU.  Some may support DMA.  Some may
indeed require physically contiguous memory.  How is user space to know?

Furthermore, user space does not care whether allocation is physically
contiguous or not.  What it cares about is whether given allocation can
be passed as a buffer to a particular device.

If generalisation is the issue, then the solution is to define a common
API where user-space can allocate memory *in the context of* a device.
This provides a ‘give me memory I can use for this device’ request which
is what user space really wants.

So yeah, like others in this thread, the reason for this change eludes
me.  On the other hand, I don’t care much so I’ll limit myself to this
one message.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 10:33                         ` Michal Nazarewicz
@ 2017-10-16 11:09                           ` Guy Shattah
  0 siblings, 0 replies; 63+ messages in thread
From: Guy Shattah @ 2017-10-16 11:09 UTC (permalink / raw)
  To: Michal Nazarewicz, Michal Hocko, Christopher Lameter
  Cc: Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka



On 16/10/2017 13:33, Michal Nazarewicz wrote:
> On Sun, Oct 15 2017, Guy Shattah wrote:
>> Why have several driver-specific implementations if you can generalize
>> the idea and implement an already existing POSIX standard?
> Why is there a need for contiguous allocation?

This was explained in detail in a talk delivered by me and Christopher
Lameter at the 2017 Plumbers conference:
https://linuxplumbersconf.org/2017/ocw/proposals/4669
Please see the slides there.

> If generalisation is the issue, then the solution is to define a common
> API where user-space can allocate memory *in the context of* a device.
> This provides a ‘give me memory I can use for this device’ request which
> is what user space really wants.
Do you suggest adding a whole new common API instead of merely adding a
flag to an existing one?


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16  9:54                       ` Pavel Machek
@ 2017-10-16 12:18                         ` Michal Hocko
  2017-10-16 16:02                           ` Christopher Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 12:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Guy Shattah, Anshuman Khandual,
	Laura Abbott, Vlastimil Babka

On Mon 16-10-17 11:54:47, Pavel Machek wrote:
> On Mon 2017-10-16 10:18:04, Michal Hocko wrote:
> > On Sun 15-10-17 08:58:56, Pavel Machek wrote:
[...]
> > > So you'd suggest using ioctl() for allocating memory?
> > 
> > Why not using standard mmap on the device fd?
> 
> No, sorry, that's something very different, right? Let's say I
> have a disk, and I'd like to write to it, using contiguous memory for
> performance.
> 
> So I mmap(MAP_CONTIG) 1GB of working memory, prepare some data
> structures there, maybe receive from the network, then decide to write
> some and not write some other.

Why would you want this?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16  9:11                           ` Guy Shattah
@ 2017-10-16 12:32                             ` Michal Hocko
  2017-10-16 16:00                               ` Christopher Lameter
  2017-10-17 10:50                               ` Guy Shattah
  0 siblings, 2 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 12:32 UTC (permalink / raw)
  To: Guy Shattah
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon 16-10-17 12:11:04, Guy Shattah wrote:
> 
> 
> On 16/10/2017 11:24, Michal Hocko wrote:
> > On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> > > 
> > > On 13/10/2017 19:17, Michal Hocko wrote:
> > > > On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
> > > > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > > > 
> > > > > > > There is a generic posix interface that could be used for a variety of
> > > > > > > specific hardware-dependent use cases.
> > > > > > Yes you wrote that already and my counter argument was that this generic
> > > > > > posix interface shouldn't bypass virtual memory abstraction.
> > > > > It does do that? In what way?
> > > > availability of the virtual address space depends on the availability of
> > > > the same sized contiguous physical memory range. That sounds like the
> > > > abstraction is gone to large part to me.
> > > In what way? userspace users will still be working with virtual memory.
> > So you are saying that providing an API which fails randomly because of
> > the physically fragmented memory is OK? Users shouldn't really care
> > about the state of the physical memory. That is what we have the virtual
> > memory for.
> 
> Users still see and work with virtual addresses, just as before.
> Users using the suggested API are aware that the API might fail, since
> it depends on the current state of system memory. This won't be the
> first system call or the last one to fail due to reasons beyond user
> control. For example: any user app might fail due to the number of open
> files, disk space, memory availability, or network availability. All
> beyond user control.

But memory fragmentation is not something that directly maps to memory
usage. As such it behaves more or less randomly with respect to memory
utilization (see the difference from the examples mentioned above?). It
depends on many other things, basically rendering such an API useless
unless you guarantee that a large part of the memory is movable.

> A smart user always has ways to handle exceptions.  A typical user
> failing to allocate contiguous memory may fall back to allocating
> non-contiguous memory. And by the way - even if each vendor implements
> their own method to allocate contiguous memory, then that
> vendor-specific API might fail too.  For the same reasons.

Yes, the kernel-side mmap implementation would have to care about this
as well. Nobody is questioning that part. I am just questioning whether
such a general-purpose API is reasonable.
 
> > > > > > > There are numerous RDMA devices that would all need the mmap
> > > > > > > implementation. And this covers only the needs of one subsystem. There are
> > > > > > > other use cases.
> > > > > > That doesn't prevent providing a library function which could be reused
> > > > > > by all those drivers. Nothing really too much different from
> > > > > > remap_pfn_range.
> > > > > And then in all the other use cases as well. It would be much easier if
> > > > > mmap could give you the memory you need instead of having numerous drivers
> > > > > improvise on their own. This is in particular also useful
> > > > > for numerous embedded use cases where you need contiguous memory.
> > > > But a generic implementation would have to deal with many issues as
> > > > already mentioned. If you make this driver specific you can have access
> > > > control based on fd etc... I really fail to see how this is any
> > > > different from remap_pfn_range.
> > > Why have several driver-specific implementations if you can
> > > generalize the idea and implement an already existing POSIX standard?
> > Because users shouldn't really care, really. We do have means to get
> > large memory and having a guaranteed large memory is a PITA. Just look
> > at hugetlb and all the issues it exposes. And that one is preallocated
> > and it requires admin to do a conscious decision about the amount of the
> > memory. You would like to establish something similar except without
> > bounds to the size and no pre-allowed amount by an admin. This sounds
> > just crazy to me.
> 
> Users do care about the performance they get from devices which
> benefit from contiguous memory allocation.  Suppose a user requires
> 700MB of contiguous memory. Why allocate a giant (1GB) page when you
> can allocate the 700MB out of the 1GB and put the remaining 300MB back
> in the huge-page/small-page pool?

I believe I have explained that part. Large pages are under admin
control and responsibility. If you give a free ticket to large memory to
any user who can pin that memory, then you are in serious trouble.
 
> > On the other hand if you make this per-device mmap implementation you
> > can have both admin defined policy on who is allowed this memory and
> > moreover drivers can implement their fallback strategies which best suit
> > their needs. I really fail to see how this is any different from using
> > specialized mmap implementations.
> We tried doing it in the past, but the maintainer gave us a very good
> argument:
> " If you want to support anonymous mmaps to allocate large contiguous
> pages work with the MM folks on providing that in a generic fashion."

Well, we can provide generic library functions for your driver to use
so that you do not have to care about implementation details, but I do
not think exposing this API to userspace in a generic fashion is a
good idea. Especially when the only use case that has been thought
through so far seems to be a very special HW optimization.

> After discussing it with people who have the same requirements as we do,
> I totally agree with him
> 
> http://comments.gmane.org/gmane.linux.drivers.rdma/31467
> 
> > I might be really wrong but I consider such a general purpose flag quite
> > dangerous and future maintenance burden. At least from the hugetlb/THP
> > history I do not see why this should be any different.
>
> Could you please elaborate on why it is dangerous and a future
> maintenance burden?

Providing large contiguous memory ranges is not easy and we actually do
not have any reliable way to offer such functionality even for kernel
users, because we assume they are not that many. Basically anything
larger than order-3 is best effort. Even the constant improvements of
compaction still leave us with something we cannot fully rely on. And
now you want to expose this to userspace with basically arbitrary
memory sizes to be supported?

But putting that aside. Pinning a lot of memory might cause many
performance issues and misbehavior. There are still kernel users
who need high order memory to work properly. On top of that you are
basically allowing an untrusted user to deplete higher order pages very
easily unless there is a clever way to enforce per user limit on this.

That being said, the list is far from being complete, I am pretty sure
more would pop out if I thought more thoroughly. The bottom line is that
while I see many problems to actually implement this feature and
maintain it longterm I simply do not see a large benefit outside of a
very specific HW.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 12:32                             ` Michal Hocko
@ 2017-10-16 16:00                               ` Christopher Lameter
  2017-10-16 17:42                                 ` Michal Hocko
  2017-10-17 10:50                               ` Guy Shattah
  1 sibling, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-16 16:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guy Shattah, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Mon, 16 Oct 2017, Michal Hocko wrote:

> But putting that aside. Pinning a lot of memory might cause many
> performance issues and misbehavior. There are still kernel users
> who need high order memory to work properly. On top of that you are
> basically allowing an untrusted user to deplete higher order pages very
> easily unless there is a clever way to enforce per user limit on this.

We already have that issue and have ways to control that by tracking
pinned and mlocked pages as well as limits on their allocations.

> That being said, the list is far from being complete, I am pretty sure
> more would pop out if I thought more thoroughly. The bottom line is that
> while I see many problems to actually implement this feature and
> maintain it longterm I simply do not see a large benefit outside of a
> very specific HW.

There is not much new here in terms of problems. The hardware that
needs this seems to become more and more plentiful. That is why we need a
generic implementation.


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 12:18                         ` Michal Hocko
@ 2017-10-16 16:02                           ` Christopher Lameter
  2017-10-16 17:33                             ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-16 16:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pavel Machek, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon, 16 Oct 2017, Michal Hocko wrote:

> > So I mmap(MAP_CONTIG) 1GB of working memory, prepare some data
> > structures there, maybe receive from the network, then decide to write
> > some and not write some other.
>
> Why would you want this?

Because we are receiving a 1GB block of data and then want to write it
to disk. Maybe we want to modify things a bit and may not write all that
we received.


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 16:02                           ` Christopher Lameter
@ 2017-10-16 17:33                             ` Michal Hocko
  2017-10-16 17:53                               ` Christopher Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 17:33 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Pavel Machek, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon 16-10-17 11:02:24, Cristopher Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > > So I mmap(MAP_CONTIG) 1GB of working memory, prepare some data
> > > structures there, maybe receive from the network, then decide to write
> > > some and not write some other.
> >
> > Why would you want this?
> 
> Because we are receiving a 1GB block of data and then wan to write it to
> disk. Maybe we want to modify things a bit and may not write all that we
> received.
 
And why do you need that in a single contiguous range? If it is for
performance, do you have any numbers that would clearly show the
difference?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 16:00                               ` Christopher Lameter
@ 2017-10-16 17:42                                 ` Michal Hocko
  2017-10-16 17:56                                   ` Christopher Lameter
  2017-10-23 15:25                                   ` David Nellans
  0 siblings, 2 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 17:42 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Guy Shattah, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Mon 16-10-17 11:00:19, Cristopher Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > But putting that aside. Pinning a lot of memory might cause many
> > performance issues and misbehavior. There are still kernel users
> > who need high order memory to work properly. On top of that you are
> > basically allowing an untrusted user to deplete higher order pages very
> > easily unless there is a clever way to enforce per user limit on this.
> 
> We already have that issue and have ways to control that by tracking
> pinned and mlocked pages as well as limits on their allocations.

Ohh, it is very different because the mlock limit is really small (64kB
by default), which is not even close to what this is supposed to be
about. Moreover, mlock doesn't prevent migration, and so it doesn't
prevent compaction from forming higher-order allocations.

Really, this is just too dangerous without deep consideration of all
the potential consequences. The more I think about this, the more I
am convinced that this should all be a driver-specific mmap-based thing.
If it turns out to be too restrictive over time and there is more
experience with the usage, we can consider thinking about a more
generic API. But starting from the generic MAP_ flag is just asking for
problems.

> > That being said, the list is far from being complete, I am pretty sure
> > more would pop out if I thought more thoroughly. The bottom line is that
> > while I see many problems to actually implement this feature and
> > maintain it longterm I simply do not see a large benefit outside of a
> > very specific HW.
> 
> There is not much new here in terms of problems. The hardware that
> needs this seems to become more and more plentiful. That is why we need a
> generic implementation.

It would really help to name that HW and other potential use cases
independent of the HW, because I am rather skeptical about the
_plentiful_ part. And so I really do not see any foundation for claiming
the generic part. Because, fundamentally, it is the HW which requires
the specific memory placement/physically contiguous range etc. So a
generic implementation doesn't really make sense in such a context.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-15  7:50                       ` Guy Shattah
  2017-10-16  8:24                         ` Michal Hocko
  2017-10-16 10:33                         ` Michal Nazarewicz
@ 2017-10-16 17:43                         ` Mike Kravetz
  2017-10-16 18:07                           ` Michal Hocko
  2 siblings, 1 reply; 63+ messages in thread
From: Mike Kravetz @ 2017-10-16 17:43 UTC (permalink / raw)
  To: Guy Shattah, Michal Hocko, Christopher Lameter
  Cc: linux-mm, linux-kernel, linux-api, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka

On 10/15/2017 12:50 AM, Guy Shattah wrote:
> On 13/10/2017 19:17, Michal Hocko wrote:
>> On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
>>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>>
>>>>> There is a generic posix interface that could be used for a variety of
>>>>> specific hardware-dependent use cases.
>>>> Yes you wrote that already and my counter argument was that this generic
>>>> posix interface shouldn't bypass virtual memory abstraction.
>>> It does do that? In what way?
>> availability of the virtual address space depends on the availability of
>> the same sized contiguous physical memory range. That sounds like the
>> abstraction is gone to large part to me.
> In what way? userspace users will still be working with virtual memory.
> 
>>
>>>>> There are numerous RDMA devices that would all need the mmap
>>>>> implementation. And this covers only the needs of one subsystem. There are
>>>>> other use cases.
>>>> That doesn't prevent providing a library function which could be reused
>>>> by all those drivers. Nothing really too much different from
>>>> remap_pfn_range.
>>> And then in all the other use cases as well. It would be much easier if
>>> mmap could give you the memory you need instead of having numerous drivers
>>> improvise on their own. This is in particular also useful
>>> for numerous embedded use cases where you need contiguous memory.
>> But a generic implementation would have to deal with many issues as
>> already mentioned. If you make this driver specific you can have access
>> control based on fd etc... I really fail to see how this is any
>> different from remap_pfn_range.
> Why have several driver-specific implementations if you can generalize
> the idea and implement an already existing POSIX standard?

Just to be clear, the posix standard talks about a typed memory object.
The suggested implementation has one create a connection to the memory
object to receive a fd, then use mmap as usual to get a mapping backed
by contiguous pages/memory.  Of course, this type of implementation is
not a requirement.  However, this type of implementation looks quite a
bit like hugetlbfs today.
- Both require opening a special file/device, and then calling mmap on
  the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
  still ends up using hugetlbfs.  BTW, there was resistance to adding the
  MAP_HUGETLB flag to mmap.
- Allocation of contiguous memory is much like 'on demand' allocation of
  huge pages.  There are some (not many) users that use this model.  They
  attempt to allocate huge pages on demand, and if not available fall back
  to base pages.  This is how contiguous allocations would need to work.
  Of course, most hugetlbfs users pre-allocate pages for their use, and
  this 'might' be something useful for contiguous allocations as well.

I wonder if going down the path of a separate device/filesystem/etc for
contiguous allocations might be a better option.  It would keep the
implementation somewhat separate.  However, I would then be afraid that
we end up with another 'separate/special vm' as in the case of hugetlbfs
today.

-- 
Mike Kravetz


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 17:33                             ` Michal Hocko
@ 2017-10-16 17:53                               ` Christopher Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christopher Lameter @ 2017-10-16 17:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pavel Machek, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Guy Shattah, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon, 16 Oct 2017, Michal Hocko wrote:

> On Mon 16-10-17 11:02:24, Cristopher Lameter wrote:
> > On Mon, 16 Oct 2017, Michal Hocko wrote:
> >
> > > > So I mmap(MAP_CONTIG) 1GB of working memory, prepare some data
> > > > structures there, maybe receive from the network, then decide to write
> > > > some and not write some other.
> > >
> > > Why would you want this?
> >
> > Because we are receiving a 1GB block of data and then want to write it to
> > disk. Maybe we want to modify things a bit and may not write all that we
> > received.
>
> And why do you need that in a single contiguous range? If it is for
> performance, do you have any numbers that would clearly show the
> difference?

Again, we have that in the presentation. Why keep asking the same
question if you already have the answer multiple times?

1GB of data requires about 250,000 page structs to handle if the memory
is not contiguous. This is more than most controllers can support, and
thus the overhead will dominate I/O. The scatter-gather lists needed to
manage that many linked 4k pages also become unwieldy.

And in practice we already have multi-gigabyte requests, which makes it
even more severe. You cannot do a "cp" operation anymore. Instead you
need special code that allocates huge pages, does direct I/O, and so on.


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 17:42                                 ` Michal Hocko
@ 2017-10-16 17:56                                   ` Christopher Lameter
  2017-10-16 18:17                                     ` Michal Hocko
  2017-10-23 15:25                                   ` David Nellans
  1 sibling, 1 reply; 63+ messages in thread
From: Christopher Lameter @ 2017-10-16 17:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guy Shattah, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Mon, 16 Oct 2017, Michal Hocko wrote:

> > We already have that issue and have ways to control that by tracking
> > pinned and mlocked pages as well as limits on their allocations.
>
> Ohh, it is very different because the mlock limit is really small (64kB
> by default), which is not even close to what this is supposed to be
> about. Moreover, mlock doesn't prevent migration, and so it doesn't
> prevent compaction from forming higher-order allocations.

The mlock limit is configurable. There is tracking of pinned pages as
well.

> Really, this is just too dangerous without deep consideration of all
> the potential consequences. The more I think about this, the more I
> am convinced that this should all be a driver-specific mmap-based thing.
> If it turns out to be too restrictive over time and there is more
> experience with the usage, we can consider thinking about a more
> generic API. But starting from the generic MAP_ flag is just asking for
> problems.

This issue is already present with the pinning of lots of memory via the
RDMA API when in use for large gigabyte ranges. There is nothing new aside
from memory being contiguous with this approach.

> > There is not much new here in terms of problems. The hardware that
> > needs this seems to become more and more plentiful. That is why we need a
> > generic implementation.
>
> It would really help to name that HW and other potential use cases
> independent of the HW, because I am rather skeptical about the
> _plentiful_ part. And so I really do not see any foundation for claiming
> the generic part. Because, fundamentally, it is the HW which requires
> the specific memory placement/physically contiguous range etc. So a
> generic implementation doesn't really make sense in such a context.

RDMA hardware? Storage interfaces? Look at what the RDMA subsystem
and storage (NVME?) support.

This is not a hardware-specific thing but a reflection of the general
limitations of the existing 4k page struct scheme, which limits
performance and causes severe pressure on I/O devices.


* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 17:43                         ` Mike Kravetz
@ 2017-10-16 18:07                           ` Michal Hocko
  2017-10-16 20:32                             ` Mike Kravetz
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 18:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
> On 10/15/2017 12:50 AM, Guy Shattah wrote:
> > On 13/10/2017 19:17, Michal Hocko wrote:
[...]
> >> But a generic implementation would have to deal with many issues as
> >> already mentioned. If you make this driver specific you can have access
> >> control based on fd etc... I really fail to see how this is any
> >> different from remap_pfn_range.
> > Why have several driver specific implementation if you can generalize the idea and implement
> > an already existing POSIX standard?
> 
> Just to be clear, the posix standard talks about a typed memory object.
> The suggested implementation has one create a connection to the memory
> object to receive a fd, then use mmap as usual to get a mapping backed
> by contiguous pages/memory.  Of course, this type of implementation is
> not a requirement.

I am not sure that the POSIX standard for typed memory is easily
implementable in Linux. Does any OS actually implement this API?

> However, this type of implementation looks quite a
> bit like hugetlbfs today.
> - Both require opening a special file/device, and then calling mmap on
>   the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
>   still ends up using hugetlbfs.  BTW, there was resistance to adding the
>   MAP_HUGETLB flag to mmap.

And I think we shouldn't really shape any API based on hugetlb.

> - Allocation of contiguous memory is much like 'on demand' allocation of
>   huge pages.  There are some (not many) users that use this model.  They
>   attempt to allocate huge pages on demand, and if not available fall back
>   to base pages.  This is how contiguous allocations would need to work.
>   Of course, most hugetlbfs users pre-allocate pages for their use, and
>   this 'might' be something useful for contiguous allocations as well.

But there is still admin configuration required to consume memory from
the pool or overcommit that pool.

> I wonder if going down the path of a separate device/filesystem/etc for
> contiguous allocations might be a better option.  It would keep the
> implementation somewhat separate.  However, I would then be afraid that
> we end up with another 'separate/special vm' as in the case of hugetlbfs
> today.

That depends on who is actually going to use the contiguous memory. If
we are talking about drivers communicating to userspace through a
driver-specific fd with its own mmap implementation, then we do not
need any special fs nor a separate infrastructure. Well, except for a
library function to handle the MM side of the thing.

If we really need a general purpose physical contiguous memory allocator
then I would agree that using MAP_ flag might be a way to go but that
would require a very careful consideration of who is allowed to allocate
and how much/large blocks. I do not see a good fit to conveying that
information to the kernel right now. Moreover, and most importantly, I
haven't heard any sound usecase for such a functionality in the first
place. There is some hand waving about performance but there are no real
numbers to back those claims AFAIK. Not to mention a serious
consideration of the potential consequences for the whole MM.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 17:56                                   ` Christopher Lameter
@ 2017-10-16 18:17                                     ` Michal Hocko
  0 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 18:17 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Guy Shattah, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Mon 16-10-17 12:56:43, Christopher Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > > We already have that issue and have ways to control that by tracking
> > > pinned and mlocked pages as well as limits on their allocations.
> >
> > Ohh, it is very different because mlock limit is really small (64kB)
> > which is not even close to what this is supposed to be about. Moreover
> > mlock doesn't prevent migration and so it doesn't prevent
> > compaction from forming higher-order allocations.
> 
> The mlock limit is configurable. There is a tracking of pinned pages as
> well.

I am not aware of any such generic tracking API. The attempt by Peter
has never been merged. So what we have right now is just ad-hoc
tracking...
 
> > Really, this is just too dangerous without a deep consideration of all
> > the potential consequences. The more I am thinking about this the more I
> > am convinced that this all should be driver specific mmap based thing.
> > If it turns out to be too restrictive over time and there are more
> > experiences about the usage we can consider thinking about a more
> > generic API. But starting from the generic MAP_ flag is just asking for
> > problems.
> 
> This issue is already present with the pinning of lots of memory via the
> RDMA API when in use for large gigabyte ranges.

... like in those

> There is nothing new aside
> from memory being contiguous with this approach.

which makes a hell of a difference. Once you allow pinning of larger
blocks of memory you make the whole compaction hopelessly ineffective.

> > > There is not much new here in terms of problems. The hardware that
> > > needs this seems to become more and more plentiful. That is why we need a
> > > generic implementation.
> >
> > It would really help to name that HW and other potential usecases
> > independently of the HW because I am rather skeptical about the
> > _plentiful_ part. And so I really do not see any foundation to claim
> > the generic part. Because, fundamentally, it is the HW which requires
> > the specific memory placement/physically contiguous range etc. So the
> > generic implementation doesn't really make sense in such a context.
> 
> RDMA hardware? Storage interfaces? Look at what the RDMA subsystem
> and storage (NVME?) support.
> 
> This is not a hardware specific thing but a reflection of the general
> limitations of the existing 4k page struct scheme that limits performance
> and causes severe pressure on I/O devices.

This is something more for storage people to comment on. I expect (NVMe)
storage to use DAX and its support for large and direct access. Nothing
really prevents RDMA HW to provide mmap implementation to use contiguous
pages, we already provide an API to allocate large memory.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 18:07                           ` Michal Hocko
@ 2017-10-16 20:32                             ` Mike Kravetz
  2017-10-16 20:58                               ` Michal Hocko
                                                 ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-16 20:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On 10/16/2017 11:07 AM, Michal Hocko wrote:
> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>> Just to be clear, the posix standard talks about a typed memory object.
>> The suggested implementation has one create a connection to the memory
>> object to receive a fd, then use mmap as usual to get a mapping backed
>> by contiguous pages/memory.  Of course, this type of implementation is
>> not a requirement.
> 
> I am not sure that the POSIX standard for typed memory is easily
> implementable in Linux. Does any OS actually implement this API?

A quick search only reveals Blackberry QNX and PlayBook OS.

Also somewhat related.  In an earlier thread someone pointed out this
out of tree module used for contiguous allocations in SOC (and other?)
environments.  It even has the option of making use of CMA.
http://processors.wiki.ti.com/index.php/CMEM_Overview

>> However, this type of implementation looks quite a
>> bit like hugetlbfs today.
>> - Both require opening a special file/device, and then calling mmap on
>>   the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
>>   still ends up using hugetlbfs.  BTW, there was resistance to adding the
>>   MAP_HUGETLB flag to mmap.
> 
> And I think we shouldn't really shape any API based on hugetlb.

Agree.  I only wanted to point out the similarities.
But, it does make me wonder how much of a benefit hugetlb 1G pages would
make in the RDMA performance comparison.  The table in the presentation
shows an average speedup of something like 27% (or so) for contiguous allocations,
which I assume are 2GB in size.  Certainly, using hugetlb is not the ideal
case; I am just wondering if it does help and how much.

>> - Allocation of contiguous memory is much like 'on demand' allocation of
>>   huge pages.  There are some (not many) users that use this model.  They
>>   attempt to allocate huge pages on demand, and if not available fall back
>>   to base pages.  This is how contiguous allocations would need to work.
>>   Of course, most hugetlbfs users pre-allocate pages for their use, and
>>   this 'might' be something useful for contiguous allocations as well.
> 
> But there is still admin configuration required to consume memory from
> the pool or overcommit that pool.
> 
>> I wonder if going down the path of a separate device/filesystem/etc for
>> contiguous allocations might be a better option.  It would keep the
>> implementation somewhat separate.  However, I would then be afraid that
>> we end up with another 'separate/special vm' as in the case of hugetlbfs
>> today.
> 
> That depends on who is actually going to use the contiguous memory. If
> we are talking about drivers communicating to userspace through a
> driver-specific fd with its own mmap implementation, then we do not
> need any special fs nor a separate infrastructure. Well, except for a
> library function to handle the MM side of the thing.

If we embed this functionality into device specific mmap calls it will
closely tie the usage to the devices.  However, don't we still have to
worry about potential interaction with other parts of the mm as you mention
below?  I guess that would be the library function and how it is used
by drivers.

-- 
Mike Kravetz

> If we really need a general purpose physical contiguous memory allocator
> then I would agree that using MAP_ flag might be a way to go but that
> would require a very careful consideration of who is allowed to allocate
> and how much/large blocks. I do not see a good fit to conveying that
> information to the kernel right now. Moreover, and most importantly, I
> haven't heard any sound usecase for such a functionality in the first
> place. There is some hand waving about performance but there are no real
> numbers to back those claims AFAIK. Not to mention a serious
> consideration of the potential consequences for the whole MM.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 20:32                             ` Mike Kravetz
@ 2017-10-16 20:58                               ` Michal Hocko
  2017-10-16 21:03                               ` Laura Abbott
  2017-10-17  6:59                               ` Vlastimil Babka
  2 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-16 20:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Mon 16-10-17 13:32:45, Mike Kravetz wrote:
> On 10/16/2017 11:07 AM, Michal Hocko wrote:
[...]
> > That depends on who is actually going to use the contiguous memory. If
> > we are talking about drivers communicating to userspace through a
> > driver-specific fd with its own mmap implementation, then we do not
> > need any special fs nor a separate infrastructure. Well, except for a
> > library function to handle the MM side of the thing.
> 
> If we embed this functionality into device specific mmap calls it will
> closely tie the usage to the devices.  However, don't we still have to
> worry about potential interaction with other parts of the mm as you mention
> below?  I guess that would be the library function and how it is used
> by drivers.

Yes, those problems with pinning large amounts of contiguous memory are
simply inherent. You have to be really careful when allowing large
portions of contiguous memory to be reserved, especially if this is going
to be a very dynamic allocator. The main advantage of the per
device mmap is that it has its access control by default via file
permissions. You can simply rule the untrusted user out of the game. You
can also implement per-device usage limits. So you have some tools to
keep the usage on a leash and evaluate potential costs vs. benefits.
That sounds much safer to me than a generic API which would have
tricky accounting and access control restrictions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 20:32                             ` Mike Kravetz
  2017-10-16 20:58                               ` Michal Hocko
@ 2017-10-16 21:03                               ` Laura Abbott
  2017-10-16 21:18                                 ` Mike Kravetz
  2017-10-17  6:59                               ` Vlastimil Babka
  2 siblings, 1 reply; 63+ messages in thread
From: Laura Abbott @ 2017-10-16 21:03 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual,
	Vlastimil Babka

On 10/16/2017 01:32 PM, Mike Kravetz wrote:
> On 10/16/2017 11:07 AM, Michal Hocko wrote:
>> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>>> Just to be clear, the posix standard talks about a typed memory object.
>>> The suggested implementation has one create a connection to the memory
>>> object to receive a fd, then use mmap as usual to get a mapping backed
>>> by contiguous pages/memory.  Of course, this type of implementation is
>>> not a requirement.
>>
>> I am not sure that the POSIX standard for typed memory is easily
>> implementable in Linux. Does any OS actually implement this API?
> 
> A quick search only reveals Blackberry QNX and PlayBook OS.
> 
> Also somewhat related.  In an earlier thread someone pointed out this
> out of tree module used for contiguous allocations in SOC (and other?)
> environments.  It even has the option of making use of CMA.
> http://processors.wiki.ti.com/index.php/CMEM_Overview
> 

If we're at the point where we're discussing CMEM, I'd like to
point out that ion (drivers/staging/android/ion) already provides an
ioctl interface to allocate CMA and other types of memory. It's
mostly used for Android as the name implies. I don't pretend the
interface is perfect but it could be useful as a discussion point
for allocation interfaces.

Thanks,
Laura

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 21:03                               ` Laura Abbott
@ 2017-10-16 21:18                                 ` Mike Kravetz
  0 siblings, 0 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-16 21:18 UTC (permalink / raw)
  To: Laura Abbott, Michal Hocko
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual,
	Vlastimil Babka

On 10/16/2017 02:03 PM, Laura Abbott wrote:
> On 10/16/2017 01:32 PM, Mike Kravetz wrote:
>> On 10/16/2017 11:07 AM, Michal Hocko wrote:
>>> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>>>> Just to be clear, the posix standard talks about a typed memory object.
>>>> The suggested implementation has one create a connection to the memory
>>>> object to receive a fd, then use mmap as usual to get a mapping backed
>>>> by contiguous pages/memory.  Of course, this type of implementation is
>>>> not a requirement.
>>>
>>> I am not sure that the POSIX standard for typed memory is easily
>>> implementable in Linux. Does any OS actually implement this API?
>>
>> A quick search only reveals Blackberry QNX and PlayBook OS.
>>
>> Also somewhat related.  In an earlier thread someone pointed out this
>> out of tree module used for contiguous allocations in SOC (and other?)
>> environments.  It even has the option of making use of CMA.
>> http://processors.wiki.ti.com/index.php/CMEM_Overview
>>
> 
> If we're at the point where we're discussing CMEM, I'd like to
> point out that ion (drivers/staging/android/ion) already provides an
> ioctl interface to allocate CMA and other types of memory. It's
> mostly used for Android as the name implies. I don't pretend the
> interface is perfect but it could be useful as a discussion point
> for allocation interfaces.

Thanks Laura,

I was just pointing out other use cases where people thought contiguous
allocations were useful.  And, it was useful enough that someone actually
wrote code to make it happen.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 20:32                             ` Mike Kravetz
  2017-10-16 20:58                               ` Michal Hocko
  2017-10-16 21:03                               ` Laura Abbott
@ 2017-10-17  6:59                               ` Vlastimil Babka
  2 siblings, 0 replies; 63+ messages in thread
From: Vlastimil Babka @ 2017-10-17  6:59 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: Guy Shattah, Christopher Lameter, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott

On 10/16/2017 10:32 PM, Mike Kravetz wrote:
> Agree.  I only wanted to point out the similarities.
> But, it does make me wonder how much of a benefit hugetlb 1G pages would
> make in the RDMA performance comparison.  The table in the presentation
> shows an average speedup of something like 27% (or so) for contiguous allocations,
> which I assume are 2GB in size.  Certainly, using hugetlb is not the ideal
> case; I am just wondering if it does help and how much.

Good point. If somebody cares about performance benefits of contiguous
memory wrt device access, they would probably want also the TLB
performance benefits of huge pages.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 12:32                             ` Michal Hocko
  2017-10-16 16:00                               ` Christopher Lameter
@ 2017-10-17 10:50                               ` Guy Shattah
  2017-10-17 10:59                                 ` Michal Hocko
  2017-10-17 13:22                                 ` Michal Nazarewicz
  1 sibling, 2 replies; 63+ messages in thread
From: Guy Shattah @ 2017-10-17 10:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka



> > On 16/10/2017 11:24, Michal Hocko wrote:
> > > On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> > > >
> > > > On 13/10/2017 19:17, Michal Hocko wrote:
> > > > > > On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
> > > > > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > > > > > > There are numerous RDMA devices that would all need the
> > > > > > > > mmap implementation. And this covers only the needs of one
> > > > > > > > subsystem. There are other use cases.
> > > > > > > That doesn't prevent providing a library function which
> > > > > > > could be reused by all those drivers. Nothing really too
> > > > > > > much different from remap_pfn_range.
> > > > > > And then in all the other use cases as well. It would be much
> > > > > > easier if mmap could give you the memory you need instead of
> > > > > > having numerous drivers improvise on their own. This is in
> > > > > > particular also useful for numerous embedded use cases where you
> need contiguous memory.
> > > > > But a generic implementation would have to deal with many issues
> > > > > as already mentioned. If you make this driver specific you can
> > > > > have access control based on fd etc... I really fail to see how
> > > > > this is any different from remap_pfn_range.
> > > > Why have several driver-specific implementations if you can
> > > > generalize the idea and implement an already existing POSIX
> > > > standard?
> > > Because users shouldn't really care, really. We do have means to get
> > > large memory and having a guaranteed large memory is a PITA. Just
> > > look at hugetlb and all the issues it exposes. And that one is
> > > preallocated and it requires the admin to make a conscious decision about
> > > the amount of the memory. You would like to establish something
> > > similar except without bounds to the size and no pre-allowed amount
> > > by an admin. This sounds just crazy to me.
> >
> > Users do care about the performance they get using devices which
> > benefit from contiguous memory allocation.  Assume a user
> > requires 700MB of contiguous memory. Then why allocate a giant (1GB)
> > page when you can allocate 700MB out of the 1GB and put the remaining
> > 300MB back in the huge-pages/small-pages pool?
> 
> I believe I have explained that part. Large pages are under admin control and
> responsibility. If you give a free ticket to large memory to any user who can
> pin that memory, then you are in serious trouble.
> 
> > > On the other hand if you make this per-device mmap implementation
> > > you can have both admin defined policy on who is allowed this memory
> > > and moreover drivers can implement their fallback strategies which
> > > best suit their needs. I really fail to see how this is any
> > > different from using specialized mmap implementations.
> > We tried doing it in the past, but the maintainer gave us a very good
> > argument:
> > " If you want to support anonymous mmaps to allocate large contiguous
> > pages work with the MM folks on providing that in a generic fashion."
> 
> Well, we can provide generic library functions for your driver to use so that
> you do not have to care about implementation details but I do not think
> exposing this API to the userspace in a generic fashion is a good idea.
> Especially when the only usecase that has been thought through so far seems
> to be a very special HW optimization.

Are you going to be OK with kernel API which implements contiguous memory allocation?
Possibly with mmap style?  Many drivers could utilize it instead of having their own weird
and possibly non-standard way to allocate contiguous memory.
Such API won't be available for user space.

We can begin by implementing the kernel API and postpone the userspace API
discussion to a future date. If the kernel API is sufficient, we might not have to
discuss userspace at all.
 

> 
> > After discussing it with people who have the same requirements as we
> > do - I totally agree with him
> >
> >
> > http://comments.gmane.org/gmane.linux.drivers.rdma/31467
> >
> > > I might be really wrong but I consider such a general purpose flag
> > > quite dangerous and future maintenance burden. At least from the
> > > hugetlb/THP history I do not see why this should be any different.
> >
> > Could you please elaborate why is it dangerous and future maintenance
> > burden?
> 
> Providing large contiguous memory ranges is not easy and we actually do not
> have any reliable way to offer such a functionality for the kernel users
> because we assume they are not that many. Basically anything larger than
> order-3 is best effort. Even constant improvements of the
> compaction still leave us with something we cannot fully rely on. And now
> you want to expose this to the userspace with basically arbitrary memory
> sizes to be supported?
> 
> But putting that aside. Pinning a lot of memory might cause many
> performance issues and misbehavior. There are still kernel users who need
> high order memory to work properly. On top of that you are basically
> allowing an untrusted user to deplete higher order pages very easily unless
> there is a clever way to enforce a per-user limit on this.

My previous suggestion (a kernel-only API) rules out untrusted userspace code.


Guy 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 10:50                               ` Guy Shattah
@ 2017-10-17 10:59                                 ` Michal Hocko
  2017-10-17 13:22                                 ` Michal Nazarewicz
  1 sibling, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2017-10-17 10:59 UTC (permalink / raw)
  To: Guy Shattah
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Joonsoo Kim, Anshuman Khandual, Laura Abbott,
	Vlastimil Babka

On Tue 17-10-17 10:50:02, Guy Shattah wrote:
[...]
> > Well, we can provide generic library functions for your driver to use so that
> > you do not have to care about implementation details but I do not think
> > exposing this API to the userspace in a generic fashion is a good idea.
> > Especially when the only usecase that has been thought through so far seems
> > to be a very special HW optimization.
> 
> Are you going to be OK with kernel API which implements contiguous
> memory allocation?

We already do have alloc_contig_range. It is a dumb allocator so it is
not very suitable for short term allocations.

> Possibly with mmap style?  Many drivers could utilize it instead of
> having their own weird and possibly non-standard way to allocate
> contiguous memory.  Such API won't be available for user space.

Yes, an mmap helper which performs and enforces some accounting would be a
good start.

> We can begin by implementing the kernel API and postpone the userspace
> API discussion to a future date. If the kernel API is sufficient, we might not
> have to discuss userspace at all.

Yeah, that was my thinking as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 10:50                               ` Guy Shattah
  2017-10-17 10:59                                 ` Michal Hocko
@ 2017-10-17 13:22                                 ` Michal Nazarewicz
  2017-10-17 14:20                                   ` Guy Shattah
  1 sibling, 1 reply; 63+ messages in thread
From: Michal Nazarewicz @ 2017-10-17 13:22 UTC (permalink / raw)
  To: Guy Shattah, Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka

On Tue, Oct 17 2017, Guy Shattah wrote:
> Are you going to be OK with kernel API which implements contiguous
> memory allocation?  Possibly with mmap style?  Many drivers could
> utilize it instead of having their own weird and possibly non-standard
> way to allocate contiguous memory.  Such API won't be available for
> user space.

What you describe sounds like CMA.  It may be far from perfect but it’s
there already and drivers which need contiguous memory can allocate it.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 13:22                                 ` Michal Nazarewicz
@ 2017-10-17 14:20                                   ` Guy Shattah
  2017-10-17 17:44                                     ` Vlastimil Babka
  2017-10-17 18:23                                     ` Mike Kravetz
  0 siblings, 2 replies; 63+ messages in thread
From: Guy Shattah @ 2017-10-17 14:20 UTC (permalink / raw)
  To: Michal Nazarewicz, Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka



> On Tue, Oct 17 2017, Guy Shattah wrote:
> > Are you going to be OK with kernel API which implements contiguous
> > memory allocation?  Possibly with mmap style?  Many drivers could
> > utilize it instead of having their own weird and possibly non-standard
> > way to allocate contiguous memory.  Such API won't be available for
> > user space.
> 
> What you describe sounds like CMA.  It may be far from perfect but it’s there
> already and drivers which need contiguous memory can allocate it.
> 

1. CMA has to be preconfigured. We're suggesting a mechanism that works 'out of the box'.
2. Due to its pre-allocation techniques, CMA imposes a limitation on maximum
   allocated memory. RDMA users often require 1GB, sometimes more.
3. CMA reserves memory in advance; our suggestion is to use existing kernel memory
   mechanisms (THP, for example) to allocate memory.

Guy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 14:20                                   ` Guy Shattah
@ 2017-10-17 17:44                                     ` Vlastimil Babka
  2017-10-17 18:23                                     ` Mike Kravetz
  1 sibling, 0 replies; 63+ messages in thread
From: Vlastimil Babka @ 2017-10-17 17:44 UTC (permalink / raw)
  To: Guy Shattah, Michal Nazarewicz, Michal Hocko
  Cc: Christopher Lameter, Mike Kravetz, linux-mm, linux-kernel,
	linux-api, Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott

On 10/17/2017 04:20 PM, Guy Shattah wrote:
> 
> 
>> On Tue, Oct 17 2017, Guy Shattah wrote:
>>> Are you going to be OK with kernel API which implements contiguous
>>> memory allocation?  Possibly with mmap style?  Many drivers could
>>> utilize it instead of having their own weird and possibly non-standard
>>> way to allocate contiguous memory.  Such API won't be available for
>>> user space.
>>
>> What you describe sounds like CMA.  It may be far from perfect but it’s there
>> already and drivers which need contiguous memory can allocate it.
>>
> 
> 1. CMA has to be preconfigured. We're suggesting a mechanism that works 'out of the box'.
> 2. Due to its pre-allocation techniques, CMA imposes a limitation on maximum
>    allocated memory. RDMA users often require 1GB, sometimes more.
> 3. CMA reserves memory in advance; our suggestion is to use existing kernel memory
>    mechanisms (THP, for example) to allocate memory.

You can already use THP, right? madvise(MADV_HUGEPAGE) increases your
chances to get the huge pages. Then you can mlock() them if you want.
And you get the TLB benefits. There's no guarantee of course, but you
shouldn't require a guarantee for MAP_CONTIG anyway, because it's for
performance reasons, not functionality. So either MAP_CONTIG would have
to fall back itself, or the userspace caller would. Or would your scenario
rather fail than perform suboptimally?

> Guy
> 
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 14:20                                   ` Guy Shattah
  2017-10-17 17:44                                     ` Vlastimil Babka
@ 2017-10-17 18:23                                     ` Mike Kravetz
  2017-10-17 19:56                                       ` Vlastimil Babka
  1 sibling, 1 reply; 63+ messages in thread
From: Mike Kravetz @ 2017-10-17 18:23 UTC (permalink / raw)
  To: Guy Shattah, Michal Nazarewicz, Michal Hocko
  Cc: Christopher Lameter, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott, Vlastimil Babka

On 10/17/2017 07:20 AM, Guy Shattah wrote:
> 
> 
>> On Tue, Oct 17 2017, Guy Shattah wrote:
>>> Are you going to be OK with kernel API which implements contiguous
>>> memory allocation?  Possibly with mmap style?  Many drivers could
>>> utilize it instead of having their own weird and possibly non-standard
>>> way to allocate contiguous memory.  Such API won't be available for
>>> user space.
>>
>> What you describe sounds like CMA.  It may be far from perfect but it’s there
>> already and drivers which need contiguous memory can allocate it.
>>
> 
> 1. CMA has to be preconfigured. We're suggesting a mechanism that works 'out of the box'.
> 2. Because of its pre-allocation technique, CMA imposes a limit on the maximum
>    allocation size. RDMA users often require 1GB, sometimes more.
> 3. CMA reserves memory in advance; our suggestion is to use existing kernel
>    memory mechanisms (THP, for example) to allocate memory.

I would not totally rule out the use of CMA.  I like the way that it reserves
memory, but does not prohibit use by others.  In addition, there can be
device (or purpose) specific reservations.

However, since reservations need to happen quite early, they are often made on
the kernel command line.  IMO, this should be avoided if possible.  There
are interfaces for arch-specific code to make reservations.  I do not know
the system initialization sequence well enough to know whether it would be
possible for driver code to make CMA reservations, but it looks doubtful.
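For reference, the command-line route mentioned above looks like the following (an illustrative fragment; the 256 MB size is an arbitrary example). A global CMA area reserved this way can then back allocations made through the DMA API, e.g. dma_alloc_coherent(), for devices without their own dedicated area:

```
# bootloader kernel command line, reserving a 256 MB global CMA area
linux /vmlinuz root=... cma=256M
```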

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-17 18:23                                     ` Mike Kravetz
@ 2017-10-17 19:56                                       ` Vlastimil Babka
  0 siblings, 0 replies; 63+ messages in thread
From: Vlastimil Babka @ 2017-10-17 19:56 UTC (permalink / raw)
  To: Mike Kravetz, Guy Shattah, Michal Nazarewicz, Michal Hocko
  Cc: Christopher Lameter, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Aneesh Kumar K . V, Joonsoo Kim,
	Anshuman Khandual, Laura Abbott

On 10/17/2017 08:23 PM, Mike Kravetz wrote:
> On 10/17/2017 07:20 AM, Guy Shattah wrote:
>> 1. CMA has to be preconfigured. We're suggesting a mechanism that works 'out of the box'.
>> 2. Because of its pre-allocation technique, CMA imposes a limit on the maximum
>>    allocation size. RDMA users often require 1GB, sometimes more.
>> 3. CMA reserves memory in advance; our suggestion is to use existing kernel
>>    memory mechanisms (THP, for example) to allocate memory.
> 
> I would not totally rule out the use of CMA.  I like the way that it reserves
> memory, but does not prohibit use by others.  In addition, there can be
> device (or purpose) specific reservations.

I think the typical CMA use case is devices that *cannot* function without
contiguous memory; typical examples IIRC are smartphone cameras on Android,
where only a single app works with the device at a given time, so it's OK
to reserve a single area for the device, and allocation is done by the
driver. Here we are talking about allocations made by potentially
multiple userspace applications, so how do we reconcile that with the
reservations? How does a single flag identify which device's area to
use? How do we prevent one process from depleting the area for other
processes? IMHO it's another indication that a generic interface is
infeasible and it should be driver-specific.

BTW, does RDMA need a specific NUMA node to work optimally? (one closest
to the device I presume?) Will it be the job of userspace to discover
and bind itself to that node, in addition to using MAP_CONTIG? Or would
that be another thing best handled by the driver?

> However, since reservations need to happen quite early it is often done on
> the kernel command line.  IMO, this should be avoided if possible.  There
> are interfaces for arch specific code to make reservations.  I do not know
> the system initialization sequence well enough to know if it would be
> possible for driver code to make CMA reservations.  But, it looks doubtful.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support
  2017-10-16 17:42                                 ` Michal Hocko
  2017-10-16 17:56                                   ` Christopher Lameter
@ 2017-10-23 15:25                                   ` David Nellans
  1 sibling, 0 replies; 63+ messages in thread
From: David Nellans @ 2017-10-23 15:25 UTC (permalink / raw)
  To: Michal Hocko, Christopher Lameter
  Cc: Guy Shattah, Mike Kravetz, linux-mm, linux-kernel, linux-api,
	Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K . V,
	Joonsoo Kim, Anshuman Khandual, Laura Abbott, Vlastimil Babka

On 10/16/2017 12:42 PM, Michal Hocko wrote:
> On Mon 16-10-17 11:00:19, Christopher Lameter wrote:
>> On Mon, 16 Oct 2017, Michal Hocko wrote:
>>> That being said, the list is far from being complete, I am pretty sure
>>> more would pop out if I thought more thoroughly. The bottom line is that
>>> while I see many problems to actually implement this feature and
>>> maintain it longterm I simply do not see a large benefit outside of a
>>> very specific HW.
>> There is not much new here in terms of problems. The hardware that
>> needs this seems to become more and more plentiful. That is why we need a
>> generic implementation.
> It would really help to name that HW and other potential usecases
> independent on the HW because I am rather skeptical about the
> _plentiful_ part. And so I really do not see any foundation to claim
> the generic part. Because, fundamentally, it is the HW which requires
> the specific memory placement/physically contiguous range etc. So the
> generic implementation doesn't really make sense in such a context.
>

There are TLBs in AMD Zen that can take advantage of contiguous memory to
improve TLB coverage.  AFAIK contiguity is not functionally required; it's
purely a performance optimization.  The current Zen TLB implementation
doesn't support arbitrary contiguous lengths, page sizes, etc., but it's a
start.  This type of TLB optimization can be handled on the back end by
de-fragmenting physical memory (when possible) now that both base pages and
THPs can be easily migrated; no need for up-front contiguity, but defrag
isn't free either.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
                   ` (3 preceding siblings ...)
  2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
@ 2017-10-23 22:10 ` Dave Hansen
  2017-10-24 22:49   ` Mike Kravetz
  4 siblings, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2017-10-23 22:10 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter

On 10/03/2017 04:56 PM, Mike Kravetz wrote:
> mmap(MAP_CONTIG) would have the following semantics:
> - The entire mapping (length size) would be backed by physically contiguous
>   pages.
> - If 'length' physically contiguous pages can not be allocated, then mmap
>   will fail.
> - MAP_CONTIG only works with MAP_ANONYMOUS mappings.
> - MAP_CONTIG will lock the associated pages in memory.  As such, the same
>   privileges and limits that apply to mlock will also apply to MAP_CONTIG.
> - A MAP_CONTIG mapping can not be expanded.

Do you also need to lock out the NUMA migration APIs somehow?  What
about KSM (or does it already ignore VM_LOCKED)?

> - At fork time, private MAP_CONTIG mappings will be converted to regular
>   (non-MAP_CONTIG) mapping in the child.  As such a COW fault in the child
>   will not require a contiguous allocation.
Maybe we should just define it as acting as if it had MADV_DONTFORK set
on it, and also that it doesn't allow MADV_DONTFORK to be called on it.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] mmap(MAP_CONTIG)
  2017-10-23 22:10 ` [RFC] mmap(MAP_CONTIG) Dave Hansen
@ 2017-10-24 22:49   ` Mike Kravetz
  0 siblings, 0 replies; 63+ messages in thread
From: Mike Kravetz @ 2017-10-24 22:49 UTC (permalink / raw)
  To: Dave Hansen, linux-mm, linux-kernel, linux-api
  Cc: Marek Szyprowski, Michal Nazarewicz, Aneesh Kumar K.V,
	Joonsoo Kim, Guy Shattah, Christoph Lameter

On 10/23/2017 03:10 PM, Dave Hansen wrote:
> On 10/03/2017 04:56 PM, Mike Kravetz wrote:
>> mmap(MAP_CONTIG) would have the following semantics:
>> - The entire mapping (length size) would be backed by physically contiguous
>>   pages.
>> - If 'length' physically contiguous pages can not be allocated, then mmap
>>   will fail.
>> - MAP_CONTIG only works with MAP_ANONYMOUS mappings.
>> - MAP_CONTIG will lock the associated pages in memory.  As such, the same
>>   privileges and limits that apply to mlock will also apply to MAP_CONTIG.
>> - A MAP_CONTIG mapping can not be expanded.
> 
> Do you also need to lock out the NUMA migration APIs somehow?  What
> about KSM (or does it already ignore VM_LOCKED)?

Yes, and no.
The primary use case driving this request is RDMA.  As such, the pages
can not move while being used for this purpose.

When this thread was started, the thought was that generic mmap would
handle the contiguous allocations.  The resulting pages would then be
handed to the driver for additional setup based on its specific needs.
Since then, the thinking has shifted to having the driver handle contiguous
allocations as well.  I am looking at making the existing contiguous memory
allocator more usable for driver writers.

-- 
Mike Kravetz

> 
>> - At fork time, private MAP_CONTIG mappings will be converted to regular
>>   (non-MAP_CONTIG) mapping in the child.  As such a COW fault in the child
>>   will not require a contiguous allocation.
> Maybe we should just define it as acting as if it had MADV_DONTFORK set
> on it, and also that it doesn't allow MADV_DONTFORK to be called on it.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2017-10-24 22:49 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-03 23:56 [RFC] mmap(MAP_CONTIG) Mike Kravetz
2017-10-04 11:54 ` Michal Nazarewicz
2017-10-04 17:08   ` Mike Kravetz
2017-10-04 21:29     ` Laura Abbott
2017-10-04 13:49 ` Anshuman Khandual
2017-10-04 16:05   ` Christopher Lameter
2017-10-04 17:38     ` Mike Kravetz
2017-10-04 17:35   ` Mike Kravetz
2017-10-05  7:06 ` Vlastimil Babka
2017-10-05 14:30   ` Christopher Lameter
2017-10-12  1:46 ` [RFC PATCH 0/3] Add mmap(MAP_CONTIG) support Mike Kravetz
2017-10-12  1:46   ` [RFC PATCH 1/3] mm/map_contig: Add VM_CONTIG flag to vma struct Mike Kravetz
2017-10-12  1:46   ` [RFC PATCH 2/3] mm/map_contig: Use pre-allocated pages for VM_CONTIG mappings Mike Kravetz
2017-10-12 11:04     ` Anshuman Khandual
2017-10-12  1:46   ` [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support Mike Kravetz
2017-10-12 11:22     ` Anshuman Khandual
2017-10-13 15:14       ` Christopher Lameter
2017-10-12 14:37     ` Michal Hocko
2017-10-12 17:19       ` Mike Kravetz
2017-10-13  8:40         ` Michal Hocko
2017-10-13 15:20           ` Christopher Lameter
2017-10-13 15:28             ` Michal Hocko
2017-10-13 15:42               ` Christopher Lameter
2017-10-13 15:47                 ` Michal Hocko
2017-10-13 15:56                   ` Christopher Lameter
2017-10-13 16:17                     ` Michal Hocko
2017-10-15  7:50                       ` Guy Shattah
2017-10-16  8:24                         ` Michal Hocko
2017-10-16  9:11                           ` Guy Shattah
2017-10-16 12:32                             ` Michal Hocko
2017-10-16 16:00                               ` Christopher Lameter
2017-10-16 17:42                                 ` Michal Hocko
2017-10-16 17:56                                   ` Christopher Lameter
2017-10-16 18:17                                     ` Michal Hocko
2017-10-23 15:25                                   ` David Nellans
2017-10-17 10:50                               ` Guy Shattah
2017-10-17 10:59                                 ` Michal Hocko
2017-10-17 13:22                                 ` Michal Nazarewicz
2017-10-17 14:20                                   ` Guy Shattah
2017-10-17 17:44                                     ` Vlastimil Babka
2017-10-17 18:23                                     ` Mike Kravetz
2017-10-17 19:56                                       ` Vlastimil Babka
2017-10-16 10:33                         ` Michal Nazarewicz
2017-10-16 11:09                           ` Guy Shattah
2017-10-16 17:43                         ` Mike Kravetz
2017-10-16 18:07                           ` Michal Hocko
2017-10-16 20:32                             ` Mike Kravetz
2017-10-16 20:58                               ` Michal Hocko
2017-10-16 21:03                               ` Laura Abbott
2017-10-16 21:18                                 ` Mike Kravetz
2017-10-17  6:59                               ` Vlastimil Babka
2017-10-15  6:58                   ` Pavel Machek
2017-10-16  8:18                     ` Michal Hocko
2017-10-16  9:54                       ` Pavel Machek
2017-10-16 12:18                         ` Michal Hocko
2017-10-16 16:02                           ` Christopher Lameter
2017-10-16 17:33                             ` Michal Hocko
2017-10-16 17:53                               ` Christopher Lameter
2017-10-15  8:07     ` Guy Shattah
2017-10-12 10:36   ` [RFC PATCH 0/3] " Anshuman Khandual
2017-10-12 14:25     ` Anshuman Khandual
2017-10-23 22:10 ` [RFC] mmap(MAP_CONTIG) Dave Hansen
2017-10-24 22:49   ` Mike Kravetz
