linux-kernel.vger.kernel.org archive mirror
* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
       [not found] ` <6b639da5-ad9a-158c-ad4a-7a4e44bd98fc@gmx.de>
@ 2017-10-20 22:42   ` Mike Kravetz
  2017-10-23 11:42     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2017-10-20 22:42 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: linux-mm, linux-kernel, Michal Hocko, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 10/19/2017 12:34 AM, C.Wehrmeyer wrote:
> I apologise in case this message is going to arrive multiple times at the mailing list. I've had connection problems this morning while trying to push it through regardless, but it might or might not have been sent properly. I'm sorry for the inconvenience.
> 
> On 2017-10-08 18:47 Mike Kravetz wrote:
>> You are correct.  That check in function vma_to_resize() will prevent
>> mremap from growing or relocating hugetlb backed mappings.  This check
>> existed in the 2.6.0 linux kernel, so this restriction has existed for
>> a very long time.  I'm guessing that growing or relocating a hugetlb
>> mapping was never allowed.  Perhaps the mremap man page should list this
>> restriction.
> 
> I do not see any such mention:
> 
> http://man7.org/linux/man-pages/man2/mremap.2.html
> 
> The author(s) deliberately use the term "page aligned", without specifying the page size that was used when creating the initial mapping. And even more:
> 
>> mremap() uses the Linux page table scheme.  mremap() changes the
>> mapping between virtual addresses and memory pages.  This can be used
>> to implement a very efficient realloc(3).
> 
> There is not much of a very efficient realloc(3) left if you cannot modify mappings with a higher page size, is there?
> 
>> Is there a specific use case where the ability to grow hugetlb mappings
>> is desired?  Adding this functionality would involve more than simply
>> removing the above if statement.  One area of concern would be hugetlb
>> huge page reservations.  If there is a compelling use case, adding the
>> functionality may be worth consideration.  If not, I suggest we just
>> document the limitation.
> 
> Paging was introduced to the x86 processor family with the 80386 in 1985, with 4 KiB pages by default. It's been 32 years since then, and modern CPUs in the consumer market have support for 2 MiB and 1 GiB pages, and yet default allocators usually just stick to the default without bothering to check whether hugepages are actually available.
> 
> One 2-MiB page removes 512 4-KiB pages from the TLB, seeing as at least my TLBs are specialised in buffering one type of pages. I'm certain that at some point in the future the need for deliberately reserving hugepages via the kernel interface is going to be removed, and hugepages will become the usual way of allocating memory.
> 
> As for the specific use case: I've written my own allocator that is not bound by the same limitations as the usual malloc/realloc/free allocators. As such I want to be able to eliminate as many page walks as possible.
> 
> Just accepting the limitation would put Linux down on the same level as the Windows API, where no VirtualRealloc exists. My allocator needs to work with Linux and Windows; for the latter one I'm already managing a table of consecutive mappings in user-space that, if a relocation has to be made, creates an entirely new mapping into which the data of the previous mappings is copied. This is redundant, because the kernel and the process keep their own copies of the mapping table, and this is slow because the kernel could just re-adjust the position within the address space, whereas the process has to memcpy all the data from the old to the new mappings.
> 
> Those are the very problems mremap was supposed to remove in the first place. Making the limitation documented is the lazy way that will force implementers to work around it.

mremap has never supported moving or growing hugetlb mappings.  Someone
(before git history) added this explicit check to the mremap code.  Perhaps
it was done when huge page support was introduced?
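
For illustration, a minimal C sketch of the behaviour under discussion.
It assumes 2 MiB huge pages and at least one page reserved through
nr_hugepages; try_grow_hugetlb_mapping and HPAGE_SIZE are hypothetical
names and are not taken from the original report. On kernels carrying the
vma_to_resize() check described above, the mremap call is expected to
fail with EINVAL; other kernel versions may behave differently.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for mremap() and MAP_HUGETLB */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024) /* assumed huge page size */

/*
 * Hypothetical probe: map one hugetlb page and try to grow it to two.
 * Returns -1 if the hugetlb mapping could not be created at all,
 * 0 if mremap succeeded, or the errno value reported by mremap.
 */
static int try_grow_hugetlb_mapping(void)
{
    void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return -1; /* e.g. no huge pages reserved */

    void *q = mremap(p, HPAGE_SIZE, 2 * HPAGE_SIZE, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) {
        int err = errno; /* save before munmap can change it */
        munmap(p, HPAGE_SIZE);
        return err;
    }

    munmap(q, 2 * HPAGE_SIZE);
    return 0;
}
```

A caller can compare the return value against EINVAL to detect the
restriction and fall back to allocating a new mapping and copying.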

I am of the opinion that we should simply document this limitation.  AFAIK,
this is the first time anyone has asked about it in 15 years.  What is the
opinion of others?

From a 'scope of work' perspective, I think moving hugetlb mappings should
be pretty straightforward.  The bigger issue is in growing, and managing
huge page reservations when growing.

-- 
Mike Kravetz

> 
> As for any kind of speed penalty that this might introduce (because flags have to be checked, interfaces to be changed, and constants to be replaced): hugepages will also remove the need to allocate memory. My allocator just doesn't call the kernel each time it requires memory, but only when it is absolutely necessary. That necessity can be postponed the larger the mapping is that I can allocate in one go.
> 


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-20 22:42   ` PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL Mike Kravetz
@ 2017-10-23 11:42     ` Michal Hocko
  2017-10-23 12:22       ` C.Wehrmeyer
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-23 11:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: C.Wehrmeyer, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Fri 20-10-17 15:42:25, Mike Kravetz wrote:
> On 10/19/2017 12:34 AM, C.Wehrmeyer wrote:
[...]
> > As for the specific use case: I've written my own allocator that is
> > not bound by the same limitations as the usual malloc/realloc/free
> > allocators. As such I want to be able to eliminate as many
> > page walks as possible.
> >
> > Just accepting the limitation would put Linux down on the same level
> > as the Windows API, where no VirtualRealloc exists. My allocator
> > needs to work with Linux and Windows; for the latter one I'm already
> > managing a table of consecutive mappings in user-space that, if
> > a relocation has to be made, creates an entirely new mapping
> > into which the data of the previous mappings is copied. This is
> > redundant, because the kernel and the process keep their own copies
> > of the mapping table, and this is slow because the kernel could just
> > re-adjust the position within the address space, whereas the process
> > has to memcpy all the data from the old to the new mappings.
> >
> > Those are the very problems mremap was supposed to remove in the
> > first place. Making the limitation documented is the lazy way that
> > will force implementers to work around it.
> 
> mremap has never supported moving or growing hugetlb mappings.  Someone
> (before git history) added this explicit check to the mremap code.  Perhaps
> it was done when huge page support was introduced?

yes, that is the case.
 
> I am of the opinion that we should simply document this limitation.  AFAIK,
> this is the first time anyone has asked about it in 15 years.  What is the
> opinion of others?

I do not remember any such request either. I can see some merit in the
described use case. It is not specific about why hugetlb pages are used for
the allocator memory because that comes with its own issues. If somebody
is really thrilled enough to implement the remapping feature for
hugetlb I wouldn't be opposed as long as the implementation is clean and
wouldn't add an additional mess to the code base. I suspect that the vma
enlarging might be a hard deal. Starting with a documentation
update sounds like a good thing anyway. In any case such a feature will
be available only for new kernels so people should be aware of the state
on older kernels.

-- 
Michal Hocko
SUSE Labs


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 11:42     ` Michal Hocko
@ 2017-10-23 12:22       ` C.Wehrmeyer
  2017-10-23 12:41         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-23 12:22 UTC (permalink / raw)
  To: Michal Hocko, Mike Kravetz
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A. Shutemov,
	Vlastimil Babka

On 2017-10-23 13:42, Michal Hocko wrote:
> I do not remember any such request either. I can see some merit in the
> described use case. It is not specific about why hugetlb pages are used for
> the allocator memory because that comes with its own issues.

That is yet for the user to specify. Hugepages currently require a
special setup that not everyone might have - to my knowledge
a kernel being compiled with CONFIG_TRANSPARENT_HUGEPAGE=y and a number 
of such pages being allocated either through the kernel boot line or 
through /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. I'm 
deliberately ignoring 1-GiB pages here because those are only 
allocatable during boot, when no processes have been spawned and memory 
is still not fragmented.

My point is that I can see people not being too eager to support 1 GiB 
pages as of now except for very specific use cases. 2-MiB pages, on the
other hand, shouldn't have those limitations anymore. User-space 
programs should be capable of allocating such pages without the need for 
the user to fiddle with nr_hugepages beforehand.

Some time ago I wrote some code to detect TLB capabilities on my
current testing CPU; these are the results:

[TLB] Instruction TLB: 2M/4M pages, fully associative, 8 entries
[TLB] Data TLB: 4 KByte pages, 4-way set associative, 64 entries
[TLB] Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte pages, 4-way set associative, 4 entries
[TLB] Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
[STLB] Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries

With the knowledge that allocations in the Mebibyte range aren't 
uncommon at all nowadays and that one 2-MiB page eliminates the need for 
512 4-KiB pages, we really should make advances towards treating 2-MiB
pages just as casually as regular 4-KiB pages. Allocators can still query if the
kernel supports the specified page size, and specifying MAP_HUGETLB | 
MAP_HUGE_2MB would still be required in order to not break older 
programs, but from my perspective there is a lot to gain here.


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 12:22       ` C.Wehrmeyer
@ 2017-10-23 12:41         ` Michal Hocko
  2017-10-23 14:00           ` C.Wehrmeyer
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-23 12:41 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Mon 23-10-17 14:22:30, C.Wehrmeyer wrote:
> On 2017-10-23 13:42, Michal Hocko wrote:
> > I do not remember any such request either. I can see some merit in the
> > described use case. It is not specific about why hugetlb pages are used for
> > the allocator memory because that comes with its own issues.
> 
> That is yet for the user to specify. As of now hugepages still require a
> special setup that not all people might have as of now - to my knowledge a
> kernel being compiled with CONFIG_TRANSPARENT_HUGEPAGE=y and a number of
> such pages being allocated either through the kernel boot line or through

CONFIG_TRANSPARENT_HUGEPAGE has nothing to do with hugetlb pages. These
are THP which do not need any special configuration and mremap works on
them.

> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. I'm deliberately
> ignoring 1-GiB pages here because those are only allocatable during boot,
> when no processes have been spawned and memory is still not fragmented.

This is no longer true. GB pages can be allocated during runtime as
well.
 
> My point is that I can see people not being too eager to support 1 GiB pages
> as of now except for very specific use cases.

1G or 2M pages make absolutely no difference to the mremap semantics.
It is just a pte to be updated. The problem at hand is that the hugetlb
implementation is far from straightforward and the lack of mremap is
mainly caused by implementation details (like reservations I presume).

> 2-MiB pages, on the other hand,
> shouldn't have those limitations anymore. User-space programs should be
> capable of allocating such pages without the need for the user to fiddle
> with nr_hugepages beforehand.

And that is what we have THP for...

[...]

> With the knowledge that allocations in the Mebibyte range aren't uncommon at
> all nowadays and that one 2-MiB page eliminates the need for 512 4-KiB
> pages, we really should make advances towards treating 2-MiB pages just as
> casually as regular 4-KiB pages. Allocators can still query if the kernel supports the
> specified page size, and specifying MAP_HUGETLB | MAP_HUGE_2MB would still
> be required in order to not break older programs, but from my perspective
> there is a lot to gain here.

I can see your sentiment here but hugetlb has never really been a fully
featured type of memory. A general purpose allocator playing with hugetlb
pages is rather tricky and I would be really cautious there. I would
rather play with THP to reduce the TLB footprint.

So by all means, mremap _should_ work with hugetlb pages but the
additional implementation and the potential complexity should have a
strong use case. If mremap with old_size == new_size can be implemented
trivially then I am not really against it, but a full featured mremap is
not worth it IMHO.
-- 
Michal Hocko
SUSE Labs


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 12:41         ` Michal Hocko
@ 2017-10-23 14:00           ` C.Wehrmeyer
  2017-10-23 16:13             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-23 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 2017-10-23 14:41, Michal Hocko wrote:
> On Mon 23-10-17 14:22:30, C.Wehrmeyer wrote:
>> On 2017-10-23 13:42, Michal Hocko wrote:
>>> I do not remember any such request either. I can see some merit in the
>>> described use case. It is not specific about why hugetlb pages are used for
>>> the allocator memory because that comes with its own issues.
>>
>> That is yet for the user to specify. As of now hugepages still require a
>> special setup that not all people might have as of now - to my knowledge a
>> kernel being compiled with CONFIG_TRANSPARENT_HUGEPAGE=y and a number of
>> such pages being allocated either through the kernel boot line or through
> 
> CONFIG_TRANSPARENT_HUGEPAGE has nothing to do with hugetlb pages. These
> are THP which do not need any special configuration and mremap works on
> them.

I was not aware of the fact that HP != THP, so thank you for clarifying 
that.

> This is no longer true. GB pages can be allocated during runtime as
> well.

I didn't know that either. I just knew that the last time I tested this it
was not possible.

>> 2-MiB pages, on the other hand,
>> shouldn't have those limitations anymore. User-space programs should be
>> capable of allocating such pages without the need for the user to fiddle
>> with nr_hugepages beforehand.
> 
> And that is what we have THP for...

Then I might have been using it incorrectly? I've been digging through
Documentation/vm/transhuge.txt after you pointed this out, and
verified that the kernel is configured to use THPs pretty much always,
without requiring madvise:

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

And just to be very sure I've added:

if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
         errno_tmp = errno;
         fprintf(stderr,"madvise: %u\n",errno_tmp);
         goto out;
}

/*Make sure the mapping is actually used*/
memset(buf1,'!',ALLOC_SIZE_1);

/*Give me time for monitoring*/
sleep(2000);

right after the mmap call. I've also made sure that nothing is being 
optimised away by the compiler. With a 2MiB mapping being requested this 
should be a good opportunity for the kernel, and yet when I try to
figure out how many THPs my process uses:

$ cat /proc/21986/smaps  | grep 'AnonHuge'

I just end up with lots of:

AnonHugePages:         0 kB

And cat /proc/meminfo | grep 'Huge' doesn't change significantly
either. Am I just doing something wrong here, or shouldn't I trust the THP
mechanisms to actually allocate hugepages for me?
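
The same check can be done from inside the test program instead of
grepping by hand. What follows is only a sketch: sum_anon_huge_kb is a
hypothetical helper name, and it assumes the usual smaps line format
"AnonHugePages:  <n> kB" as shown above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: add up all "AnonHugePages:" fields found in an
 * smaps-formatted stream and return the total in kB.
 */
static long sum_anon_huge_kb(FILE *f)
{
    char line[512];
    long total = 0;
    long kb;

    while (fgets(line, sizeof line, f)) {
        /* Whitespace in the format matches any run of blanks. */
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            total += kb;
    }
    return total;
}
```

Opening /proc/self/smaps with fopen and passing the stream to this
helper right after the memset would show whether any part of the
process is currently backed by THPs.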

> General purpose allocator playing with hugetlb
> pages is rather tricky and I would be really cautious there. I would
> rather play with THP to reduce the TLB footprint.

May one ask why you'd recommend to be cautious here? I understand that 
actual huge pages can slow down certain things - swapping comes to mind 
immediately, which is probably the reason why Linux (used to?) lock such 
pages in memory as well.

I once again want to emphasise that this is my first time writing to the 
mailing list. It might be redundant, but I'm not yet used to any 
conventions or technical details you're familiar with.


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 14:00           ` C.Wehrmeyer
@ 2017-10-23 16:13             ` Michal Hocko
  2017-10-23 16:46               ` C.Wehrmeyer
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-23 16:13 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Mon 23-10-17 16:00:13, C.Wehrmeyer wrote:
[...]
> > And that is what we have THP for...
> 
> Then I might have been using it incorrectly? I've been digging through
> Documentation/vm/transhuge.txt after your initial pointing out, and verified
> that the kernel uses THPs pretty much always, without the usage of madvise:
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

OK

> And just to be very sure I've added:
> 
> if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
>         errno_tmp = errno;
>         fprintf(stderr,"madvise: %u\n",errno_tmp);
>         goto out;
> }
> 
> /*Make sure the mapping is actually used*/
> memset(buf1,'!',ALLOC_SIZE_1);

Is the buffer aligned to 2MB?
 
> /*Give me time for monitoring*/
> sleep(2000);
> 
> right after the mmap call. I've also made sure that nothing is being
> optimised away by the compiler. With a 2MiB mapping being requested this
> should be a good opportunity for the kernel, and yet when I try to figure
> out how many THPs my processes uses:
> 
> $ cat /proc/21986/smaps  | grep 'AnonHuge'
> 
> I just end up with lots of:
> 
> AnonHugePages:         0 kB
> 
> And cat /proc/meminfo | grep 'Huge' doesn't change significantly as well. Am
> I just doing something wrong here, or shouldn't I trust the THP mechanisms
> to actually allocate hugepages for me?

If the mapping is aligned properly then the rest is up to system and
availability of large physically contiguous memory blocks.

> > General purpose allocator playing with hugetlb
> > pages is rather tricky and I would be really cautious there. I would
> > rather play with THP to reduce the TLB footprint.
> 
> May one ask why you'd recommend to be cautious here? I understand that
> actual huge pages can slow down certain things - swapping comes to mind
> immediately, which is probably the reason why Linux (used to?) lock such
> pages in memory as well.

THP shouldn't cause any significant slowdown or other issues (these
days). The main reason for the static preallocated huge page pool
(hugetlb) was a guarantee of huge page availability. Such a
pool is not reclaimable. This brings obvious issues, e.g.
unreclaimable huge pages reduce the amount of usable memory for the
rest of the system, so you should really think about how much to reserve
to not get into low-memory situations. That makes general purpose use of
hugetlb pages rather challenging.

THP on the other hand can come and go as the system is able to
create/keep them without userspace involvement. You can hint a range by
madvise and the system will try harder to give you THP.
-- 
Michal Hocko
SUSE Labs


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 16:13             ` Michal Hocko
@ 2017-10-23 16:46               ` C.Wehrmeyer
  2017-10-23 16:57                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-23 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 23-10-17 18:13, Michal Hocko wrote:
> On Mon 23-10-17 16:00:13, C.Wehrmeyer wrote:
>> And just to be very sure I've added:
>>
>> if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
>>          errno_tmp = errno;
>>          fprintf(stderr,"madvise: %u\n",errno_tmp);
>>          goto out;
>> }
>>
>> /*Make sure the mapping is actually used*/
>> memset(buf1,'!',ALLOC_SIZE_1);
> 
> Is the buffer aligned to 2MB?

When I omit MAP_HUGETLB for the flags that mmap receives - no.

#define ALLOC_SIZE_1 (2 * 1024 * 1024)
[...]
buf1 = mmap (
         NULL,
         ALLOC_SIZE_1,
         prot, /*PROT_READ | PROT_WRITE*/
         flags /*MAP_PRIVATE | MAP_ANONYMOUS*/,
         -1,
         0
);

In such a case buf1 usually contains addresses which are aligned to 4
KiB, such as 0x7f07d76e9000. 2-MiB-aligned addresses, such as
0x7f89f5e00000, are only produced with MAP_HUGETLB - which, if I 
understood the documentation correctly, is not the point of THPs as they 
are supposed to be transparent.

I'm not exactly sure how I'm supposed to force mmap to give me any other 
kind of address, if that is going to be your suggestion - unless I'd 
read the mapping configuration for the current process and find myself a 
spot where I can tell mmap to create a mapping for me using MAP_FIXED. 
But that wouldn't be transparent, either.

>> /*Give me time for monitoring*/
>> sleep(2000);
>>
>> right after the mmap call. I've also made sure that nothing is being
>> optimised away by the compiler. With a 2MiB mapping being requested this
>> should be a good opportunity for the kernel, and yet when I try to figure
>> out how many THPs my processes uses:
>>
>> $ cat /proc/21986/smaps  | grep 'AnonHuge'
>>
>> I just end up with lots of:
>>
>> AnonHugePages:         0 kB
>>
>> And cat /proc/meminfo | grep 'Huge' doesn't change significantly as well. Am
>> I just doing something wrong here, or shouldn't I trust the THP mechanisms
>> to actually allocate hugepages for me?
> 
> If the mapping is aligned properly then the rest is up to system and
> availability of large physically contiguous memory blocks.

I have about 5 GiB of free memory right now, and while I cannot
rule out that memory fragmentation prevents the kernel from using THP,
manually reserving 256 2-MiB pages through nr_hugepages and then freeing
them works just fine. Yes, after allocating them I checked that
nr_hugepages actually was 256. And yet, when running my program
immediately afterwards, there is no change in any of the AnonHugePages
fields that smaps exports. Also (while omitting MAP_HUGETLB) buf1 remains
aligned to 4 KiB.


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 16:46               ` C.Wehrmeyer
@ 2017-10-23 16:57                 ` Michal Hocko
  2017-10-23 17:52                   ` C.Wehrmeyer
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-23 16:57 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Mon 23-10-17 18:46:59, C.Wehrmeyer wrote:
> On 23-10-17 18:13, Michal Hocko wrote:
> > On Mon 23-10-17 16:00:13, C.Wehrmeyer wrote:
> > > And just to be very sure I've added:
> > > 
> > > if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
> > >          errno_tmp = errno;
> > >          fprintf(stderr,"madvise: %u\n",errno_tmp);
> > >          goto out;
> > > }
> > > 
> > > /*Make sure the mapping is actually used*/
> > > memset(buf1,'!',ALLOC_SIZE_1);
> > 
> > Is the buffer aligned to 2MB?
> 
> When I omit MAP_HUGETLB for the flags that mmap receives - no.
> 
> #define ALLOC_SIZE_1 (2 * 1024 * 1024)
> [...]
> buf1 = mmap (
>         NULL,
>         ALLOC_SIZE_1,
>         prot, /*PROT_READ | PROT_WRITE*/
>         flags /*MAP_PRIVATE | MAP_ANONYMOUS*/,
>         -1,
>         0
> );
> 
> In such a case buf1 usually contains addresses which are aligned to 4 KiBs,
> such as 0x7f07d76e9000. 2-MiB-aligned addresses, such as 0x7f89f5e00000, are
> only produced with MAP_HUGETLB - which, if I understood the documentation
> correctly, is not the point of THPs as they are supposed to be transparent.

yes. You can use posix_memalign or you can mmap a larger block and
munmap the initial unaligned part.
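
A minimal sketch of the second option, for illustration only.
mmap_aligned is a hypothetical helper name; the sketch assumes that
size is a multiple of the base page size and that align is a power of
two that is itself a multiple of the base page size (2 MiB in the case
discussed here).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: return an anonymous private mapping of `size`
 * bytes whose start address is a multiple of `align`, or MAP_FAILED.
 */
static void *mmap_aligned(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    if (size == 0 || size > SIZE_MAX - align)
        return MAP_FAILED;

    /* Over-allocate so that an aligned start must lie inside. */
    size_t span = size + align;
    uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    uintptr_t start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
    size_t head = start - (uintptr_t)raw; /* unaligned part in front */
    size_t tail = span - head - size;     /* leftover behind the range */

    if (head)
        munmap(raw, head);
    if (tail)
        munmap((uint8_t *)start + size, tail);

    return (void *)start;
}
```

The returned range can then be passed to madvise with MADV_HUGEPAGE as
in the snippet quoted above. Whether a THP is actually used still
depends on the system.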

[...]
-- 
Michal Hocko
SUSE Labs


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 16:57                 ` Michal Hocko
@ 2017-10-23 17:52                   ` C.Wehrmeyer
  2017-10-23 18:02                     ` Michal Hocko
  2017-10-23 18:51                     ` Mike Kravetz
  0 siblings, 2 replies; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-23 17:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 2017-10-23 18:57, Michal Hocko wrote:
> On Mon 23-10-17 18:46:59, C.Wehrmeyer wrote:
>> On 23-10-17 18:13, Michal Hocko wrote:
>>> On Mon 23-10-17 16:00:13, C.Wehrmeyer wrote:
>>>> And just to be very sure I've added:
>>>>
>>>> if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
>>>>           errno_tmp = errno;
>>>>           fprintf(stderr,"madvise: %u\n",errno_tmp);
>>>>           goto out;
>>>> }
>>>>
>>>> /*Make sure the mapping is actually used*/
>>>> memset(buf1,'!',ALLOC_SIZE_1);
>>>
>>> Is the buffer aligned to 2MB?
>>
>> When I omit MAP_HUGETLB for the flags that mmap receives - no.
>>
>> #define ALLOC_SIZE_1 (2 * 1024 * 1024)
>> [...]
>> buf1 = mmap (
>>          NULL,
>>          ALLOC_SIZE_1,
>>          prot, /*PROT_READ | PROT_WRITE*/
>>          flags /*MAP_PRIVATE | MAP_ANONYMOUS*/,
>>          -1,
>>          0
>> );
>>
>> In such a case buf1 usually contains addresses which are aligned to 4 KiBs,
>> such as 0x7f07d76e9000. 2-MiB-aligned addresses, such as 0x7f89f5e00000, are
>> only produced with MAP_HUGETLB - which, if I understood the documentation
>> correctly, is not the point of THPs as they are supposed to be transparent.
> 
> yes. You can use posix_memalign

Useless. We don't use the memory allocation structures of malloc/free, 
and yet that's exactly what this function requires us to do. The reason 
why we use mmap and mremap is to get rid of userspace-crap in the first 
place.

> or you can mmap a larger block and
> munmap the initial unaligned part.

And how is that supposed to be transparent? When I hear "transparent" I 
think of a mechanism which I can put under a system so that it benefits 
from it, while the system does not notice or at least does not need to 
be aware of it. The system also does not need to be changed for it.

This approach is even more un-transparent than providing a flag to mmap 
in order to make hugepages work correctly.


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 17:52                   ` C.Wehrmeyer
@ 2017-10-23 18:02                     ` Michal Hocko
  2017-10-24  7:41                       ` C.Wehrmeyer
  2017-10-23 18:51                     ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-23 18:02 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
[...]
> > or you can mmap a larger block and
> > munmap the initial unaligned part.
> 
> And how is that supposed to be transparent? When I hear "transparent" I
> think of a mechanism which I can put under a system so that it benefits from
> it, while the system does not notice or at least does not need to be aware
> of it. The system also does not need to be changed for it.

How do you expect to get a huge page when the mapping itself is not
properly aligned? Sure if you have a large mapping then you probably do
not care about first and last chunk being not 2MB aligned but in order
to get a THP you really need a 2MB aligned address.
-- 
Michal Hocko
SUSE Labs


* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 17:52                   ` C.Wehrmeyer
  2017-10-23 18:02                     ` Michal Hocko
@ 2017-10-23 18:51                     ` Mike Kravetz
  2017-10-24  8:09                       ` C.Wehrmeyer
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2017-10-23 18:51 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 10/23/2017 10:52 AM, C.Wehrmeyer wrote:
> On 2017-10-23 18:57, Michal Hocko wrote:
>> On Mon 23-10-17 18:46:59, C.Wehrmeyer wrote:
>>> On 23-10-17 18:13, Michal Hocko wrote:
>>>> On Mon 23-10-17 16:00:13, C.Wehrmeyer wrote:
>>>>> And just to be very sure I've added:
>>>>>
>>>>> if (madvise(buf1,ALLOC_SIZE_1,MADV_HUGEPAGE)) {
>>>>>           errno_tmp = errno;
>>>>>           fprintf(stderr,"madvise: %u\n",errno_tmp);
>>>>>           goto out;
>>>>> }
>>>>>
>>>>> /*Make sure the mapping is actually used*/
>>>>> memset(buf1,'!',ALLOC_SIZE_1);
>>>>
>>>> Is the buffer aligned to 2MB?
>>>
>>> When I omit MAP_HUGETLB for the flags that mmap receives - no.
>>>
>>> #define ALLOC_SIZE_1 (2 * 1024 * 1024)
>>> [...]
>>> buf1 = mmap (
>>>          NULL,
>>>          ALLOC_SIZE_1,
>>>          prot, /*PROT_READ | PROT_WRITE*/
>>>          flags /*MAP_PRIVATE | MAP_ANONYMOUS*/,
>>>          -1,
>>>          0
>>> );
>>>
>>> In such a case buf1 usually contains addresses which are aligned to 4 KiBs,
>>> such as 0x7f07d76e9000. 2-MiB-aligned addresses, such as 0x7f89f5e00000, are
>>> only produced with MAP_HUGETLB - which, if I understood the documentation
>>> correctly, is not the point of THPs as they are supposed to be transparent.
>>
>> yes. You can use posix_memalign
> 
> Useless. We don't use the memory allocation structures of malloc/free, and yet that's exactly what this function requires us to do. The reason why we use mmap and mremap is to get rid of userspace-crap in the first place.
> 
>> or you can mmap a larger block and
>> munmap the initial unaligned part.
> 
> And how is that supposed to be transparent? When I hear "transparent" I think of a mechanism which I can put under a system so that it benefits from it, while the system does not notice or at least does not need to be aware of it. The system also does not need to be changed for it.
> 
> This approach is even more un-transparent than providing a flag to mmap in order to make hugepages work correctly.

Well, at least this has a built-in fallback mechanism.  When using hugetlb(fs)
pages, you would need to handle the case where mremap fails due to lack of
configured huge pages.

I assume your allocator will be for somewhat general application usage.  Yet,
for the greatest reliability the user/admin will need to know at boot time how
many huge pages will be needed and set that up.
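
To illustrate the kind of fallback handling meant here, below is a minimal
userspace sketch. It assumes an anonymous private mapping and shows the
failure at mmap time; a similar pattern would apply to any later call that
needs more huge pages. The helper name map_with_fallback is made up for
this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try an explicit hugetlb mapping first and fall
 * back to base pages when the huge page pool cannot satisfy the request.
 * Returns MAP_FAILED only if both attempts fail.  If used_hugetlb is not
 * NULL, it is set to 1 when the hugetlb attempt succeeded, 0 otherwise.
 */
static void *map_with_fallback(size_t len, int *used_hugetlb)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = MAP_FAILED;

    assert(len > 0);

#ifdef MAP_HUGETLB
    /* Fails (typically with ENOMEM) if no huge pages are configured. */
    p = mmap(NULL, len, prot, flags | MAP_HUGETLB, -1, 0);
#endif
    if (used_hugetlb)
        *used_hugetlb = (p != MAP_FAILED);

    /* Fallback: the same request backed by base pages. */
    if (p == MAP_FAILED)
        p = mmap(NULL, len, prot, flags, -1, 0);

    return p;
}
```

The caller has to remember which path was taken, since the two kinds of
mapping behave differently afterwards (for example with respect to mremap).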

-- 
Mike Kravetz

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 18:02                     ` Michal Hocko
@ 2017-10-24  7:41                       ` C.Wehrmeyer
  2017-10-24  8:12                         ` Michal Hocko
  2017-10-27 14:29                         ` Vlastimil Babka
  0 siblings, 2 replies; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-24  7:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 2017-10-23 20:02, Michal Hocko wrote:
> On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
> [...]
>>> or you can mmap a larger block and
>>> munmap the initial unaligned part.
>>
>> And how is that supposed to be transparent? When I hear "transparent" I
>> think of a mechanism which I can put under a system so that it benefits from
>> it, while the system does not notice or at least does not need to be aware
>> of it. The system also does not need to be changed for it.
> 
> How do you expect to get a huge page when the mapping itself is not
> properly aligned?

There are four ways that I can think of off the top of my head, but 
only one of them would be actually transparent.

1. Provide a flag to mmap, which might be something different from 
MAP_HUGETLB. After all your question revolved merely around properly 
aligned pages - we don't want to *force* the kernel to reserve 
hugepages, we just want it to provide the proper alignment in this case. 
That wouldn't be very transparent, but it would be the easiest route to 
go (and mmap already kind-of supports such a thing).

2. Based on transparent_hugepage/enabled always churn out properly 
aligned pages. In this case madvise(MADV_HUGEPAGE) becomes obsolete - 
after all it's mmap which decides what kind of addresses we get. First 
getting *some* mapping that isn't properly aligned for hugepages and 
*then* trying to mitigate the damage by another syscall not only defies 
the meaning of "transparent", but might also be hard to implement 
kernel-side. Let's say I map 8 MiBs of memory, without mmap knowing that 
I'd prefer this to be allocated via THPs. I could either go with your 
route (map 8 MiBs and then some more, trim at the beginning and the end, 
and then tell madvise that all of that is now going to be hugepages - 
which is something that could easily be done in the kernel, especially 
with the internal knowledge about what the actual page size is and 
without all those context switches that one takes in by mapping, 
munmapping, munmapping *again* and then *madvising* the actual memory), 
or I'd go with my third option.

3. I map 8 MiBs, get some misaligned address from mmap, and then try to 
mitigate the damage by telling madvise that all that is now supposed to 
use hugepages. The dumb way of implementing this would be to split the 
mapping - one section at the beginning has 256 4-KiB pages, the next one 
utilises 3 2-MiB pages, and the last section has 256 4-KiB pages again 
(or some such), effectively equalling 8 MiBs. I don't even know if Linux 
supports variable-page-size mappings, and of course we're still carrying 
512 4-KiB pages with us that would have easily been mapped into one 
2-MiB page, which is why I call it the dumb way.

4. Like three, but a wee bit smarter: introduce another system call that 
works like madvise(MADV_HUGEPAGE), but let it return the address of a 
properly aligned mapping, thus giving userspace 4 genuine 2-MiB pages. 
Just like 3) that wouldn't be transparent, but at least it's only 4 
context switches that don't give us half-baked hugepages. However, this 
approach would effectively only be 1), just more complicated and 
un-transparent.

tl; dr:

1. Provide mmap with some sort of flag (which would be redundant IMHO) 
in order to churn out properly aligned pages (not transparent, but the 
current MAP_HUGETLB flag isn't either).
2. Based on THP enabling status always churn out properly aligned pages, 
and just failsafe to smaller pages if hugepages couldn't be allocated 
(truly transparent).
3. Map in memory, then tell madvise to make as many hugepages out of it 
as possible while still keeping the initial mapping (not transparent, 
and not sure Linux can actually do that).
4. Introduce a new system call (not transparent from the get-go) to give 
out properly aligned pages, or make them properly aligned while the 
mapping is transformed from not-properly-aligned to properly-aligned.

Your call.
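
For reference, the "map some more and trim" route mentioned under option 2
could be sketched in userspace roughly as follows. This is only an
illustration under stated assumptions: a 2-MiB THP size, an anonymous
private mapping, and a made-up helper name (mmap_hpage_aligned). It costs
up to four system calls per allocation, which is the overhead complained
about above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE ((size_t)2 * 1024 * 1024) /* assumed THP size */

/*
 * Hypothetical helper: return an anonymous mapping of len bytes whose
 * start address is HPAGE_SIZE-aligned, or MAP_FAILED.  len is assumed to
 * be a multiple of the base page size.
 */
static void *mmap_hpage_aligned(size_t len)
{
    assert((HPAGE_SIZE & (HPAGE_SIZE - 1)) == 0); /* power of two */

    if (len == 0 || len > SIZE_MAX - HPAGE_SIZE)
        return MAP_FAILED;

    /* Over-map by one huge page so an aligned start must exist inside. */
    size_t span = len + HPAGE_SIZE;
    void *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    uintptr_t addr = (uintptr_t)raw;
    uintptr_t aligned = (addr + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
    size_t head = (size_t)(aligned - addr); /* unaligned part in front */
    size_t tail = span - head - len;        /* surplus at the end */

    if (head)
        munmap(raw, head);
    if (tail)
        munmap((void *)(aligned + len), tail);

#ifdef MADV_HUGEPAGE
    /* Advisory only; the mapping stays usable if this fails. */
    (void)madvise((void *)aligned, len, MADV_HUGEPAGE);
#endif
    return (void *)aligned;
}
```

A call such as mmap_hpage_aligned(8 * 1024 * 1024) would then return an
8-MiB region starting on a 2-MiB boundary, to be released with a plain
munmap of the same length.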

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-23 18:51                     ` Mike Kravetz
@ 2017-10-24  8:09                       ` C.Wehrmeyer
  0 siblings, 0 replies; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-24  8:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 2017-10-23 20:51, Mike Kravetz wrote:
> [...]
> Well at least this has a built in fall back mechanism.  When using hugetlb(fs)
> pages, you would need to handle the case where mremap fails due to lack of
> configured huge pages.

You're missing the point. I never asked for a fall-back mechanism, even 
though it certainly has its use cases. It just isn't mine. In such a 
situation it wouldn't be hard to detect if the user requested huger 
pages, and then fall back to a smaller size. The only difference is that 
I'd have to implement it myself.

But all of that does not change the fact that it's not transparent.

> I assume your allocator will be for somewhat general application usage.

Define "general purpose" first. The allocator itself isn't transparent 
to typical malloc/realloc/free-based approaches, and it isn't so very 
deliberately.

> Yet,
> for the most reliability the user/admin will need to know at boot time how
> many huge pages will be needed and set that up.

That's what I'm trying to argue. With how much memory were typical 386s 
equipped back then? 16 MiBs? With a page size of 4 KiBs that leaves 4096 
pages to map the entirety of RAM.

My current testing box has 8 GiBs. If I were to map the entirety of my 
RAM with 2-MiB pages that would still require 4096 pages. Did anyone set 
up page pools with Linux in the 90s? Did anyone complain that 4096 
bytes are too much of a page size to effectively use memory?
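
The arithmetic behind that comparison, as a small sketch (the helper name
pages_to_cover is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

#define KIB UINT64_C(1024)
#define MIB (KIB * 1024)
#define GIB (MIB * 1024)

/* Number of pages of page_size bytes needed to cover ram_bytes,
 * rounded up to whole pages. */
static uint64_t pages_to_cover(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (ram_bytes + page_size - 1) / page_size;
}
```

With these definitions, pages_to_cover(16 * MIB, 4 * KIB) and
pages_to_cover(8 * GIB, 2 * MIB) both come out at 4096, and one 2-MiB page
replaces 512 4-KiB pages.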

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-24  7:41                       ` C.Wehrmeyer
@ 2017-10-24  8:12                         ` Michal Hocko
  2017-10-24  8:32                           ` C.Wehrmeyer
  2017-10-27 14:29                         ` Vlastimil Babka
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-10-24  8:12 UTC (permalink / raw)
  To: C.Wehrmeyer
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On Tue 24-10-17 09:41:46, C.Wehrmeyer wrote:
[...]
> 1. Provide mmap with some sort of flag (which would be redundant IMHO) in
> order to churn out properly aligned pages (not transparent, but the current
> MAP_HUGETLB flag isn't either).

You can easily implement such a thing in userspace. In fact glibc has
already done that for you.

> 2. Based on THP enabling status always churn out properly aligned pages, and
> just failsafe to smaller pages if hugepages couldn't be allocated (truly
> transparent).
> 3. Map in memory, then tell madvise to make as many hugepages out of it as
> possible while still keeping the initial mapping (not transparent, and not
> sure Linux can actually do that).

I think there is still some confusion here. The kernel will try to fault in
THP pages on properly aligned addresses. So if you create a mapping larger
than the THP size then you will get a THP (assuming the memory
is not fragmented). It is just that the unaligned addresses will get regular
pages.
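
To make that concrete, here is a small sketch that counts how many complete,
naturally aligned 2-MiB ranges fit inside a mapping; only those ranges can
be backed by a THP. It is plain address arithmetic, assumes a 2-MiB THP size
and 64-bit addresses, and the helper name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HPAGE_SIZE ((uintptr_t)2 * 1024 * 1024) /* assumed THP size */

/* Number of complete, naturally aligned HPAGE_SIZE ranges contained in
 * the address range [start, start + len). */
static size_t thp_eligible_ranges(uintptr_t start, size_t len)
{
    assert(len <= UINTPTR_MAX - start); /* range must not wrap around */

    uintptr_t end = start + len;
    /* First aligned address at or after start. */
    uintptr_t first = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    /* Last aligned address at or before end. */
    uintptr_t last = end & ~(HPAGE_SIZE - 1);

    if (last <= first)
        return 0;
    return (size_t)((last - first) / HPAGE_SIZE);
}
```

For the addresses quoted earlier in the thread: a 2-MiB mapping at
0x7f07d76e9000 contains no aligned 2-MiB range at all, which is why the
test buffer got no THP. An 8-MiB mapping at the same address contains
three, and an 8-MiB mapping at 0x7f89f5e00000 contains four.
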
-- 
Michal Hocko
SUSE Labs

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-24  8:12                         ` Michal Hocko
@ 2017-10-24  8:32                           ` C.Wehrmeyer
  0 siblings, 0 replies; 18+ messages in thread
From: C.Wehrmeyer @ 2017-10-24  8:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Vlastimil Babka

On 2017-10-24 10:12, Michal Hocko wrote:
> On Tue 24-10-17 09:41:46, C.Wehrmeyer wrote:
> [...]
>> 1. Provide mmap with some sort of flag (which would be redundant IMHO) in
>> order to churn out properly aligned pages (not transparent, but the current
>> MAP_HUGETLB flag isn't either).
> 
> You can easily implement such a thing in userspace. In fact glibc has
> already done that for you.

That's not the point. The point is that it's not *transparent*. Let me 
paraphrase your statements:

"Yes, you can have hugepages by just allocating things normally. THPs 
will then be used - maybe. Even though you might know best how much 
memory you actually require it requires you to fiddle with the mappings 
in order to get complete hugepages coverage, because mmap does not 
provide a mechanism for that. Or you can just live with your mappings 
only being half-hugepaged. How is that not transparent?"

Unfortunately the ratio (512) is big enough that I'm not completely OK 
with that. And in the distant future, when we all use 1-GiB pages, that 
ratio becomes even bigger.

> [...]
> I think there is still some confusion here. Kernel will try to fault in
> THP pages on properly aligned addresses. So if you create a larger
> mapping than the THP size then you will get a THP (assuming the memory
> is not fragmented). It is just the unaligned addresses will get regular
> pages.

OK, I wasn't sure about that one either - which is why I didn't dare to 
lay hands on the kernel. It DOES support variable-sized pages. That does 
not change the fact, however, that when THPs are enabled mmap should 
give userspace properly aligned pages, exactly to avoid those smaller pages.

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-24  7:41                       ` C.Wehrmeyer
  2017-10-24  8:12                         ` Michal Hocko
@ 2017-10-27 14:29                         ` Vlastimil Babka
  2017-10-27 17:06                           ` Mike Kravetz
  2017-10-27 17:31                           ` Kirill A. Shutemov
  1 sibling, 2 replies; 18+ messages in thread
From: Vlastimil Babka @ 2017-10-27 14:29 UTC (permalink / raw)
  To: C.Wehrmeyer, Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov

On 10/24/2017 09:41 AM, C.Wehrmeyer wrote:
> On 2017-10-23 20:02, Michal Hocko wrote:
>> On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
>> [...]
>>>> or you can mmap a larger block and
>>>> munmap the initial unaligned part.
>>>
>>> And how is that supposed to be transparent? When I hear "transparent" I
>>> think of a mechanism which I can put under a system so that it benefits from
>>> it, while the system does not notice or at least does not need to be aware
>>> of it. The system also does not need to be changed for it.
>>
>> How do you expect to get a huge page when the mapping itself is not
>> properly aligned?
> 
> There are four ways that I can think of from the top of my head, but 
> only one of them would be actually transparent.
> 
> 1. Provide a flag to mmap, which might be something different from 
> MAP_HUGETLB. After all your question revolved merely around properly 
> aligned pages - we don't want to *force* the kernel to reserve 
> hugepages, we just want it to provide the proper alignment in this case. 
> That wouldn't be very transparent, but it would be the easiest route to 
> go (and mmap already kind-of supports such a thing).

Maybe just have mmap() detect that the requested size is a multiple of
huge page size, and then align it automatically? I.e. a heuristic that
should work in 99% of the cases?

> 2. Based on transparent_hugepage/enabled always churn out properly 
> aligned pages. In this case madvise(MADV_HUGEPAGE) becomes obsolete - 

madvise(MADV_HUGEPAGE) isn't about alignment. It controls whether the
mapping can get THP pages when the system global default is set to
"madvise" (thus other mappings don't get them at all), or whether the
system will try harder to defragment memory during page fault to
instantiate a THP page, when the "defrag" option is not set to "always"
but "madvise".
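
A program can check which of these modes is active by reading the sysfs
files, where the current value is shown in square brackets (for example
"always [madvise] never"). A minimal sketch, with made-up helper names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) value of a sysfs-style line such as
 * "always [madvise] never" into out.  Returns 0 on success, -1 if no
 * bracketed value is found or out is too small. */
static int thp_active_value(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *l = strchr(line, '[');
    const char *r = l ? strchr(l, ']') : NULL;
    if (!l || !r || outlen == 0)
        return -1;

    size_t n = (size_t)(r - l - 1);
    if (n >= outlen)
        return -1;
    memcpy(out, l + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the first line of the given sysfs file and extract its active
 * value.  Returns -1 if the file cannot be read. */
static int thp_read_setting(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? thp_active_value(line, out, outlen) : -1;
}
```

The two files in question are /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag. If "enabled" reads
"madvise", only mappings marked with madvise(MADV_HUGEPAGE) are eligible.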

> after all it's mmap which decides what kind of addresses we get. First 
> getting *some* mapping that isn't properly aligned for hugepages and 
> *then* trying to mitigate the damage by another syscall not only defies 
> the meaning of "transparent", but might also be hard to implement 
> kernel-side. Let's say I map 8 MiBs of memory, without mmap knowing that 
> I'd prefer this to be allocated via THPs. I could either go with your 
> route (map 8 MiBs and then some more, trim at the beginning and the end, 
> and then tell madvise that all of that is now going to be hugepages - 
> which is something that could easily be done in the kernel, especially 
> with the internal knowledge about what the actual page size is and 
> without all those context switches that one takes in by mapping, 
> munmapping, munmapping *again* and then *madvising* the actual memory), 
> or I'd go with my third option.
> 
> 3. I map 8 MiBs, get some misaligned address from mmap, and then try to 
> mitigate the damage by telling madvise that all that is now supposed to 
> use hugepages. The dumb way of implementing this would be to split the 
> mapping - one section at the beginning has 256 4-KiB pages, the next one 
> utilises 3 2-MiB pages, and the last section has 256 4-KiB pages again 
> (or some such), effectively equalling 8 MiBs. I don't even know if Linux 
> supports variable-page-size mappings, and of course we're still carrying

Yes, Linux can combine THP huge pages and base pages in the same mapping.

> 512 4-KiB pages with us that would have easily been mapped into one 
> 2-MiB page, which is why I call it the dumb way.
> 
> 4. Like three, but a wee bit smarter: introduce another system call that 
> works like madvise(MADV_HUGEPAGE), but let it return the address of a 
> properly aligned mapping, thus giving userspace 4 genuine 2-MiB pages. 
> Just like 3) that wouldn't be transparent, but at least it's only 4 
> context switches that don't give us half-baked hugepages. However, this 
> approach would effectively only be 1), just more complicated and 
> un-transparent.
> 
> tl; dr:
> 
> 1. Provide mmap with some sort of flag (which would be redundant IMHO) 
> in order to churn out properly aligned pages (not transparent, but the 
> current MAP_HUGETLB flag isn't either).
> 2. Based on THP enabling status always churn out properly aligned pages, 
> and just failsafe to smaller pages if hugepages couldn't be allocated 
> (truly transparent).
> 3. Map in memory, then tell madvise to make as many hugepages out of it 
> as possible while still keeping the initial mapping (not transparent, 
> and not sure Linux can actually do that).
> 4. Introduce a new system call (not transparent from the get-go) to give 
> out properly aligned pages, or make them properly aligned while the 
> mapping is transformed from not-properly-aligned to properly-aligned.
> 
> Your call.
> 

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-27 14:29                         ` Vlastimil Babka
@ 2017-10-27 17:06                           ` Mike Kravetz
  2017-10-27 17:31                           ` Kirill A. Shutemov
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2017-10-27 17:06 UTC (permalink / raw)
  To: Vlastimil Babka, C.Wehrmeyer, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A. Shutemov

On 10/27/2017 07:29 AM, Vlastimil Babka wrote:
> On 10/24/2017 09:41 AM, C.Wehrmeyer wrote:
>> On 2017-10-23 20:02, Michal Hocko wrote:
>>> On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
>>> [...]
>>
>> 1. Provide a flag to mmap, which might be something different from 
>> MAP_HUGETLB. After all your question revolved merely around properly 
>> aligned pages - we don't want to *force* the kernel to reserve 
>> hugepages, we just want it to provide the proper alignment in this case. 
>> That wouldn't be very transparent, but it would be the easiest route to 
>> go (and mmap already kind-of supports such a thing).
> 
> Maybe just have mmap() detect that the requested size is a multiple of
> huge page size, and then align it automatically? I.e. a heuristic that
> should work in 99% of the cases?

We already do this for DAX (see thp_get_unmapped_area).  So, not much
code to write.  But could potentially fragment address spaces more.
We could also check to determine if the system/process/mapping is even
THP enabled before doing the alignment.

I like the idea, but still am concerned about fragmentation.  In addition,
even though applications shouldn't care where new mappings are placed it
would not surprise me that such a change will be noticeable to some.
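
As a rough sketch of the proposed policy (this is not the kernel's
thp_get_unmapped_area code, only an illustration of the two checks
discussed above, with a made-up function name and an assumed 2-MiB huge
page size):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_SIZE ((size_t)2 * 1024 * 1024) /* assumed huge page size */

/*
 * Hypothetical policy check: align a new mapping to HPAGE_SIZE only if
 * THP is enabled for it and the requested length is a non-zero multiple
 * of the huge page size.
 */
static bool want_hpage_alignment(size_t len, bool thp_enabled)
{
    assert((HPAGE_SIZE & (HPAGE_SIZE - 1)) == 0); /* power of two */

    return thp_enabled && len != 0 && (len & (HPAGE_SIZE - 1)) == 0;
}
```

Mappings that fail the check would be placed as they are today, which
limits the extra address space fragmentation to requests that can actually
benefit.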

-- 
Mike Kravetz

* Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
  2017-10-27 14:29                         ` Vlastimil Babka
  2017-10-27 17:06                           ` Mike Kravetz
@ 2017-10-27 17:31                           ` Kirill A. Shutemov
  1 sibling, 0 replies; 18+ messages in thread
From: Kirill A. Shutemov @ 2017-10-27 17:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: C.Wehrmeyer, Michal Hocko, Mike Kravetz, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A. Shutemov

On Fri, Oct 27, 2017 at 04:29:16PM +0200, Vlastimil Babka wrote:
> On 10/24/2017 09:41 AM, C.Wehrmeyer wrote:
> > On 2017-10-23 20:02, Michal Hocko wrote:
> >> On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
> >> [...]
> >>>> or you can mmap a larger block and
> >>>> munmap the initial unaligned part.
> >>>
> >>> And how is that supposed to be transparent? When I hear "transparent" I
> >>> think of a mechanism which I can put under a system so that it benefits from
> >>> it, while the system does not notice or at least does not need to be aware
> >>> of it. The system also does not need to be changed for it.
> >>
> >> How do you expect to get a huge page when the mapping itself is not
> >> properly aligned?
> > 
> > There are four ways that I can think of from the top of my head, but 
> > only one of them would be actually transparent.
> > 
> > 1. Provide a flag to mmap, which might be something different from 
> > MAP_HUGETLB. After all your question revolved merely around properly 
> > aligned pages - we don't want to *force* the kernel to reserve 
> > hugepages, we just want it to provide the proper alignment in this case. 
> > That wouldn't be very transparent, but it would be the easiest route to 
> > go (and mmap already kind-of supports such a thing).
> 
> Maybe just have mmap() detect that the requested size is a multiple of
> huge page size, and then align it automatically? I.e. a heuristic that
> should work in 99% of the cases?

Just don't bother.

Anon mappings for an application that would really benefit from THP would
grow naturally: the kernel will allocate a new mapping next to the old one
and merge them. Doing fancy things here may hurt performance due to the
growing number of VMAs.

And we already do the right thing for file mappings (tmpfs/shmem):
->get_unmapped_area would provide the right spot for the file, given the
size of the mapping and ->vm_pgoff.

-- 
 Kirill A. Shutemov
