* [RFC] Hugepage collapse in process context
@ 2021-02-17  4:24 David Rientjes
  2021-02-17  8:21 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: David Rientjes @ 2021-02-17  4:24 UTC (permalink / raw)
  To: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Michal Hocko, Matthew Wilcox, Minchan Kim,
	Vlastimil Babka
  Cc: Chris Kennelly, linux-mm

Hi everybody,

Khugepaged is slow by default: it scans at most 4096 pages every 10s.  
That's normally fine as a system-wide setting, but some applications would 
benefit from a more aggressive approach (as long as they are willing to 
pay for it).

Instead of adding priorities for eligible ranges of memory to khugepaged, 
temporarily speeding khugepaged up for the whole system, or sharding its 
work for memory belonging to a certain process, one approach would be to 
allow userspace to induce hugepage collapse.

The benefit of this approach would be that the work is done in process 
context, so its cpu cost is charged to the process that is inducing the 
collapse.  Khugepaged is not involved.

The idea was to allow userspace to induce hugepage collapse through the new 
process_madvise() call.  This allows us to collapse hugepages on behalf of 
current or another process for a vectored set of ranges.

This could be done through a new process_madvise() mode *or* it could be a 
flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
to be passed.  For example, MADV_F_SYNC.

When done, this madvise call would allocate a hugepage on the right node 
and attempt to do the collapse in process context just as khugepaged would 
otherwise do.

This would immediately be useful for a malloc implementation, for example, 
that has released its memory back to the system using MADV_DONTNEED and 
will subsequently refault the memory.  Rather than wait for khugepaged to 
come along 30m later, for example, and collapse this memory into a 
hugepage (which could take a much longer time on a very large system), an 
alternative would be to use this process_madvise() mode to induce the 
action up front.  In other words, say "I'm returning this memory to the 
application and it's going to be hot, so back it by a hugepage now rather 
than waiting until later."

It would also be useful for read-only file-backed mappings such as text 
segments.  Khugepaged should be happy: it's just less work done by generic 
kthreads that gets charged as an overall tax to everybody.

Thoughts?

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
@ 2021-02-17  8:21 ` Michal Hocko
  2021-02-18 13:43   ` Vlastimil Babka
  2021-02-17 15:49 ` Zi Yan
  2021-02-18  8:11 ` Song Liu
  2 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2021-02-17  8:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, linux-mm, linux-api

[Cc linux-api]

On Tue 16-02-21 20:24:16, David Rientjes wrote:
> Hi everybody,
> 
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> That's normally fine as a system-wide setting, but some applications would 
> benefit from a more aggressive approach (as long as they are willing to 
> pay for it).
> 
> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> temporarily speeding khugepaged up for the whole system, or sharding its 
> work for memory belonging to a certain process, one approach would be to 
> allow userspace to induce hugepage collapse.
> 
> The benefit to this approach would be that this is done in process context 
> so its cpu is charged to the process that is inducing the collapse.  
> Khugepaged is not involved.

Yes, this makes a lot of sense to me.

> Idea was to allow userspace to induce hugepage collapse through the new 
> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> current or another process for a vectored set of ranges.

Yes, madvise sounds like a good fit for the purpose.

> This could be done through a new process_madvise() mode *or* it could be a 
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> to be passed.  For example, MADV_F_SYNC.

Would this MADV_F_SYNC be applicable to other madvise modes? For most
existing madvise modes it does not seem to make much sense. We can argue
that MADV_PAGEOUT would guarantee the range was indeed reclaimed, but I am
not sure we want to provide such a strong semantic because it can limit
future reclaim optimizations.

To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
  2021-02-17  8:21 ` Michal Hocko
@ 2021-02-17 15:49 ` Zi Yan
  2021-02-18  8:11 ` Song Liu
  2 siblings, 0 replies; 16+ messages in thread
From: Zi Yan @ 2021-02-17 15:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Michal Hocko, Matthew Wilcox, Minchan Kim,
	Vlastimil Babka, Chris Kennelly, linux-mm


On 16 Feb 2021, at 23:24, David Rientjes wrote:

> Hi everybody,
>
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> That's normally fine as a system-wide setting, but some applications would
> benefit from a more aggressive approach (as long as they are willing to
> pay for it).
>
> Instead of adding priorities for eligible ranges of memory to khugepaged,
> temporarily speeding khugepaged up for the whole system, or sharding its
> work for memory belonging to a certain process, one approach would be to
> allow userspace to induce hugepage collapse.
>
> The benefit to this approach would be that this is done in process context
> so its cpu is charged to the process that is inducing the collapse.
> Khugepaged is not involved.
>
> Idea was to allow userspace to induce hugepage collapse through the new
> process_madvise() call.  This allows us to collapse hugepages on behalf of
> current or another process for a vectored set of ranges.
>
> This could be done through a new process_madvise() mode *or* it could be a
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
> to be passed.  For example, MADV_F_SYNC.
>
> When done, this madvise call would allocate a hugepage on the right node
> and attempt to do the collapse in process context just as khugepaged would
> otherwise do.
>
> This would immediately be useful for a malloc implementation, for example,
> that has released its memory back to the system using MADV_DONTNEED and
> will subsequently refault the memory.  Rather than wait for khugepaged to
> come along 30m later, for example, and collapse this memory into a
> hugepage (which could take a much longer time on a very large system), an
> alternative would be to use this process_madvise() mode to induce the
> action up front.  In other words, say "I'm returning this memory to the
> application and it's going to be hot, so back it by a hugepage now rather
> than waiting until later."
>
> It would also be useful for read-only file-backed mappings for text
> segments.  Khugepaged should be happy, it's just less work done by generic
> kthreads that gets charged as an overall tax to everybody.
>
> Thoughts?

The idea sounds great to me.

One question on how it interacts with khugepaged: will the process be excluded
from khugepaged if this process_madvise() is used on it? That could save
khugepaged some additional scanning work when someone is actively collapsing
hugepages for the process.


—
Best Regards,
Yan Zi



* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
  2021-02-17  8:21 ` Michal Hocko
  2021-02-17 15:49 ` Zi Yan
@ 2021-02-18  8:11 ` Song Liu
  2021-02-18  8:39   ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-02-18  8:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Michal Hocko, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API



> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> 
> Hi everybody,
> 
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> That's normally fine as a system-wide setting, but some applications would 
> benefit from a more aggressive approach (as long as they are willing to 
> pay for it).
> 
> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> temporarily speeding khugepaged up for the whole system, or sharding its 
> work for memory belonging to a certain process, one approach would be to 
> allow userspace to induce hugepage collapse.
> 
> The benefit to this approach would be that this is done in process context 
> so its cpu is charged to the process that is inducing the collapse.  
> Khugepaged is not involved.
> 
> Idea was to allow userspace to induce hugepage collapse through the new 
> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> current or another process for a vectored set of ranges.
> 
> This could be done through a new process_madvise() mode *or* it could be a 
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> to be passed.  For example, MADV_F_SYNC.
> 
> When done, this madvise call would allocate a hugepage on the right node 
> and attempt to do the collapse in process context just as khugepaged would 
> otherwise do.

This is a very interesting idea. One question: IIUC, the user process will 
block until all small pages in the given ranges are collapsed into THPs. What 
would happen if the memory is so fragmented that we cannot allocate that 
many huge pages? Do we need some fail-over mechanism?

> 
> This would immediately be useful for a malloc implementation, for example, 
> that has released its memory back to the system using MADV_DONTNEED and 
> will subsequently refault the memory.  Rather than wait for khugepaged to 
> come along 30m later, for example, and collapse this memory into a 
> hugepage (which could take a much longer time on a very large system), an 
> alternative would be to use this process_madvise() mode to induce the 
> action up front.  In other words, say "I'm returning this memory to the 
> application and it's going to be hot, so back it by a hugepage now rather 
> than waiting until later."
> 
> It would also be useful for read-only file-backed mappings for text 
> segments.  Khugepaged should be happy, it's just less work done by generic 
> kthreads that gets charged as an overall tax to everybody.

Mixing sync-THP with async-THP (khugepaged) could be useful when there are 
different priorities of THPs. In one of our use cases, we use THPs for both 
text and data. The ratio may look like 5 THPs for text and 2000 THPs for 
data. If the system has fewer than 2005 THPs, we wouldn't want to wait for 
all of them, but we would prioritize the THPs for text. With this new 
mechanism, we can use sync-THP for the text and async-THP for the data.

Thanks,
Song


* Re: [RFC] Hugepage collapse in process context
  2021-02-18  8:11 ` Song Liu
@ 2021-02-18  8:39   ` Michal Hocko
  2021-02-18  9:53     ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2021-02-18  8:39 UTC (permalink / raw)
  To: Song Liu
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API

On Thu 18-02-21 08:11:13, Song Liu wrote:
> 
> 
> > On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> > 
> > Hi everybody,
> > 
> > Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> > That's normally fine as a system-wide setting, but some applications would 
> > benefit from a more aggressive approach (as long as they are willing to 
> > pay for it).
> > 
> > Instead of adding priorities for eligible ranges of memory to khugepaged, 
> > temporarily speeding khugepaged up for the whole system, or sharding its 
> > work for memory belonging to a certain process, one approach would be to 
> > allow userspace to induce hugepage collapse.
> > 
> > The benefit to this approach would be that this is done in process context 
> > so its cpu is charged to the process that is inducing the collapse.  
> > Khugepaged is not involved.
> > 
> > Idea was to allow userspace to induce hugepage collapse through the new 
> > process_madvise() call.  This allows us to collapse hugepages on behalf of 
> > current or another process for a vectored set of ranges.
> > 
> > This could be done through a new process_madvise() mode *or* it could be a 
> > flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> > to be passed.  For example, MADV_F_SYNC.
> > 
> > When done, this madvise call would allocate a hugepage on the right node 
> > and attempt to do the collapse in process context just as khugepaged would 
> > otherwise do.
> 
> This is very interesting idea. One question, IIUC, the user process will 
> block until all small pages in given ranges are collapsed into THPs.

Do you mean that page faults would be blocked due to the exclusive mmap_sem?
Or is there anything else you have in mind?

> What 
> would happen if the memory is so fragmented that we cannot allocate that 
> many huge pages? Do we need some fail over mechanisms? 

IIRC khugepaged preallocates pages without holding any locks and I would
expect the same will be done for madvise as well.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-18  8:39   ` Michal Hocko
@ 2021-02-18  9:53     ` Song Liu
  2021-02-18 10:01       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-02-18  9:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API



> On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Thu 18-02-21 08:11:13, Song Liu wrote:
>> 
>> 
>>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
>>> 
>>> Hi everybody,
>>> 
>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
>>> That's normally fine as a system-wide setting, but some applications would 
>>> benefit from a more aggressive approach (as long as they are willing to 
>>> pay for it).
>>> 
>>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
>>> temporarily speeding khugepaged up for the whole system, or sharding its 
>>> work for memory belonging to a certain process, one approach would be to 
>>> allow userspace to induce hugepage collapse.
>>> 
>>> The benefit to this approach would be that this is done in process context 
>>> so its cpu is charged to the process that is inducing the collapse.  
>>> Khugepaged is not involved.
>>> 
>>> Idea was to allow userspace to induce hugepage collapse through the new 
>>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
>>> current or another process for a vectored set of ranges.
>>> 
>>> This could be done through a new process_madvise() mode *or* it could be a 
>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
>>> to be passed.  For example, MADV_F_SYNC.
>>> 
>>> When done, this madvise call would allocate a hugepage on the right node 
>>> and attempt to do the collapse in process context just as khugepaged would 
>>> otherwise do.
>> 
>> This is very interesting idea. One question, IIUC, the user process will 
>> block until all small pages in given ranges are collapsed into THPs.
> 
> Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> there anything else you have in mind?

I was thinking about memory defragmentation when the application asks for
many THPs. Say the application looks like

main()
{
	malloc();
	madvise(HUGE);
	process_madvise();
	
	/* start doing work */
}

IIUC, when process_madvise() finishes, the THPs should be ready. However, 
if defragmentation takes a long time, the process will wait in process_madvise().

Thanks,
Song


> 
>> What 
>> would happen if the memory is so fragmented that we cannot allocate that 
>> many huge pages? Do we need some fail over mechanisms? 
> 
> IIRC khugepaged preallocates pages without holding any locks and I would
> expect the same will be done for madvise as well.
> -- 
> Michal Hocko
> SUSE Labs



* Re: [RFC] Hugepage collapse in process context
  2021-02-18  9:53     ` Song Liu
@ 2021-02-18 10:01       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2021-02-18 10:01 UTC (permalink / raw)
  To: Song Liu
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API

On Thu 18-02-21 09:53:25, Song Liu wrote:
> 
> 
> > On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Thu 18-02-21 08:11:13, Song Liu wrote:
> >> 
> >> 
> >>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> >>> 
> >>> Hi everybody,
> >>> 
> >>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> >>> That's normally fine as a system-wide setting, but some applications would 
> >>> benefit from a more aggressive approach (as long as they are willing to 
> >>> pay for it).
> >>> 
> >>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> >>> temporarily speeding khugepaged up for the whole system, or sharding its 
> >>> work for memory belonging to a certain process, one approach would be to 
> >>> allow userspace to induce hugepage collapse.
> >>> 
> >>> The benefit to this approach would be that this is done in process context 
> >>> so its cpu is charged to the process that is inducing the collapse.  
> >>> Khugepaged is not involved.
> >>> 
> >>> Idea was to allow userspace to induce hugepage collapse through the new 
> >>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> >>> current or another process for a vectored set of ranges.
> >>> 
> >>> This could be done through a new process_madvise() mode *or* it could be a 
> >>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> >>> to be passed.  For example, MADV_F_SYNC.
> >>> 
> >>> When done, this madvise call would allocate a hugepage on the right node 
> >>> and attempt to do the collapse in process context just as khugepaged would 
> >>> otherwise do.
> >> 
> >> This is very interesting idea. One question, IIUC, the user process will 
> >> block until all small pages in given ranges are collapsed into THPs.
> > 
> > Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> > there anything else you have in mind?
> 
> I was thinking about memory defragmentation when the application asks for
> many THPs. Say the application looks like
> 
> main()
> {
> 	malloc();
> 	madvise(HUGE);
> 	process_madvise();
> 	
> 	/* start doing work */
> }
> 
> IIUC, when process_madvise() finishes, the THPs should be ready. However, 
> if defragmentation takes a long time, the process will wait in process_madvise().

OK, I see. The operation is definitely not free, which is to be expected.
You can do the same from a thread which can spend its time collapsing THPs.
There are still internal resources that might block others - e.g. the
above mentioned mmap_sem. We can try hard to reduce the lock hold time but
this is unlikely to be completely free of any interruption of the
workload.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-17  8:21 ` Michal Hocko
@ 2021-02-18 13:43   ` Vlastimil Babka
  2021-02-18 13:52     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2021-02-18 13:43 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm,
	linux-api, David Hildenbrand

On 2/17/21 9:21 AM, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Tue 16-02-21 20:24:16, David Rientjes wrote:
>> Hi everybody,
>> 
>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
>> That's normally fine as a system-wide setting, but some applications would 
>> benefit from a more aggressive approach (as long as they are willing to 
>> pay for it).
>> 
>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
>> temporarily speeding khugepaged up for the whole system, or sharding its 
>> work for memory belonging to a certain process, one approach would be to 
>> allow userspace to induce hugepage collapse.
>> 
>> The benefit to this approach would be that this is done in process context 
>> so its cpu is charged to the process that is inducing the collapse.  
>> Khugepaged is not involved.
> 
> Yes, this makes a lot of sense to me.
> 
>> Idea was to allow userspace to induce hugepage collapse through the new 
>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
>> current or another process for a vectored set of ranges.
> 
> Yes, madvise sounds like a good fit for the purpose.

Agreed on both points.

>> This could be done through a new process_madvise() mode *or* it could be a 
>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
>> to be passed.  For example, MADV_F_SYNC.
> 
> Would this MADV_F_SYNC be applicable to other madvise modes? Most
> existing madvise modes do not seem to make much sense. We can argue that
> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> sure we want to provide such a strong semantic because it can limit
> future reclaim optimizations.
> 
> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.

I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
MADV_WILLNEED with this semantic? But you are probably more interested in
process_madvise() anyway. There the new flag would make more sense. But there's
also David H.'s proposal for MADV_POPULATE and there might be benefit in
considering both at the same time? Should e.g. MADV_POPULATE with MADV_HUGEPAGE
have the collapse semantics? But would MADV_POPULATE be added to
process_madvise() as well? Just thinking out loud so we don't end up with more
flags than necessary; it's already confusing enough as it is.


* Re: [RFC] Hugepage collapse in process context
  2021-02-18 13:43   ` Vlastimil Babka
@ 2021-02-18 13:52     ` David Hildenbrand
  2021-02-18 22:34         ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2021-02-18 13:52 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko, David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm,
	linux-api

On 18.02.21 14:43, Vlastimil Babka wrote:
> On 2/17/21 9:21 AM, Michal Hocko wrote:
>> [Cc linux-api]
>>
>> On Tue 16-02-21 20:24:16, David Rientjes wrote:
>>> Hi everybody,
>>>
>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>> That's normally fine as a system-wide setting, but some applications would
>>> benefit from a more aggressive approach (as long as they are willing to
>>> pay for it).
>>>
>>> Instead of adding priorities for eligible ranges of memory to khugepaged,
>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>> work for memory belonging to a certain process, one approach would be to
>>> allow userspace to induce hugepage collapse.
>>>
>>> The benefit to this approach would be that this is done in process context
>>> so its cpu is charged to the process that is inducing the collapse.
>>> Khugepaged is not involved.
>>
>> Yes, this makes a lot of sense to me.
>>
>>> Idea was to allow userspace to induce hugepage collapse through the new
>>> process_madvise() call.  This allows us to collapse hugepages on behalf of
>>> current or another process for a vectored set of ranges.
>>
>> Yes, madvise sounds like a good fit for the purpose.
> 
> Agreed on both points.
> 
>>> This could be done through a new process_madvise() mode *or* it could be a
>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
>>> to be passed.  For example, MADV_F_SYNC.
>>
>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>> existing madvise modes do not seem to make much sense. We can argue that
>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>> sure we want to provide such a strong semantic because it can limit
>> future reclaim optimizations.
>>
>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> 
> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> MADV_WILLNEED with this semantic? But you are probably more interested in
> process_madvise() anyway. There the new flag would make more sense. But there's
> also David H.'s proposal for MADV_POPULATE and there might be benefit in
> considering both at the same time? Should e.g. MADV_POPULATE with MADV_HUGEPAGE
> have the collapse semantics? But would MADV_POPULATE be added to
> process_madvise() as well? Just thinking out loud so we don't end up with more
> flags than necessary, it's already confusing enough as it is.
> 

Note that madvise() eats only a single value, not flags. Combinations as 
you describe are not possible.

Something like MADV_HUGEPAGE_COLLAPSE makes sense to me, provided it does 
not need the mmap lock in write mode and does not modify the actual VMA, 
only the mapping.

-- 
Thanks,

David / dhildenb



* Re: [RFC] Hugepage collapse in process context
  2021-02-18 13:52     ` David Hildenbrand
@ 2021-02-18 22:34         ` David Rientjes
  0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2021-02-18 22:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Michal Hocko, Alex Shi, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api

On Thu, 18 Feb 2021, David Hildenbrand wrote:

> > > > Hi everybody,
> > > > 
> > > > Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> > > > That's normally fine as a system-wide setting, but some applications
> > > > would
> > > > benefit from a more aggressive approach (as long as they are willing to
> > > > pay for it).
> > > > 
> > > > Instead of adding priorities for eligible ranges of memory to
> > > > khugepaged,
> > > > temporarily speeding khugepaged up for the whole system, or sharding its
> > > > work for memory belonging to a certain process, one approach would be to
> > > > allow userspace to induce hugepage collapse.
> > > > 
> > > > The benefit to this approach would be that this is done in process
> > > > context
> > > > so its cpu is charged to the process that is inducing the collapse.
> > > > Khugepaged is not involved.
> > > 
> > > Yes, this makes a lot of sense to me.
> > > 
> > > > Idea was to allow userspace to induce hugepage collapse through the new
> > > > process_madvise() call.  This allows us to collapse hugepages on behalf
> > > > of
> > > > current or another process for a vectored set of ranges.
> > > 
> > > Yes, madvise sounds like a good fit for the purpose.
> > 
> > Agreed on both points.
> > 
> > > > This could be done through a new process_madvise() mode *or* it could be
> > > > a
> > > > flag to MADV_HUGEPAGE since process_madvise() allows for a flag
> > > > parameter
> > > > to be passed.  For example, MADV_F_SYNC.
> > > 
> > > Would this MADV_F_SYNC be applicable to other madvise modes? Most
> > > existing madvise modes do not seem to make much sense. We can argue that
> > > MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> > > sure we want to provide such a strong semantic because it can limit
> > > future reclaim optimizations.
> > > 
> > > To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> > 
> > I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> > MADV_WILLNEED with this semantic? But you are probably more interested in
> > process_madvise() anyway. There the new flag would make more sense. But
> > there's
> > also David H.'s proposal for MADV_POPULATE and there might be benefit in
> > considering both at the same time? Should e.g. MADV_POPULATE with
> > MADV_HUGEPAGE
> > have the collapse semantics? But would MADV_POPULATE be added to
> > process_madvise() as well? Just thinking out loud so we don't end up with
> > more
> > flags than necessary, it's already confusing enough as it is.
> > 
> 
> Note that madvise() eats only a single value, not flags. Combinations as you
> describe are not possible.
> 
> Something MADV_HUGEPAGE_COLLAPSE make sense to me that does not need the mmap
> lock in write and does not modify the actual VMA, only a mapping.
> 

Agreed, and happy to see that there's a general consensus on the 
direction.  The benefit of a new madvise mode is that it can be used with 
plain madvise() as well if you are interested in only a single range of 
your own memory, and then it doesn't need to reconcile with any of the 
already overloaded semantics of MADV_HUGEPAGE.

Otherwise, process_madvise() can be used for other processes and/or 
vectored ranges.

Song's use case for this to prioritize thp usage is very important for us 
as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) + 
madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter 
would allocate the hugepage with khugepaged's gfp mask so it would always 
compact.  But it seems like this would actually be better to use the gfp 
mask that would be used at fault for the vma and left to userspace to 
determine whether that's MADV_HUGEPAGE or not.  Makes sense.

(Userspace could even do madvise(MADV_NOHUGEPAGE) + 
madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but 
otherwise exclude it from khugepaged's consideration if it were inclined.)

Two other minor points:

 - Currently, process_madvise() doesn't use the flags parameter at all so 
   there's the question of whether we need generalized flags that apply to 
   most madvise modes or whether the flags can be specific to the mode 
   being used.  For example, a natural extension of this new mode would be 
   to determine the hugepage size if we were ever to support synchronous 
   collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

 - We haven't discussed the future of khugepaged with this new mode: it 
   seems like we could simply implement khugepaged fully in userspace and 
   remove it from the kernel? :)


* Re: [RFC] Hugepage collapse in process context
  2021-02-18 22:34         ` David Rientjes
  (?)
@ 2021-02-19 16:16         ` Zi Yan
  -1 siblings, 0 replies; 16+ messages in thread
From: Zi Yan @ 2021-02-19 16:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Alex Shi,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Song Liu,
	Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm, linux-api

On 18 Feb 2021, at 17:34, David Rientjes wrote:

> On Thu, 18 Feb 2021, David Hildenbrand wrote:
>
>>>>> Hi everybody,
>>>>>
>>>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>>>> That's normally fine as a system-wide setting, but some applications
>>>>> would
>>>>> benefit from a more aggressive approach (as long as they are willing to
>>>>> pay for it).
>>>>>
>>>>> Instead of adding priorities for eligible ranges of memory to
>>>>> khugepaged,
>>>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>>>> work for memory belonging to a certain process, one approach would be to
>>>>> allow userspace to induce hugepage collapse.
>>>>>
>>>>> The benefit to this approach would be that this is done in process
>>>>> context
>>>>> so its cpu is charged to the process that is inducing the collapse.
>>>>> Khugepaged is not involved.
>>>>
>>>> Yes, this makes a lot of sense to me.
>>>>
>>>>> Idea was to allow userspace to induce hugepage collapse through the new
>>>>> process_madvise() call.  This allows us to collapse hugepages on behalf
>>>>> of
>>>>> current or another process for a vectored set of ranges.
>>>>
>>>> Yes, madvise sounds like a good fit for the purpose.
>>>
>>> Agreed on both points.
>>>
>>>>> This could be done through a new process_madvise() mode *or* it could be
>>>>> a
>>>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag
>>>>> parameter
>>>>> to be passed.  For example, MADV_F_SYNC.
>>>>
>>>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>>>> existing madvise modes do not seem to make much sense. We can argue that
>>>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>>>> sure we want to provide such a strong semantic because it can limit
>>>> future reclaim optimizations.
>>>>
>>>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
>>>
>>> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
>>> MADV_WILLNEED with this semantic? But you are probably more interested in
>>> process_madvise() anyway. There the new flag would make more sense. But
>>> there's
>>> also David H.'s proposal for MADV_POPULATE and there might be benefit in
>>> considering both at the same time? Should e.g. MADV_POPULATE with
>>> MADV_HUGEPAGE
>>> have the collapse semantics? But would MADV_POPULATE be added to
>>> process_madvise() as well? Just thinking out loud so we don't end up with
>>> more
>>> flags than necessary, it's already confusing enough as it is.
>>>
>>
>> Note that madvise() eats only a single value, not flags. Combinations as you
>> describe are not possible.
>>
>> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me that does not need the mmap
>> lock held for writing and does not modify the actual VMA, only a mapping.
>>
>
> Agreed, and happy to see that there's a general consensus for the
> direction.  Benefit of a new madvise mode is that it can be used for
> madvise() as well if you are interested in only a single range of your own
> memory and then it doesn't need to reconcile with any of the already
> overloaded semantics of MADV_HUGEPAGE.
>
> Otherwise, process_madvise() can be used for other processes and/or
> vectored ranges.
>
> Song's use case for this to prioritize thp usage is very important for us
> as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) +
> madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter
> would allocate the hugepage with khugepaged's gfp mask so it would always
> compact.  But it seems like this would actually be better to use the gfp
> mask that would be used at fault for the vma and left to userspace to
> determine whether that's MADV_HUGEPAGE or not.  Makes sense.
>
> (Userspace could even do madvise(MADV_NOHUGEPAGE) +
> madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but
> otherwise exclude it from khugepaged's consideration if it were inclined.)
>
> Two other minor points:
>
>  - Currently, process_madvise() doesn't use the flags parameter at all so
>    there's the question of whether we need generalized flags that apply to
>    most madvise modes or whether the flags can be specific to the mode
>    being used.  For example, a natural extension of this new mode would be
>    to determine the hugepage size if we were ever to support synchronous
>    collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

I am very interested in adding support for sync collapse into 1GB THPs.
Here are my recent patches to support 1GB THP on x86: https://lwn.net/Articles/832881/.
Doing a sync collapse might be the best way of getting 1GB THPs, since
bumping MAX_ORDER is not good for memory hotplug and getting 1GB pages
from CMA regions, which I proposed in my patchset, seems not ideal.

>
>  - We haven't discussed the future of khugepaged with this new mode: it
>    seems like we could simply implement khugepaged fully in userspace and
>    remove it from the kernel? :)

I guess the page collapse code from khugepaged can be preserved and reused
for this madvise hugepage collapse; we just might not need to launch
a kernel daemon to do the work.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-18 22:34         ` David Rientjes
  (?)
  (?)
@ 2021-02-24  9:44         ` Alex Shi
  2021-03-01 20:56             ` David Rientjes
  -1 siblings, 1 reply; 16+ messages in thread
From: Alex Shi @ 2021-02-24  9:44 UTC (permalink / raw)
  To: David Rientjes, David Hildenbrand
  Cc: Vlastimil Babka, Michal Hocko, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Song Liu, Matthew Wilcox, Minchan Kim,
	Chris Kennelly, linux-mm, linux-api



在 2021/2/19 上午6:34, David Rientjes 写道:
> On Thu, 18 Feb 2021, David Hildenbrand wrote:
> 
>>>>> Hi everybody,
>>>>>
>>>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>>>> That's normally fine as a system-wide setting, but some applications
>>>>> would
>>>>> benefit from a more aggressive approach (as long as they are willing to
>>>>> pay for it).
>>>>>
>>>>> Instead of adding priorities for eligible ranges of memory to
>>>>> khugepaged,
>>>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>>>> work for memory belonging to a certain process, one approach would be to
>>>>> allow userspace to induce hugepage collapse.
>>>>>
>>>>> The benefit to this approach would be that this is done in process
>>>>> context
>>>>> so its cpu is charged to the process that is inducing the collapse.
>>>>> Khugepaged is not involved.
>>>>
>>>> Yes, this makes a lot of sense to me.
>>>>
>>>>> Idea was to allow userspace to induce hugepage collapse through the new
>>>>> process_madvise() call.  This allows us to collapse hugepages on behalf
>>>>> of
>>>>> current or another process for a vectored set of ranges.
>>>>
>>>> Yes, madvise sounds like a good fit for the purpose.
>>>
>>> Agreed on both points.
>>>
>>>>> This could be done through a new process_madvise() mode *or* it could be
>>>>> a
>>>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag
>>>>> parameter
>>>>> to be passed.  For example, MADV_F_SYNC.
>>>>
>>>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>>>> existing madvise modes do not seem to make much sense. We can argue that
>>>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>>>> sure we want to provide such a strong semantic because it can limit
>>>> future reclaim optimizations.
>>>>
>>>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
>>>
>>> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
>>> MADV_WILLNEED with this semantic? But you are probably more interested in
>>> process_madvise() anyway. There the new flag would make more sense. But
>>> there's
>>> also David H.'s proposal for MADV_POPULATE and there might be benefit in
>>> considering both at the same time? Should e.g. MADV_POPULATE with
>>> MADV_HUGEPAGE
>>> have the collapse semantics? But would MADV_POPULATE be added to
>>> process_madvise() as well? Just thinking out loud so we don't end up with
>>> more
>>> flags than necessary, it's already confusing enough as it is.
>>>
>>
>> Note that madvise() eats only a single value, not flags. Combinations as you
>> describe are not possible.
>>
>> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me that does not need the mmap
>> lock held for writing and does not modify the actual VMA, only a mapping.
>>
> 
> Agreed, and happy to see that there's a general consensus for the 
> direction.  Benefit of a new madvise mode is that it can be used for 
> madvise() as well if you are interested in only a single range of your own 
> memory and then it doesn't need to reconcile with any of the already 
> overloaded semantics of MADV_HUGEPAGE.

It's a good idea to let a process deal with its own THP policy, but
current applications will miss the benefit without changes, and change is
expensive for end users. So besides this work, perhaps a per-memcg collapse
could benefit apps for free; we often deploy apps in cgroups on servers now.

Thanks
Alex

> 
> Otherwise, process_madvise() can be used for other processes and/or 
> vectored ranges.
> 
> Song's use case for this to prioritize thp usage is very important for us 
> as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) + 
> madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter 
> would allocate the hugepage with khugepaged's gfp mask so it would always 
> compact.  But it seems like this would actually be better to use the gfp 
> mask that would be used at fault for the vma and left to userspace to 
> determine whether that's MADV_HUGEPAGE or not.  Makes sense.
> 
> (Userspace could even do madvise(MADV_NOHUGEPAGE) + 
> madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but 
> otherwise exclude it from khugepaged's consideration if it were inclined.)
> 
> Two other minor points:
> 
>  - Currently, process_madvise() doesn't use the flags parameter at all so 
>    there's the question of whether we need generalized flags that apply to 
>    most madvise modes or whether the flags can be specific to the mode 
>    being used.  For example, a natural extension of this new mode would be 
>    to determine the hugepage size if we were ever to support synchronous 
>    collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)
> 
>  - We haven't discussed the future of khugepaged with this new mode: it 
>    seems like we could simply implement khugepaged fully in userspace and 
>    remove it from the kernel? :)
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-24  9:44         ` Alex Shi
@ 2021-03-01 20:56             ` David Rientjes
  0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2021-03-01 20:56 UTC (permalink / raw)
  To: Alex Shi
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api

On Wed, 24 Feb 2021, Alex Shi wrote:

> > Agreed, and happy to see that there's a general consensus for the 
> > direction.  Benefit of a new madvise mode is that it can be used for 
> > madvise() as well if you are interested in only a single range of your own 
> > memory and then it doesn't need to reconcile with any of the already 
> > overloaded semantics of MADV_HUGEPAGE.
> 
> It's a good idea to let a process deal with its own THP policy, but
> current applications will miss the benefit without changes, and change is
> expensive for end users. So besides this work, perhaps a per-memcg collapse
> could benefit apps for free; we often deploy apps in cgroups on servers now.
> 

Hi Alex,

I'm not sure that I understand: this MADV_COLLAPSE would be possible for 
process_madvise() as well, passing a vectored set of ranges, so a 
process could do this on behalf of other processes (it's the only way that 
we could theoretically move khugepaged to userspace, although that's not 
an explicit end goal).

How would you see this working with memcg involved?  I had thought this 
was entirely orthogonal to any cgroup.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-03-01 20:56             ` David Rientjes
  (?)
@ 2021-03-04 10:52             ` Alex Shi
  -1 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2021-03-04 10:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api



在 2021/3/2 上午4:56, David Rientjes 写道:
> On Wed, 24 Feb 2021, Alex Shi wrote:
> 
>>> Agreed, and happy to see that there's a general consensus for the 
>>> direction.  Benefit of a new madvise mode is that it can be used for 
>>> madvise() as well if you are interested in only a single range of your own 
>>> memory and then it doesn't need to reconcile with any of the already 
>>> overloaded semantics of MADV_HUGEPAGE.
>>
>> It's a good idea to let a process deal with its own THP policy, but
>> current applications will miss the benefit without changes, and change is
>> expensive for end users. So besides this work, perhaps a per-memcg collapse
>> could benefit apps for free; we often deploy apps in cgroups on servers now.
>>
> 
> Hi Alex,
> 
> I'm not sure that I understand: this MADV_COLLAPSE would be possible for 
> process_madvise() as well, passing a vectored set of ranges, so a 
> process could do this on behalf of other processes (it's the only way that 
> we could theoretically move khugepaged to userspace, although that's not 
> an explicit end goal).
> 

Forgive my ignorance, but I still can't figure out how a process_madvise()
caller would fill in the iovec for another process's ranges on a typical system.

> 
> How would you see this working with memcg involved?  I had thought this 
> was entirely orthogonal to any cgroup.
> 

You're right, it's outside of cgroups, which is better. A per-cgroup khugepaged
could be an alternative, but it would require a cgroup and would not be
specific to a target process.

Thanks
Alex
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-03-04 10:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
2021-02-17  8:21 ` Michal Hocko
2021-02-18 13:43   ` Vlastimil Babka
2021-02-18 13:52     ` David Hildenbrand
2021-02-18 22:34       ` David Rientjes
2021-02-18 22:34         ` David Rientjes
2021-02-19 16:16         ` Zi Yan
2021-02-24  9:44         ` Alex Shi
2021-03-01 20:56           ` David Rientjes
2021-03-01 20:56             ` David Rientjes
2021-03-04 10:52             ` Alex Shi
2021-02-17 15:49 ` Zi Yan
2021-02-18  8:11 ` Song Liu
2021-02-18  8:39   ` Michal Hocko
2021-02-18  9:53     ` Song Liu
2021-02-18 10:01       ` Michal Hocko
