* [RFC] Hugepage collapse in process context
@ 2021-02-17  4:24 David Rientjes
  2021-02-17  8:21 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: David Rientjes @ 2021-02-17  4:24 UTC (permalink / raw)
  To: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Michal Hocko, Matthew Wilcox, Minchan Kim,
	Vlastimil Babka
  Cc: Chris Kennelly, linux-mm

Hi everybody,

Khugepaged is slow by default: it scans at most 4096 pages every 10s.  
That's normally fine as a system-wide setting, but some applications would 
benefit from a more aggressive approach (as long as they are willing to 
pay for it).

Instead of adding priorities for eligible ranges of memory to khugepaged, 
temporarily speeding khugepaged up for the whole system, or sharding its 
work for memory belonging to a certain process, one approach would be to 
allow userspace to induce hugepage collapse.

The benefit of this approach would be that the work is done in process 
context, so its cpu cost is charged to the process that is inducing the 
collapse.  Khugepaged is not involved.

The idea was to allow userspace to induce hugepage collapse through the new 
process_madvise() call.  This allows us to collapse hugepages on behalf of 
current or another process for a vectored set of ranges.

This could be done through a new process_madvise() mode *or* it could be a 
flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
to be passed.  For example, MADV_F_SYNC.

When done, this madvise call would allocate a hugepage on the right node 
and attempt to do the collapse in process context just as khugepaged would 
otherwise do.

This would immediately be useful for a malloc implementation, for example, 
that has released its memory back to the system using MADV_DONTNEED and 
will subsequently refault the memory.  Rather than wait for khugepaged to 
come along 30m later, for example, and collapse this memory into a 
hugepage (which could take a much longer time on a very large system), an 
alternative would be to use this process_madvise() mode to induce the 
action up front.  In other words, say "I'm returning this memory to the 
application and it's going to be hot, so back it by a hugepage now rather 
than waiting until later."

It would also be useful for read-only file-backed mappings such as text 
segments.  Khugepaged should be happy: it's just less work done by generic 
kthreads that gets charged as an overall tax to everybody.

Thoughts?

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
@ 2021-02-17  8:21 ` Michal Hocko
  2021-02-18 13:43   ` Vlastimil Babka
  2021-02-17 15:49 ` Zi Yan
  2021-02-18  8:11 ` Song Liu
  2 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2021-02-17  8:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, linux-mm, linux-api

[Cc linux-api]

On Tue 16-02-21 20:24:16, David Rientjes wrote:
> Hi everybody,
> 
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> That's normally fine as a system-wide setting, but some applications would 
> benefit from a more aggressive approach (as long as they are willing to 
> pay for it).
> 
> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> temporarily speeding khugepaged up for the whole system, or sharding its 
> work for memory belonging to a certain process, one approach would be to 
> allow userspace to induce hugepage collapse.
> 
> The benefit to this approach would be that this is done in process context 
> so its cpu is charged to the process that is inducing the collapse.  
> Khugepaged is not involved.

Yes, this makes a lot of sense to me.

> Idea was to allow userspace to induce hugepage collapse through the new 
> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> current or another process for a vectored set of ranges.

Yes, madvise sounds like a good fit for the purpose.

> This could be done through a new process_madvise() mode *or* it could be a 
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> to be passed.  For example, MADV_F_SYNC.

Would this MADV_F_SYNC be applicable to other madvise modes? For most
existing madvise modes it does not seem to make much sense. We can argue
that MADV_PAGEOUT would guarantee the range was indeed reclaimed, but I am
not sure we want to provide such a strong semantic because it can limit
future reclaim optimizations.

To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
  2021-02-17  8:21 ` Michal Hocko
@ 2021-02-17 15:49 ` Zi Yan
  2021-02-18  8:11 ` Song Liu
  2 siblings, 0 replies; 16+ messages in thread
From: Zi Yan @ 2021-02-17 15:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Michal Hocko, Matthew Wilcox, Minchan Kim,
	Vlastimil Babka, Chris Kennelly, linux-mm


On 16 Feb 2021, at 23:24, David Rientjes wrote:

> Hi everybody,
>
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> That's normally fine as a system-wide setting, but some applications would
> benefit from a more aggressive approach (as long as they are willing to
> pay for it).
>
> Instead of adding priorities for eligible ranges of memory to khugepaged,
> temporarily speeding khugepaged up for the whole system, or sharding its
> work for memory belonging to a certain process, one approach would be to
> allow userspace to induce hugepage collapse.
>
> The benefit to this approach would be that this is done in process context
> so its cpu is charged to the process that is inducing the collapse.
> Khugepaged is not involved.
>
> Idea was to allow userspace to induce hugepage collapse through the new
> process_madvise() call.  This allows us to collapse hugepages on behalf of
> current or another process for a vectored set of ranges.
>
> This could be done through a new process_madvise() mode *or* it could be a
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
> to be passed.  For example, MADV_F_SYNC.
>
> When done, this madvise call would allocate a hugepage on the right node
> and attempt to do the collapse in process context just as khugepaged would
> otherwise do.
>
> This would immediately be useful for a malloc implementation, for example,
> that has released its memory back to the system using MADV_DONTNEED and
> will subsequently refault the memory.  Rather than wait for khugepaged to
> come along 30m later, for example, and collapse this memory into a
> hugepage (which could take a much longer time on a very large system), an
> alternative would be to use this process_madvise() mode to induce the
> action up front.  In other words, say "I'm returning this memory to the
> application and it's going to be hot, so back it by a hugepage now rather
> than waiting until later."
>
> It would also be useful for read-only file-backed mappings for text
> segments.  Khugepaged should be happy, it's just less work done by generic
> kthreads that gets charged as an overall tax to everybody.
>
> Thoughts?

The idea sounds great to me.

One question on how it interacts with khugepaged: will the process be excluded
from khugepaged if this process_madvise() is used on it? That could save
khugepaged some additional scanning work when someone is actively collapsing
hugepages for the process.


—
Best Regards,
Yan Zi



* Re: [RFC] Hugepage collapse in process context
  2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
  2021-02-17  8:21 ` Michal Hocko
  2021-02-17 15:49 ` Zi Yan
@ 2021-02-18  8:11 ` Song Liu
  2021-02-18  8:39   ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-02-18  8:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Michal Hocko, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API



> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> 
> Hi everybody,
> 
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> That's normally fine as a system-wide setting, but some applications would 
> benefit from a more aggressive approach (as long as they are willing to 
> pay for it).
> 
> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> temporarily speeding khugepaged up for the whole system, or sharding its 
> work for memory belonging to a certain process, one approach would be to 
> allow userspace to induce hugepage collapse.
> 
> The benefit to this approach would be that this is done in process context 
> so its cpu is charged to the process that is inducing the collapse.  
> Khugepaged is not involved.
> 
> Idea was to allow userspace to induce hugepage collapse through the new 
> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> current or another process for a vectored set of ranges.
> 
> This could be done through a new process_madvise() mode *or* it could be a 
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> to be passed.  For example, MADV_F_SYNC.
> 
> When done, this madvise call would allocate a hugepage on the right node 
> and attempt to do the collapse in process context just as khugepaged would 
> otherwise do.

This is a very interesting idea. One question: IIUC, the user process will 
block until all small pages in the given ranges are collapsed into THPs. What 
would happen if the memory is so fragmented that we cannot allocate that 
many huge pages? Do we need some fail-over mechanism?

> 
> This would immediately be useful for a malloc implementation, for example, 
> that has released its memory back to the system using MADV_DONTNEED and 
> will subsequently refault the memory.  Rather than wait for khugepaged to 
> come along 30m later, for example, and collapse this memory into a 
> hugepage (which could take a much longer time on a very large system), an 
> alternative would be to use this process_madvise() mode to induce the 
> action up front.  In other words, say "I'm returning this memory to the 
> application and it's going to be hot, so back it by a hugepage now rather 
> than waiting until later."
> 
> It would also be useful for read-only file-backed mappings for text 
> segments.  Khugepaged should be happy, it's just less work done by generic 
> kthreads that gets charged as an overall tax to everybody.

Mixing sync-THP with async-THP (khugepaged) could be useful when there are 
different priorities of THPs. In one of our use cases, we use THPs for both 
text and data. The ratio may look like 5 THPs for text and 2000 THPs for 
data. If the system has fewer than 2005 THPs, we wouldn't want to wait for 
all of them, but we would prioritize the THPs for text. With this new 
mechanism, we can use sync-THP for the text and async-THP for the data.

Thanks,
Song


* Re: [RFC] Hugepage collapse in process context
  2021-02-18  8:11 ` Song Liu
@ 2021-02-18  8:39   ` Michal Hocko
  2021-02-18  9:53     ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2021-02-18  8:39 UTC (permalink / raw)
  To: Song Liu
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API

On Thu 18-02-21 08:11:13, Song Liu wrote:
> 
> 
> > On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> > 
> > Hi everybody,
> > 
> > Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> > That's normally fine as a system-wide setting, but some applications would 
> > benefit from a more aggressive approach (as long as they are willing to 
> > pay for it).
> > 
> > Instead of adding priorities for eligible ranges of memory to khugepaged, 
> > temporarily speeding khugepaged up for the whole system, or sharding its 
> > work for memory belonging to a certain process, one approach would be to 
> > allow userspace to induce hugepage collapse.
> > 
> > The benefit to this approach would be that this is done in process context 
> > so its cpu is charged to the process that is inducing the collapse.  
> > Khugepaged is not involved.
> > 
> > Idea was to allow userspace to induce hugepage collapse through the new 
> > process_madvise() call.  This allows us to collapse hugepages on behalf of 
> > current or another process for a vectored set of ranges.
> > 
> > This could be done through a new process_madvise() mode *or* it could be a 
> > flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> > to be passed.  For example, MADV_F_SYNC.
> > 
> > When done, this madvise call would allocate a hugepage on the right node 
> > and attempt to do the collapse in process context just as khugepaged would 
> > otherwise do.
> 
> This is very interesting idea. One question, IIUC, the user process will 
> block until all small pages in given ranges are collapsed into THPs.

Do you mean that page faults would be blocked due to the exclusive mmap_sem?
Or is there anything else you have in mind?

> What 
> would happen if the memory is so fragmented that we cannot allocate that 
> many huge pages? Do we need some fail over mechanisms? 

IIRC khugepaged preallocates pages without holding any locks and I would
expect the same will be done for madvise as well.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-18  8:39   ` Michal Hocko
@ 2021-02-18  9:53     ` Song Liu
  2021-02-18 10:01       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-02-18  9:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API



> On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Thu 18-02-21 08:11:13, Song Liu wrote:
>> 
>> 
>>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
>>> 
>>> Hi everybody,
>>> 
>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
>>> That's normally fine as a system-wide setting, but some applications would 
>>> benefit from a more aggressive approach (as long as they are willing to 
>>> pay for it).
>>> 
>>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
>>> temporarily speeding khugepaged up for the whole system, or sharding its 
>>> work for memory belonging to a certain process, one approach would be to 
>>> allow userspace to induce hugepage collapse.
>>> 
>>> The benefit to this approach would be that this is done in process context 
>>> so its cpu is charged to the process that is inducing the collapse.  
>>> Khugepaged is not involved.
>>> 
>>> Idea was to allow userspace to induce hugepage collapse through the new 
>>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
>>> current or another process for a vectored set of ranges.
>>> 
>>> This could be done through a new process_madvise() mode *or* it could be a 
>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
>>> to be passed.  For example, MADV_F_SYNC.
>>> 
>>> When done, this madvise call would allocate a hugepage on the right node 
>>> and attempt to do the collapse in process context just as khugepaged would 
>>> otherwise do.
>> 
>> This is very interesting idea. One question, IIUC, the user process will 
>> block until all small pages in given ranges are collapsed into THPs.
> 
> Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> there anything else you have in mind?

I was thinking about memory defragmentation when the application asks for
many THPs. Say the application looks like

main()
{
	malloc();
	madvise(HUGE);
	process_madvise();
	
	/* start doing work */
}

IIUC, when process_madvise() finishes, the THPs should be ready. However, 
if defragmentation takes a long time, the process will wait in process_madvise().

Thanks,
Song


> 
>> What 
>> would happen if the memory is so fragmented that we cannot allocate that 
>> many huge pages? Do we need some fail over mechanisms? 
> 
> IIRC khugepaged preallocates pages without holding any locks and I would
> expect the same will be done for madvise as well.
> -- 
> Michal Hocko
> SUSE Labs



* Re: [RFC] Hugepage collapse in process context
  2021-02-18  9:53     ` Song Liu
@ 2021-02-18 10:01       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2021-02-18 10:01 UTC (permalink / raw)
  To: Song Liu
  Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API

On Thu 18-02-21 09:53:25, Song Liu wrote:
> 
> 
> > On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Thu 18-02-21 08:11:13, Song Liu wrote:
> >> 
> >> 
> >>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@google.com> wrote:
> >>> 
> >>> Hi everybody,
> >>> 
> >>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> >>> That's normally fine as a system-wide setting, but some applications would 
> >>> benefit from a more aggressive approach (as long as they are willing to 
> >>> pay for it).
> >>> 
> >>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> >>> temporarily speeding khugepaged up for the whole system, or sharding its 
> >>> work for memory belonging to a certain process, one approach would be to 
> >>> allow userspace to induce hugepage collapse.
> >>> 
> >>> The benefit to this approach would be that this is done in process context 
> >>> so its cpu is charged to the process that is inducing the collapse.  
> >>> Khugepaged is not involved.
> >>> 
> >>> Idea was to allow userspace to induce hugepage collapse through the new 
> >>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> >>> current or another process for a vectored set of ranges.
> >>> 
> >>> This could be done through a new process_madvise() mode *or* it could be a 
> >>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> >>> to be passed.  For example, MADV_F_SYNC.
> >>> 
> >>> When done, this madvise call would allocate a hugepage on the right node 
> >>> and attempt to do the collapse in process context just as khugepaged would 
> >>> otherwise do.
> >> 
> >> This is very interesting idea. One question, IIUC, the user process will 
> >> block until all small pages in given ranges are collapsed into THPs.
> > 
> > Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> > there anything else you have in mind?
> 
> I was thinking about memory defragmentation when the application asks for
> many THPs. Say the application looks like
> 
> main()
> {
> 	malloc();
> 	madvise(HUGE);
> 	process_madvise();
> 	
> 	/* start doing work */
> }
> 
> IIUC, when process_madvise() finishes, the THPs should be ready. However, 
> if defragmentation takes a long time, the process will wait in process_madvise().

OK, I see. The operation is definitely not free, which is to be expected.
You can do the same from a thread which can spend its time collapsing THPs.
There are still internal resources that might block others - e.g. the
above mentioned mmap_sem. We can try hard to reduce the lock hold time but
this is unlikely to be completely free of any interruption of the
workload.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC] Hugepage collapse in process context
  2021-02-17  8:21 ` Michal Hocko
@ 2021-02-18 13:43   ` Vlastimil Babka
  2021-02-18 13:52     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2021-02-18 13:43 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm,
	linux-api, David Hildenbrand

On 2/17/21 9:21 AM, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Tue 16-02-21 20:24:16, David Rientjes wrote:
>> Hi everybody,
>> 
>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
>> That's normally fine as a system-wide setting, but some applications would 
>> benefit from a more aggressive approach (as long as they are willing to 
>> pay for it).
>> 
>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
>> temporarily speeding khugepaged up for the whole system, or sharding its 
>> work for memory belonging to a certain process, one approach would be to 
>> allow userspace to induce hugepage collapse.
>> 
>> The benefit to this approach would be that this is done in process context 
>> so its cpu is charged to the process that is inducing the collapse.  
>> Khugepaged is not involved.
> 
> Yes, this makes a lot of sense to me.
> 
>> Idea was to allow userspace to induce hugepage collapse through the new 
>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
>> current or another process for a vectored set of ranges.
> 
> Yes, madvise sounds like a good fit for the purpose.

Agreed on both points.

>> This could be done through a new process_madvise() mode *or* it could be a 
>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
>> to be passed.  For example, MADV_F_SYNC.
> 
> Would this MADV_F_SYNC be applicable to other madvise modes? Most
> existing madvise modes do not seem to make much sense. We can argue that
> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> sure we want to provide such a strong semantic because it can limit
> future reclaim optimizations.
> 
> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.

I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
MADV_WILLNEED with this semantic? But you are probably more interested in
process_madvise() anyway. There the new flag would make more sense. But there's
also David H.'s proposal for MADV_POPULATE and there might be benefit in
considering both at the same time? Should e.g. MADV_POPULATE with MADV_HUGEPAGE
have the collapse semantics? But would MADV_POPULATE be added to
process_madvise() as well? Just thinking out loud so we don't end up with more
flags than necessary; it's already confusing enough as it is.


* Re: [RFC] Hugepage collapse in process context
  2021-02-18 13:43   ` Vlastimil Babka
@ 2021-02-18 13:52     ` David Hildenbrand
  2021-02-18 22:34         ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2021-02-18 13:52 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko, David Rientjes
  Cc: Alex Shi, Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov,
	Song Liu, Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm,
	linux-api

On 18.02.21 14:43, Vlastimil Babka wrote:
> On 2/17/21 9:21 AM, Michal Hocko wrote:
>> [Cc linux-api]
>>
>> On Tue 16-02-21 20:24:16, David Rientjes wrote:
>>> Hi everybody,
>>>
>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>> That's normally fine as a system-wide setting, but some applications would
>>> benefit from a more aggressive approach (as long as they are willing to
>>> pay for it).
>>>
>>> Instead of adding priorities for eligible ranges of memory to khugepaged,
>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>> work for memory belonging to a certain process, one approach would be to
>>> allow userspace to induce hugepage collapse.
>>>
>>> The benefit to this approach would be that this is done in process context
>>> so its cpu is charged to the process that is inducing the collapse.
>>> Khugepaged is not involved.
>>
>> Yes, this makes a lot of sense to me.
>>
>>> Idea was to allow userspace to induce hugepage collapse through the new
>>> process_madvise() call.  This allows us to collapse hugepages on behalf of
>>> current or another process for a vectored set of ranges.
>>
>> Yes, madvise sounds like a good fit for the purpose.
> 
> Agreed on both points.
> 
>>> This could be done through a new process_madvise() mode *or* it could be a
>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
>>> to be passed.  For example, MADV_F_SYNC.
>>
>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>> existing madvise modes do not seem to make much sense. We can argue that
>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>> sure we want to provide such a strong semantic because it can limit
>> future reclaim optimizations.
>>
>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> 
> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> MADV_WILLNEED with this semantic? But you are probably more interested in
> process_madvise() anyway. There the new flag would make more sense. But there's
> also David H.'s proposal for MADV_POPULATE and there might be benefit in
> considering both at the same time? Should e.g. MADV_POPULATE with MADV_HUGEPAGE
> have the collapse semantics? But would MADV_POPULATE be added to
> process_madvise() as well? Just thinking out loud so we don't end up with more
> flags than necessary, it's already confusing enough as it is.
> 

Note that madvise() eats only a single value, not flags. Combinations as 
you describe are not possible.

Something like MADV_HUGEPAGE_COLLAPSE makes sense to me, provided it does 
not need the mmap lock in write mode and does not modify the actual VMA, 
only the mapping.

-- 
Thanks,

David / dhildenb



* Re: [RFC] Hugepage collapse in process context
  2021-02-18 13:52     ` David Hildenbrand
@ 2021-02-18 22:34         ` David Rientjes
  0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2021-02-18 22:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Michal Hocko, Alex Shi, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api

On Thu, 18 Feb 2021, David Hildenbrand wrote:

> > > > Hi everybody,
> > > > 
> > > > Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> > > > That's normally fine as a system-wide setting, but some applications
> > > > would
> > > > benefit from a more aggressive approach (as long as they are willing to
> > > > pay for it).
> > > > 
> > > > Instead of adding priorities for eligible ranges of memory to
> > > > khugepaged,
> > > > temporarily speeding khugepaged up for the whole system, or sharding its
> > > > work for memory belonging to a certain process, one approach would be to
> > > > allow userspace to induce hugepage collapse.
> > > > 
> > > > The benefit to this approach would be that this is done in process
> > > > context
> > > > so its cpu is charged to the process that is inducing the collapse.
> > > > Khugepaged is not involved.
> > > 
> > > Yes, this makes a lot of sense to me.
> > > 
> > > > Idea was to allow userspace to induce hugepage collapse through the new
> > > > process_madvise() call.  This allows us to collapse hugepages on behalf
> > > > of
> > > > current or another process for a vectored set of ranges.
> > > 
> > > Yes, madvise sounds like a good fit for the purpose.
> > 
> > Agreed on both points.
> > 
> > > > This could be done through a new process_madvise() mode *or* it could be
> > > > a
> > > > flag to MADV_HUGEPAGE since process_madvise() allows for a flag
> > > > parameter
> > > > to be passed.  For example, MADV_F_SYNC.
> > > 
> > > Would this MADV_F_SYNC be applicable to other madvise modes? Most
> > > existing madvise modes do not seem to make much sense. We can argue that
> > > MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> > > sure we want to provide such a strong semantic because it can limit
> > > future reclaim optimizations.
> > > 
> > > To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> > 
> > I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> > MADV_WILLNEED with this semantic? But you are probably more interested in
> > process_madvise() anyway. There the new flag would make more sense. But
> > there's
> > also David H.'s proposal for MADV_POPULATE and there might be benefit in
> > considering both at the same time? Should e.g. MADV_POPULATE with
> > MADV_HUGEPAGE
> > have the collapse semantics? But would MADV_POPULATE be added to
> > process_madvise() as well? Just thinking out loud so we don't end up with
> > more
> > flags than necessary, it's already confusing enough as it is.
> > 
> 
> Note that madvise() eats only a single value, not flags. Combinations as you
> describe are not possible.
> 
> Something MADV_HUGEPAGE_COLLAPSE make sense to me that does not need the mmap
> lock in write and does not modify the actual VMA, only a mapping.
> 

Agreed, and happy to see that there's a general consensus on the 
direction.  The benefit of a new madvise mode is that it can be used with 
plain madvise() as well if you are interested in only a single range of 
your own memory, and then it doesn't need to reconcile with any of the 
already overloaded semantics of MADV_HUGEPAGE.

Otherwise, process_madvise() can be used for other processes and/or 
vectored ranges.

Song's use case for this to prioritize thp usage is very important for us 
as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) + 
madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter 
would allocate the hugepage with khugepaged's gfp mask so it would always 
compact.  But it seems like this would actually be better to use the gfp 
mask that would be used at fault for the vma and left to userspace to 
determine whether that's MADV_HUGEPAGE or not.  Makes sense.

(Userspace could even do madvise(MADV_NOHUGEPAGE) + 
madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but 
otherwise exclude it from khugepaged's consideration if it were inclined.)

Two other minor points:

 - Currently, process_madvise() doesn't use the flags parameter at all so 
   there's the question of whether we need generalized flags that apply to 
   most madvise modes or whether the flags can be specific to the mode 
   being used.  For example, a natural extension of this new mode would be 
   to determine the hugepage size if we were ever to support synchronous 
   collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

 - We haven't discussed the future of khugepaged with this new mode: it 
   seems like we could simply implement khugepaged fully in userspace and 
   remove it from the kernel? :)


* Re: [RFC] Hugepage collapse in process context
  2021-02-18 22:34         ` David Rientjes
  (?)
@ 2021-02-19 16:16         ` Zi Yan
  -1 siblings, 0 replies; 16+ messages in thread
From: Zi Yan @ 2021-02-19 16:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Alex Shi,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Song Liu,
	Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm, linux-api

On 18 Feb 2021, at 17:34, David Rientjes wrote:

> On Thu, 18 Feb 2021, David Hildenbrand wrote:
>
>>>>> Hi everybody,
>>>>>
>>>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>>>> That's normally fine as a system-wide setting, but some applications
>>>>> would
>>>>> benefit from a more aggressive approach (as long as they are willing to
>>>>> pay for it).
>>>>>
>>>>> Instead of adding priorities for eligible ranges of memory to
>>>>> khugepaged,
>>>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>>>> work for memory belonging to a certain process, one approach would be to
>>>>> allow userspace to induce hugepage collapse.
>>>>>
>>>>> The benefit to this approach would be that this is done in process
>>>>> context
>>>>> so its cpu is charged to the process that is inducing the collapse.
>>>>> Khugepaged is not involved.
>>>>
>>>> Yes, this makes a lot of sense to me.
>>>>
>>>>> Idea was to allow userspace to induce hugepage collapse through the new
>>>>> process_madvise() call.  This allows us to collapse hugepages on behalf
>>>>> of
>>>>> current or another process for a vectored set of ranges.
>>>>
>>>> Yes, madvise sounds like a good fit for the purpose.
>>>
>>> Agreed on both points.
>>>
>>>>> This could be done through a new process_madvise() mode *or* it could be
>>>>> a
>>>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag
>>>>> parameter
>>>>> to be passed.  For example, MADV_F_SYNC.
>>>>
>>>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>>>> existing madvise modes do not seem to make much sense. We can argue that
>>>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>>>> sure we want to provide such a strong semantic because it can limit
>>>> future reclaim optimizations.
>>>>
>>>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
>>>
>>> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
>>> MADV_WILLNEED with this semantic? But you are probably more interested in
>>> process_madvise() anyway. There the new flag would make more sense. But
>>> there's
>>> also David H.'s proposal for MADV_POPULATE and there might be benefit in
>>> considering both at the same time? Should e.g. MADV_POPULATE with
>>> MADV_HUGEPAGE
>>> have the collapse semantics? But would MADV_POPULATE be added to
>>> process_madvise() as well? Just thinking out loud so we don't end up with
>>> more
>>> flags than necessary, it's already confusing enough as it is.
>>>
>>
>> Note that madvise() eats only a single value, not flags. Combinations as you
>> describe are not possible.
>>
>> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me that does not need the mmap
>> lock held for writing and does not modify the actual VMA, only a mapping.
>>
>
> Agreed, and happy to see that there's a general consensus for the
> direction.  Benefit of a new madvise mode is that it can be used for
> madvise() as well if you are interested in only a single range of your own
> memory and then it doesn't need to reconcile with any of the already
> overloaded semantics of MADV_HUGEPAGE.
>
> Otherwise, process_madvise() can be used for other processes and/or
> vectored ranges.
>
> Song's use case for this to prioritize thp usage is very important for us
> as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) +
> madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter
> would allocate the hugepage with khugepaged's gfp mask so it would always
> compact.  But it seems like this would actually be better to use the gfp
> mask that would be used at fault for the vma and left to userspace to
> determine whether that's MADV_HUGEPAGE or not.  Makes sense.
>
> (Userspace could even do madvise(MADV_NOHUGEPAGE) +
> madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but
> otherwise exclude it from khugepaged's consideration if it were inclined.)
>
> Two other minor points:
>
>  - Currently, process_madvise() doesn't use the flags parameter at all so
>    there's the question of whether we need generalized flags that apply to
>    most madvise modes or whether the flags can be specific to the mode
>    being used.  For example, a natural extension of this new mode would be
>    to determine the hugepage size if we were ever to support synchronous
>    collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

I am very interested in adding support for sync collapse into 1GB THPs.
Here are my recent patches to support 1GB THP on x86: https://lwn.net/Articles/832881/.
Doing a sync collapse might be the best way of getting 1GB THPs, since
bumping MAX_ORDER is not good for memory hotplug and getting 1GB pages
from CMA regions, which I proposed in my patchset, seems not ideal.

>
>  - We haven't discussed the future of khugepaged with this new mode: it
>    seems like we could simply implement khugepaged fully in userspace and
>    remove it from the kernel? :)

I guess the page collapse code from khugepaged can be preserved and reused
for this madvise hugepage collapse; we just might not need to launch
a kernel daemon to do the work.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-18 22:34         ` David Rientjes
  (?)
  (?)
@ 2021-02-24  9:44         ` Alex Shi
  2021-03-01 20:56             ` David Rientjes
  -1 siblings, 1 reply; 16+ messages in thread
From: Alex Shi @ 2021-02-24  9:44 UTC (permalink / raw)
  To: David Rientjes, David Hildenbrand
  Cc: Vlastimil Babka, Michal Hocko, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Song Liu, Matthew Wilcox, Minchan Kim,
	Chris Kennelly, linux-mm, linux-api



在 2021/2/19 上午6:34, David Rientjes 写道:
> On Thu, 18 Feb 2021, David Hildenbrand wrote:
> 
>>>>> Hi everybody,
>>>>>
>>>>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
>>>>> That's normally fine as a system-wide setting, but some applications
>>>>> would
>>>>> benefit from a more aggressive approach (as long as they are willing to
>>>>> pay for it).
>>>>>
>>>>> Instead of adding priorities for eligible ranges of memory to
>>>>> khugepaged,
>>>>> temporarily speeding khugepaged up for the whole system, or sharding its
>>>>> work for memory belonging to a certain process, one approach would be to
>>>>> allow userspace to induce hugepage collapse.
>>>>>
>>>>> The benefit to this approach would be that this is done in process
>>>>> context
>>>>> so its cpu is charged to the process that is inducing the collapse.
>>>>> Khugepaged is not involved.
>>>>
>>>> Yes, this makes a lot of sense to me.
>>>>
>>>>> Idea was to allow userspace to induce hugepage collapse through the new
>>>>> process_madvise() call.  This allows us to collapse hugepages on behalf
>>>>> of
>>>>> current or another process for a vectored set of ranges.
>>>>
>>>> Yes, madvise sounds like a good fit for the purpose.
>>>
>>> Agreed on both points.
>>>
>>>>> This could be done through a new process_madvise() mode *or* it could be
>>>>> a
>>>>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag
>>>>> parameter
>>>>> to be passed.  For example, MADV_F_SYNC.
>>>>
>>>> Would this MADV_F_SYNC be applicable to other madvise modes? Most
>>>> existing madvise modes do not seem to make much sense. We can argue that
>>>> MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
>>>> sure we want to provide such a strong semantic because it can limit
>>>> future reclaim optimizations.
>>>>
>>>> To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
>>>
>>> I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
>>> MADV_WILLNEED with this semantic? But you are probably more interested in
>>> process_madvise() anyway. There the new flag would make more sense. But
>>> there's
>>> also David H.'s proposal for MADV_POPULATE and there might be benefit in
>>> considering both at the same time? Should e.g. MADV_POPULATE with
>>> MADV_HUGEPAGE
>>> have the collapse semantics? But would MADV_POPULATE be added to
>>> process_madvise() as well? Just thinking out loud so we don't end up with
>>> more
>>> flags than necessary, it's already confusing enough as it is.
>>>
>>
>> Note that madvise() eats only a single value, not flags. Combinations as you
>> describe are not possible.
>>
>> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me that does not need the mmap
>> lock held for writing and does not modify the actual VMA, only a mapping.
>>
> 
> Agreed, and happy to see that there's a general consensus for the 
> direction.  Benefit of a new madvise mode is that it can be used for 
> madvise() as well if you are interested in only a single range of your own 
> memory and then it doesn't need to reconcile with any of the already 
> overloaded semantics of MADV_HUGEPAGE.

It's a good idea to let a process deal with its own THP policy, but
current applications will miss the benefit without changes, and change is
expensive for end users. So besides this work, perhaps a per-memcg collapse
could benefit apps for free; we often deploy apps in cgroups on servers now.

Thanks
Alex

> 
> Otherwise, process_madvise() can be used for other processes and/or 
> vectored ranges.
> 
> Song's use case for this to prioritize thp usage is very important for us 
> as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) + 
> madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter 
> would allocate the hugepage with khugepaged's gfp mask so it would always 
> compact.  But it seems like this would actually be better to use the gfp 
> mask that would be used at fault for the vma and left to userspace to 
> determine whether that's MADV_HUGEPAGE or not.  Makes sense.
> 
> (Userspace could even do madvise(MADV_NOHUGEPAGE) + 
> madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but 
> otherwise exclude it from khugepaged's consideration if it were inclined.)
> 
> Two other minor points:
> 
>  - Currently, process_madvise() doesn't use the flags parameter at all so 
>    there's the question of whether we need generalized flags that apply to 
>    most madvise modes or whether the flags can be specific to the mode 
>    being used.  For example, a natural extension of this new mode would be 
>    to determine the hugepage size if we were ever to support synchronous 
>    collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)
> 
>  - We haven't discussed the future of khugepaged with this new mode: it 
>    seems like we could simply implement khugepaged fully in userspace and 
>    remove it from the kernel? :)
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-02-24  9:44         ` Alex Shi
@ 2021-03-01 20:56             ` David Rientjes
  0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2021-03-01 20:56 UTC (permalink / raw)
  To: Alex Shi
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api

On Wed, 24 Feb 2021, Alex Shi wrote:

> > Agreed, and happy to see that there's a general consensus for the 
> > direction.  Benefit of a new madvise mode is that it can be used for 
> > madvise() as well if you are interested in only a single range of your own 
> > memory and then it doesn't need to reconcile with any of the already 
> > overloaded semantics of MADV_HUGEPAGE.
> 
> It's a good idea to let a process deal with its own THP policy, but
> current applications will miss the benefit without changes, and change is
> expensive for end users. So besides this work, perhaps a per-memcg collapse
> could benefit apps for free; we often deploy apps in cgroups on servers now.
> 

Hi Alex,

I'm not sure that I understand: this MADV_COLLAPSE would be possible for 
process_madvise() as well, passing a vectored set of ranges, so a 
process could do this on behalf of other processes (it's the only way that 
we could theoretically move khugepaged to userspace, although that's not 
an explicit end goal).

How would you see this working with memcg involved?  I had thought this 
was entirely orthogonal to any cgroup.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Hugepage collapse in process context
  2021-03-01 20:56             ` David Rientjes
  (?)
@ 2021-03-04 10:52             ` Alex Shi
  -1 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2021-03-04 10:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Song Liu, Matthew Wilcox,
	Minchan Kim, Chris Kennelly, linux-mm, linux-api



在 2021/3/2 上午4:56, David Rientjes 写道:
> On Wed, 24 Feb 2021, Alex Shi wrote:
> 
>>> Agreed, and happy to see that there's a general consensus for the 
>>> direction.  Benefit of a new madvise mode is that it can be used for 
>>> madvise() as well if you are interested in only a single range of your own 
>>> memory and then it doesn't need to reconcile with any of the already 
>>> overloaded semantics of MADV_HUGEPAGE.
>>
>> It's a good idea to let a process deal with its own THP policy, but
>> current applications will miss the benefit without changes, and change is
>> expensive for end users. So besides this work, perhaps a per-memcg collapse
>> could benefit apps for free; we often deploy apps in cgroups on servers now.
>>
> 
> Hi Alex,
> 
> I'm not sure that I understand: this MADV_COLLAPSE would be possible for 
> process_madvise() as well, passing a vectored set of ranges, so a 
> process could do this on behalf of other processes (it's the only way that 
> we could theoretically move khugepaged to userspace, although that's not 
> an explicit end goal).
> 

Forgive my ignorance, but I still can't figure out how a process_madvise()
caller would fill in the iovec for another process's ranges on a typical system.

> 
> How would you see this working with memcg involved?  I had thought this 
> was entirely orthogonal to any cgroup.
> 

You're right, it's outside of cgroups, which is better. A per-cgroup khugepaged
could be an alternative, but it would require a cgroup and would not be
specific to a target process.

Thanks
Alex
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-03-04 10:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-17  4:24 [RFC] Hugepage collapse in process context David Rientjes
2021-02-17  8:21 ` Michal Hocko
2021-02-18 13:43   ` Vlastimil Babka
2021-02-18 13:52     ` David Hildenbrand
2021-02-18 22:34       ` David Rientjes
2021-02-18 22:34         ` David Rientjes
2021-02-19 16:16         ` Zi Yan
2021-02-24  9:44         ` Alex Shi
2021-03-01 20:56           ` David Rientjes
2021-03-01 20:56             ` David Rientjes
2021-03-04 10:52             ` Alex Shi
2021-02-17 15:49 ` Zi Yan
2021-02-18  8:11 ` Song Liu
2021-02-18  8:39   ` Michal Hocko
2021-02-18  9:53     ` Song Liu
2021-02-18 10:01       ` Michal Hocko
