[BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-08 19:10 ` Alex Thorlton
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-08 19:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

Hey everyone,

I've run into a some frustrating behavior from the khugepaged thread,
that I'm hoping to get sorted out.  It appears that if you pin
khugepaged to a cpuset (i.e. node 0), and it begins scanning/collapsing
pages for a process on a cpuset that doesn't have any memory nodes in
common with kugepaged (i.e. node 1), then the collapsed pages will all
be allocated khugepaged's node (in this case node 0), clearly breaking
the cpuset boundary set up for the process in question.

I'm aware that there are some known issues with khugepaged performing
off-node allocations in certain situations, but I believe this is a bit
of a special circumstance since, in this situation, there's no way for
khugepaged to perform an allocation on the desired node.

The problem really stems from the way that we determine the allowed
memory nodes in get_page_from_freelist.  When we call down to
cpuset_zone_allowed_softwall, we check current->mems_allowed to
determine what nodes we're allowed on.  In the case of khugepaged, we'll
be making allocations for the mm of the process we're collapsing for,
but we'll be checking the mems_allowed of khugepaged, which can
obviously cause some problems.

Is this particular bug a known issue?  I've been trying to come up with
a simple way to fix the bug, but it's a bit difficult since we no longer
have a way to trace back to the task_struct that we're collapsing for
once we've reached get_page_from_freelist.  I'm wondering if we might
want to make the cpuset check higher up in the call-chain and then pass
that nodemask down instead of sending a NULL nodemask, as we end up
doing in many (most?) situations.  I can think of several problems with
that approach as well, but it's all I've come up with so far.

The obvious workaround is to not isolate khugepaged to a cpuset, but
since we're allowed to do so, I think the thread should probably behave
appropriately when pinned to a cpuset.

Any input on this issue is greatly appreciated.  Thanks, guys!

- Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-08 19:10 ` Alex Thorlton
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-08 19:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

Hey everyone,

I've run into a some frustrating behavior from the khugepaged thread,
that I'm hoping to get sorted out.  It appears that if you pin
khugepaged to a cpuset (i.e. node 0), and it begins scanning/collapsing
pages for a process on a cpuset that doesn't have any memory nodes in
common with kugepaged (i.e. node 1), then the collapsed pages will all
be allocated khugepaged's node (in this case node 0), clearly breaking
the cpuset boundary set up for the process in question.

I'm aware that there are some known issues with khugepaged performing
off-node allocations in certain situations, but I believe this is a bit
of a special circumstance since, in this situation, there's no way for
khugepaged to perform an allocation on the desired node.

The problem really stems from the way that we determine the allowed
memory nodes in get_page_from_freelist.  When we call down to
cpuset_zone_allowed_softwall, we check current->mems_allowed to
determine what nodes we're allowed on.  In the case of khugepaged, we'll
be making allocations for the mm of the process we're collapsing for,
but we'll be checking the mems_allowed of khugepaged, which can
obviously cause some problems.

Is this particular bug a known issue?  I've been trying to come up with
a simple way to fix the bug, but it's a bit difficult since we no longer
have a way to trace back to the task_struct that we're collapsing for
once we've reached get_page_from_freelist.  I'm wondering if we might
want to make the cpuset check higher up in the call-chain and then pass
that nodemask down instead of sending a NULL nodemask, as we end up
doing in many (most?) situations.  I can think of several problems with
that approach as well, but it's all I've come up with so far.

The obvious workaround is to not isolate khugepaged to a cpuset, but
since we're allowed to do so, I think the thread should probably behave
appropriately when pinned to a cpuset.

Any input on this issue is greatly appreciated.  Thanks, guys!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-08 19:10 ` Alex Thorlton
@ 2014-10-10  9:20   ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-10  9:20 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On Wed, Oct 08, 2014 at 02:10:50PM -0500, Alex Thorlton wrote:

> Is this particular bug a known issue?

Its not unexpected for me.

> I've been trying to come up with
> a simple way to fix the bug, but it's a bit difficult since we no longer
> have a way to trace back to the task_struct that we're collapsing for
> once we've reached get_page_from_freelist.  I'm wondering if we might
> want to make the cpuset check higher up in the call-chain and then pass
> that nodemask down instead of sending a NULL nodemask, as we end up
> doing in many (most?) situations.  I can think of several problems with
> that approach as well, but it's all I've come up with so far.
> 
> The obvious workaround is to not isolate khugepaged to a cpuset, but
> since we're allowed to do so, I think the thread should probably behave
> appropriately when pinned to a cpuset.
> 
> Any input on this issue is greatly appreciated.  Thanks, guys!

So for the numa thing we do everything from the affected tasks context.
There was a lot of arguments early on that that could never really work,
but here we are.

Should we convert khugepaged to the same? Drive the whole thing from
task_work? That would make this issue naturally go away.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-10  9:20   ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-10  9:20 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On Wed, Oct 08, 2014 at 02:10:50PM -0500, Alex Thorlton wrote:

> Is this particular bug a known issue?

Its not unexpected for me.

> I've been trying to come up with
> a simple way to fix the bug, but it's a bit difficult since we no longer
> have a way to trace back to the task_struct that we're collapsing for
> once we've reached get_page_from_freelist.  I'm wondering if we might
> want to make the cpuset check higher up in the call-chain and then pass
> that nodemask down instead of sending a NULL nodemask, as we end up
> doing in many (most?) situations.  I can think of several problems with
> that approach as well, but it's all I've come up with so far.
> 
> The obvious workaround is to not isolate khugepaged to a cpuset, but
> since we're allowed to do so, I think the thread should probably behave
> appropriately when pinned to a cpuset.
> 
> Any input on this issue is greatly appreciated.  Thanks, guys!

So for the numa thing we do everything from the affected tasks context.
There was a lot of arguments early on that that could never really work,
but here we are.

Should we convert khugepaged to the same? Drive the whole thing from
task_work? That would make this issue naturally go away.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-10  9:20   ` Peter Zijlstra
@ 2014-10-10 18:56     ` Alex Thorlton
  -1 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-10 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> So for the numa thing we do everything from the affected tasks context.
> There was a lot of arguments early on that that could never really work,
> but here we are.
>
> Should we convert khugepaged to the same? Drive the whole thing from
> task_work? That would make this issue naturally go away.

That seems like a reasonable idea to me, but that will change the way
that the compaction scans work right now, by quite a bit.  As I'm sure
you're aware, the way it works now is we tack our mm onto the
khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
other ways to get on there - I can't remember), then when khugepaged
wakes up it scans through each mm on the list until it hits the maximum
number of pages to scan on each pass.

If we move the compaction scan over to a task_work style function, we'll
only be able to scan the one task's mm at a time.  While the underlying
compaction infrastructure can function more or less the same, the timing
of when these scans occur, and exactly what the scans cover, will have
to change.  If we go for the most rudimentary approach, the scans will
occur each time a thread is about to return to userland after faulting
in a THP (we'll just replace the khugepaged_enter call with a
task_work_add), and will cover the mm for the current task.  A slightly
more advanced approach would involve a timer to ensure that scans don't
occur too often, as is currently handled by
khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
the fact that we'll lose the multi-mm scanning functionality our
khugepaged_scan list provides, but maybe that's not a huge issue.

Before I run off and start writing patches, here's a brief summary of
what I think we could do here:

1) Dissolve the khugepaged thread and related structs/timers (I'm
   expecting some backlash on this one).
2) Replace khugepged_enter calls with calls to task_work_add(work,
   our_new_scan_function) - new scan function will look almost exactly
   like khugepaged_scan_mm_slot.
3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
   checked during/before our_new_scan_function to ensure that we're not
   scanning more often than necessary.  Also, set up progress markers to
   limit the number of pages scanned in a single pass.

By doing this, scans will get triggered each time a thread that has
faulted THPs is about to return to userland execution, throttled by our
new timer/progress indicators.  The major benefit here is that scans
will now occur in the desired task's context.

Let me know if you anybody sees any major flaws in this approach.

Thanks a lot for your input, Peter!

- Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-10 18:56     ` Alex Thorlton
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-10 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> So for the numa thing we do everything from the affected tasks context.
> There was a lot of arguments early on that that could never really work,
> but here we are.
>
> Should we convert khugepaged to the same? Drive the whole thing from
> task_work? That would make this issue naturally go away.

That seems like a reasonable idea to me, but that will change the way
that the compaction scans work right now, by quite a bit.  As I'm sure
you're aware, the way it works now is we tack our mm onto the
khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
other ways to get on there - I can't remember), then when khugepaged
wakes up it scans through each mm on the list until it hits the maximum
number of pages to scan on each pass.

If we move the compaction scan over to a task_work style function, we'll
only be able to scan the one task's mm at a time.  While the underlying
compaction infrastructure can function more or less the same, the timing
of when these scans occur, and exactly what the scans cover, will have
to change.  If we go for the most rudimentary approach, the scans will
occur each time a thread is about to return to userland after faulting
in a THP (we'll just replace the khugepaged_enter call with a
task_work_add), and will cover the mm for the current task.  A slightly
more advanced approach would involve a timer to ensure that scans don't
occur too often, as is currently handled by
khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
the fact that we'll lose the multi-mm scanning functionality our
khugepaged_scan list provides, but maybe that's not a huge issue.

Before I run off and start writing patches, here's a brief summary of
what I think we could do here:

1) Dissolve the khugepaged thread and related structs/timers (I'm
   expecting some backlash on this one).
2) Replace khugepged_enter calls with calls to task_work_add(work,
   our_new_scan_function) - new scan function will look almost exactly
   like khugepaged_scan_mm_slot.
3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
   checked during/before our_new_scan_function to ensure that we're not
   scanning more often than necessary.  Also, set up progress markers to
   limit the number of pages scanned in a single pass.

By doing this, scans will get triggered each time a thread that has
faulted THPs is about to return to userland execution, throttled by our
new timer/progress indicators.  The major benefit here is that scans
will now occur in the desired task's context.

Let me know if you anybody sees any major flaws in this approach.

Thanks a lot for your input, Peter!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-10 18:56     ` Alex Thorlton
@ 2014-10-10 21:57       ` Vlastimil Babka
  -1 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2014-10-10 21:57 UTC (permalink / raw)
  To: Alex Thorlton, Peter Zijlstra
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On 10.10.2014 20:56, Alex Thorlton wrote:
> On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
>> So for the numa thing we do everything from the affected tasks context.
>> There was a lot of arguments early on that that could never really work,
>> but here we are.
>>
>> Should we convert khugepaged to the same? Drive the whole thing from
>> task_work? That would make this issue naturally go away.

Mel was suggesting to me few weeks ago that this would be desirable and 
I was
planning to look at this soonish. But if you volunteer, great :)

> If we move the compaction scan over to a task_work style function, we'll
> only be able to scan the one task's mm at a time.  While the underlying
> compaction infrastructure can function more or less the same, the timing
> of when these scans occur, and exactly what the scans cover, will have
> to change.  If we go for the most rudimentary approach, the scans will
> occur each time a thread is about to return to userland after faulting
> in a THP (we'll just replace the khugepaged_enter call with a

I don't understand the motivation of doing this after "faulting in a 
THP"? If you already
have a THP then maybe it's useful to scan for more candidates for a THP, 
but why
would a THP fault be the trigger?

> task_work_add), and will cover the mm for the current task.  A slightly
> more advanced approach would involve a timer to ensure that scans don't
> occur too often, as is currently handled by

Maybe not a timer, but just a timestamp of last scan and the "wait time" 
to compare
against current timestamp. The wait time could be extended and reduced 
based on
how successful the scanner was in the recent past (somewhat similar to the
deferred compaction mechanism).

> khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> the fact that we'll lose the multi-mm scanning functionality our
> khugepaged_scan list provides, but maybe that's not a huge issue.

It should be actually a benefit, as you can tune the scanning frequency 
per mm,
to avoid useless scaning (see above) and also you don't scan tasks that 
are sleeping
at all.

> Before I run off and start writing patches, here's a brief summary of
> what I think we could do here:
>
> 1) Dissolve the khugepaged thread and related structs/timers (I'm
>     expecting some backlash on this one).
> 2) Replace khugepged_enter calls with calls to task_work_add(work,
>     our_new_scan_function) - new scan function will look almost exactly
>     like khugepaged_scan_mm_slot.
> 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
>     checked during/before our_new_scan_function to ensure that we're not
>     scanning more often than necessary.  Also, set up progress markers to
>     limit the number of pages scanned in a single pass.

Hm tuning the "number of pages in single pass" could be another way to 
change
the scanning frequency based on recent history.

> By doing this, scans will get triggered each time a thread that has
> faulted THPs is about to return to userland execution, throttled by our
> new timer/progress indicators.  The major benefit here is that scans
> will now occur in the desired task's context.
>
> Let me know if you anybody sees any major flaws in this approach.

Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
similarly enough that a single scanner could handle both the NUMA and THP
bits to save time?

Vlastimil

> Thanks a lot for your input, Peter!
>
> - Alex
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-10 21:57       ` Vlastimil Babka
  0 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2014-10-10 21:57 UTC (permalink / raw)
  To: Alex Thorlton, Peter Zijlstra
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On 10.10.2014 20:56, Alex Thorlton wrote:
> On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
>> So for the numa thing we do everything from the affected tasks context.
>> There was a lot of arguments early on that that could never really work,
>> but here we are.
>>
>> Should we convert khugepaged to the same? Drive the whole thing from
>> task_work? That would make this issue naturally go away.

Mel was suggesting to me few weeks ago that this would be desirable and 
I was
planning to look at this soonish. But if you volunteer, great :)

> If we move the compaction scan over to a task_work style function, we'll
> only be able to scan the one task's mm at a time.  While the underlying
> compaction infrastructure can function more or less the same, the timing
> of when these scans occur, and exactly what the scans cover, will have
> to change.  If we go for the most rudimentary approach, the scans will
> occur each time a thread is about to return to userland after faulting
> in a THP (we'll just replace the khugepaged_enter call with a

I don't understand the motivation of doing this after "faulting in a 
THP"? If you already
have a THP then maybe it's useful to scan for more candidates for a THP, 
but why
would a THP fault be the trigger?

> task_work_add), and will cover the mm for the current task.  A slightly
> more advanced approach would involve a timer to ensure that scans don't
> occur too often, as is currently handled by

Maybe not a timer, but just a timestamp of last scan and the "wait time" 
to compare
against current timestamp. The wait time could be extended and reduced 
based on
how successful the scanner was in the recent past (somewhat similar to the
deferred compaction mechanism).

> khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> the fact that we'll lose the multi-mm scanning functionality our
> khugepaged_scan list provides, but maybe that's not a huge issue.

It should be actually a benefit, as you can tune the scanning frequency 
per mm,
to avoid useless scaning (see above) and also you don't scan tasks that 
are sleeping
at all.

> Before I run off and start writing patches, here's a brief summary of
> what I think we could do here:
>
> 1) Dissolve the khugepaged thread and related structs/timers (I'm
>     expecting some backlash on this one).
> 2) Replace khugepged_enter calls with calls to task_work_add(work,
>     our_new_scan_function) - new scan function will look almost exactly
>     like khugepaged_scan_mm_slot.
> 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
>     checked during/before our_new_scan_function to ensure that we're not
>     scanning more often than necessary.  Also, set up progress markers to
>     limit the number of pages scanned in a single pass.

Hm tuning the "number of pages in single pass" could be another way to 
change
the scanning frequency based on recent history.

> By doing this, scans will get triggered each time a thread that has
> faulted THPs is about to return to userland execution, throttled by our
> new timer/progress indicators.  The major benefit here is that scans
> will now occur in the desired task's context.
>
> Let me know if you anybody sees any major flaws in this approach.

Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
similarly enough that a single scanner could handle both the NUMA and THP
bits to save time?

Vlastimil

> Thanks a lot for your input, Peter!
>
> - Alex
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-08 19:10 ` Alex Thorlton
@ 2014-10-14 11:48   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2014-10-14 11:48 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Wed, Oct 08, 2014 at 02:10:50PM -0500, Alex Thorlton wrote:
> Hey everyone,
> 
> I've run into a some frustrating behavior from the khugepaged thread,
> that I'm hoping to get sorted out.  It appears that if you pin
> khugepaged to a cpuset (i.e. node 0),

Why whould you want to pin khugpeaged? Is there a valid use-case?
Looks like userspace shoots to its leg.

> and it begins scanning/collapsing pages for a process on a cpuset that
> doesn't have any memory nodes in common with kugepaged (i.e. node 1),
> then the collapsed pages will all be allocated khugepaged's node (in
> this case node 0), clearly breaking the cpuset boundary set up for the
> process in question.
> 
> I'm aware that there are some known issues with khugepaged performing
> off-node allocations in certain situations, but I believe this is a bit
> of a special circumstance since, in this situation, there's no way for
> khugepaged to perform an allocation on the desired node.
> 
> The problem really stems from the way that we determine the allowed
> memory nodes in get_page_from_freelist.  When we call down to
> cpuset_zone_allowed_softwall, we check current->mems_allowed to
> determine what nodes we're allowed on.  In the case of khugepaged, we'll
> be making allocations for the mm of the process we're collapsing for,
> but we'll be checking the mems_allowed of khugepaged, which can
> obviously cause some problems.

Is there a reason why we should respect cpuset limitation for kernel
threads?

Should we bypass cpuset for PF_KTHREAD completely?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 736d8e1b6381..03a74878ad46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1960,6 +1960,9 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 zonelist_scan:
        zonelist_rescan = false;
 
+       /* Bypass cpuset limitation if allocate from kernel thread context */
+       if (current->flags & PF_KTHREAD)
+               alloc_flags &= ~ALLOC_CPUSET;
        /*
         * Scan zonelist, looking for a zone with enough free.
         * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-14 11:48   ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2014-10-14 11:48 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Wed, Oct 08, 2014 at 02:10:50PM -0500, Alex Thorlton wrote:
> Hey everyone,
> 
> I've run into a some frustrating behavior from the khugepaged thread,
> that I'm hoping to get sorted out.  It appears that if you pin
> khugepaged to a cpuset (i.e. node 0),

Why whould you want to pin khugpeaged? Is there a valid use-case?
Looks like userspace shoots to its leg.

> and it begins scanning/collapsing pages for a process on a cpuset that
> doesn't have any memory nodes in common with kugepaged (i.e. node 1),
> then the collapsed pages will all be allocated khugepaged's node (in
> this case node 0), clearly breaking the cpuset boundary set up for the
> process in question.
> 
> I'm aware that there are some known issues with khugepaged performing
> off-node allocations in certain situations, but I believe this is a bit
> of a special circumstance since, in this situation, there's no way for
> khugepaged to perform an allocation on the desired node.
> 
> The problem really stems from the way that we determine the allowed
> memory nodes in get_page_from_freelist.  When we call down to
> cpuset_zone_allowed_softwall, we check current->mems_allowed to
> determine what nodes we're allowed on.  In the case of khugepaged, we'll
> be making allocations for the mm of the process we're collapsing for,
> but we'll be checking the mems_allowed of khugepaged, which can
> obviously cause some problems.

Is there a reason why we should respect cpuset limitation for kernel
threads?

Should we bypass cpuset for PF_KTHREAD completely?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 736d8e1b6381..03a74878ad46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1960,6 +1960,9 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 zonelist_scan:
        zonelist_rescan = false;
 
+       /* Bypass cpuset limitation if allocate from kernel thread context */
+       if (current->flags & PF_KTHREAD)
+               alloc_flags &= ~ALLOC_CPUSET;
        /*
         * Scan zonelist, looking for a zone with enough free.
         * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-14 11:48   ` Kirill A. Shutemov
@ 2014-10-14 14:54     ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-14 14:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 02:48:28PM +0300, Kirill A. Shutemov wrote:

> Why whould you want to pin khugpeaged? Is there a valid use-case?
> Looks like userspace shoots to its leg.

Its just bad design to put so much work in another context. But the
use-case is isolating other cpus.

> Is there a reason why we should respect cpuset limitation for kernel
> threads?

Yes, because we want to allow isolating CPUs from 'random' activity.

> Should we bypass cpuset for PF_KTHREAD completely?

No. That'll break stuff.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-14 14:54     ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-14 14:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 02:48:28PM +0300, Kirill A. Shutemov wrote:

> Why whould you want to pin khugpeaged? Is there a valid use-case?
> Looks like userspace shoots to its leg.

Its just bad design to put so much work in another context. But the
use-case is isolating other cpus.

> Is there a reason why we should respect cpuset limitation for kernel
> threads?

Yes, because we want to allow isolating CPUs from 'random' activity.

> Should we bypass cpuset for PF_KTHREAD completely?

No. That'll break stuff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-10 21:57       ` Vlastimil Babka
@ 2014-10-14 14:58         ` Alex Thorlton
  -1 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-14 14:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Thorlton, Peter Zijlstra, linux-kernel, Andrew Morton,
	Mel Gorman, Rik van Riel, Ingo Molnar, Kirill A. Shutemov,
	Hugh Dickins, Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:57:09PM +0200, Vlastimil Babka wrote:
> On 10.10.2014 20:56, Alex Thorlton wrote:
> >On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> >>So for the numa thing we do everything from the affected tasks context.
> >>There was a lot of arguments early on that that could never really work,
> >>but here we are.
> >>
> >>Should we convert khugepaged to the same? Drive the whole thing from
> >>task_work? That would make this issue naturally go away.
> 
> Mel was suggesting to me few weeks ago that this would be desirable
> and I was
> planning to look at this soonish. But if you volunteer, great :)

Good to know that somebody else has already suggested this- and yes, I'll
volunteer to do the initial footwork. Hoping to have something here in a
few days.

> >If we move the compaction scan over to a task_work style function, we'll
> >only be able to scan the one task's mm at a time.  While the underlying
> >compaction infrastructure can function more or less the same, the timing
> >of when these scans occur, and exactly what the scans cover, will have
> >to change.  If we go for the most rudimentary approach, the scans will
> >occur each time a thread is about to return to userland after faulting
> >in a THP (we'll just replace the khugepaged_enter call with a
> 
> I don't understand the motivation of doing this after "faulting in a
> THP"? If you already
> have a THP then maybe it's useful to scan for more candidates for a
> THP, but why
> would a THP fault be the trigger?

I might be playing it a bit fast and loose with my terminology here.
The curent bit of code that would add an mm to the scan list is in the
THP fault path - see the call to khugepaged_enter in
do_huge_pmd_anonymous_page.  That call doesn't have much to do with a
THP fault, but it does add the faulting mm onto the khugepaged scan
list.  I was just suggesting that this is an appropriate place (at
least initially), to put the code that will add our new scan function to
the numa_work list.

> >task_work_add), and will cover the mm for the current task.  A slightly
> >more advanced approach would involve a timer to ensure that scans don't
> >occur too often, as is currently handled by
> 
> Maybe not a timer, but just a timestamp of last scan and the "wait
> time" to compare
> against current timestamp. The wait time could be extended and
> reduced based on
> how successful the scanner was in the recent past (somewhat similar to the
> deferred compaction mechanism).

That's what I meant :)  Again, I'm probably abusing some terminology
here.

> >khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> >the fact that we'll lose the multi-mm scanning functionality our
> >khugepaged_scan list provides, but maybe that's not a huge issue.
> 
> It should be actually a benefit, as you can tune the scanning
> frequency per mm,
> to avoid useless scaning (see above) and also you don't scan tasks
> that are sleeping
> at all.

I didn't think about the fact that we avoid scanning sleeping tasks.
The per-mm granularity could definitely be plus as well.

> 
> >Before I run off and start writing patches, here's a brief summary of
> >what I think we could do here:
> >
> >1) Dissolve the khugepaged thread and related structs/timers (I'm
> >    expecting some backlash on this one).
> >2) Replace khugepged_enter calls with calls to task_work_add(work,
> >    our_new_scan_function) - new scan function will look almost exactly
> >    like khugepaged_scan_mm_slot.
> >3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
> >    checked during/before our_new_scan_function to ensure that we're not
> >    scanning more often than necessary.  Also, set up progress markers to
> >    limit the number of pages scanned in a single pass.
> 
> Hm tuning the "number of pages in single pass" could be another way
> to change
> the scanning frequency based on recent history.
> 
> >By doing this, scans will get triggered each time a thread that has
> >faulted THPs is about to return to userland execution, throttled by our
> >new timer/progress indicators.  The major benefit here is that scans
> >will now occur in the desired task's context.
> >
> >Let me know if you anybody sees any major flaws in this approach.
> 
> Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
> similarly enough that a single scanner could handle both the NUMA and THP
> bits to save time?

That's a good point, but I wonder if that might make things a bit
unclear when it comes to what task work is being done when.  I like the
idea of the numa_work list because we can just tack individual functions
on there and have things get done sequentially, whenever necessary,
before a task returns to userland.  If we start trying to combine
everything into one task work function we run the risk of trying to
handle too much in one place, since we'd be doing both checks before
returning to userland, even if only one is needed.  I also think this
sort of goes against the idea of having a list of work functions to
perform - if we're going to merge everytng into one location, why not
just have a big task_work() function that gets called every time a task
is returning to userland?  I'm not saying I hate the idea - if we really
save a lot of time by combining stuff into one scan, then I'm not opposed
to it, but I'd like to discuss it thoroughly before we go that route.

Thanks, Vlastimil!

- Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-14 14:58         ` Alex Thorlton
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-14 14:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Thorlton, Peter Zijlstra, linux-kernel, Andrew Morton,
	Mel Gorman, Rik van Riel, Ingo Molnar, Kirill A. Shutemov,
	Hugh Dickins, Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:57:09PM +0200, Vlastimil Babka wrote:
> On 10.10.2014 20:56, Alex Thorlton wrote:
> >On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> >>So for the numa thing we do everything from the affected tasks context.
> >>There was a lot of arguments early on that that could never really work,
> >>but here we are.
> >>
> >>Should we convert khugepaged to the same? Drive the whole thing from
> >>task_work? That would make this issue naturally go away.
> 
> Mel was suggesting to me few weeks ago that this would be desirable
> and I was
> planning to look at this soonish. But if you volunteer, great :)

Good to know that somebody else has already suggested this- and yes, I'll
volunteer to do the initial footwork. Hoping to have something here in a
few days.

> >If we move the compaction scan over to a task_work style function, we'll
> >only be able to scan the one task's mm at a time.  While the underlying
> >compaction infrastructure can function more or less the same, the timing
> >of when these scans occur, and exactly what the scans cover, will have
> >to change.  If we go for the most rudimentary approach, the scans will
> >occur each time a thread is about to return to userland after faulting
> >in a THP (we'll just replace the khugepaged_enter call with a
> 
> I don't understand the motivation of doing this after "faulting in a
> THP"? If you already
> have a THP then maybe it's useful to scan for more candidates for a
> THP, but why
> would a THP fault be the trigger?

I might be playing it a bit fast and loose with my terminology here.
The curent bit of code that would add an mm to the scan list is in the
THP fault path - see the call to khugepaged_enter in
do_huge_pmd_anonymous_page.  That call doesn't have much to do with a
THP fault, but it does add the faulting mm onto the khugepaged scan
list.  I was just suggesting that this is an appropriate place (at
least initially), to put the code that will add our new scan function to
the numa_work list.

> >task_work_add), and will cover the mm for the current task.  A slightly
> >more advanced approach would involve a timer to ensure that scans don't
> >occur too often, as is currently handled by
> 
> Maybe not a timer, but just a timestamp of last scan and the "wait
> time" to compare
> against current timestamp. The wait time could be extended and
> reduced based on
> how successful the scanner was in the recent past (somewhat similar to the
> deferred compaction mechanism).

That's what I meant :)  Again, I'm probably abusing some terminology
here.

> >khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> >the fact that we'll lose the multi-mm scanning functionality our
> >khugepaged_scan list provides, but maybe that's not a huge issue.
> 
> It should be actually a benefit, as you can tune the scanning
> frequency per mm,
> to avoid useless scaning (see above) and also you don't scan tasks
> that are sleeping
> at all.

I didn't think about the fact that we avoid scanning sleeping tasks.
The per-mm granularity could definitely be plus as well.

> 
> >Before I run off and start writing patches, here's a brief summary of
> >what I think we could do here:
> >
> >1) Dissolve the khugepaged thread and related structs/timers (I'm
> >    expecting some backlash on this one).
> >2) Replace khugepged_enter calls with calls to task_work_add(work,
> >    our_new_scan_function) - new scan function will look almost exactly
> >    like khugepaged_scan_mm_slot.
> >3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
> >    checked during/before our_new_scan_function to ensure that we're not
> >    scanning more often than necessary.  Also, set up progress markers to
> >    limit the number of pages scanned in a single pass.
> 
> Hm tuning the "number of pages in single pass" could be another way
> to change
> the scanning frequency based on recent history.
> 
> >By doing this, scans will get triggered each time a thread that has
> >faulted THPs is about to return to userland execution, throttled by our
> >new timer/progress indicators.  The major benefit here is that scans
> >will now occur in the desired task's context.
> >
> >Let me know if you anybody sees any major flaws in this approach.
> 
> Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
> similarly enough that a single scanner could handle both the NUMA and THP
> bits to save time?

That's a good point, but I wonder if that might make things a bit
unclear when it comes to what task work is being done when.  I like the
idea of the numa_work list because we can just tack individual functions
on there and have things get done sequentially, whenever necessary,
before a task returns to userland.  If we start trying to combine
everything into one task work function we run the risk of trying to
handle too much in one place, since we'd be doing both checks before
returning to userland, even if only one is needed.  I also think this
sort of goes against the idea of having a list of work functions to
perform - if we're going to merge everytng into one location, why not
just have a big task_work() function that gets called every time a task
is returning to userland?  I'm not saying I hate the idea - if we really
save a lot of time by combining stuff into one scan, then I'm not opposed
to it, but I'd like to discuss it thoroughly before we go that route.

Thanks, Vlastimil!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-14 14:54     ` Peter Zijlstra
@ 2014-10-14 15:31       ` Rik van Riel
  -1 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2014-10-14 15:31 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On 10/14/2014 10:54 AM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2014 at 02:48:28PM +0300, Kirill A. Shutemov wrote:
>
>> Why whould you want to pin khugpeaged? Is there a valid use-case?
>> Looks like userspace shoots to its leg.
>
> Its just bad design to put so much work in another context. But the
> use-case is isolating other cpus.
>
>> Is there a reason why we should respect cpuset limitation for kernel
>> threads?
>
> Yes, because we want to allow isolating CPUs from 'random' activity.
>
>> Should we bypass cpuset for PF_KTHREAD completely?
>
> No. That'll break stuff.

Agreed on the above.

I suspect the sane thing to do is allow a cpuset
limited task to still allocate pages elsewhere,
if it explicitly asks for it.

Ask for something potentially stupid? Get it :)

In many of the cases where something asks for memory
in a specific location, it has a good reason to do
so, and there's no reason to deny it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-14 15:31       ` Rik van Riel
  0 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2014-10-14 15:31 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On 10/14/2014 10:54 AM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2014 at 02:48:28PM +0300, Kirill A. Shutemov wrote:
>
>> Why whould you want to pin khugpeaged? Is there a valid use-case?
>> Looks like userspace shoots to its leg.
>
> Its just bad design to put so much work in another context. But the
> use-case is isolating other cpus.
>
>> Is there a reason why we should respect cpuset limitation for kernel
>> threads?
>
> Yes, because we want to allow isolating CPUs from 'random' activity.
>
>> Should we bypass cpuset for PF_KTHREAD completely?
>
> No. That'll break stuff.

Agreed on the above.

I suspect the sane thing to do is allow a cpuset
limited task to still allocate pages elsewhere,
if it explicitly asks for it.

Ask for something potentially stupid? Get it :)

In many of the cases where something asks for memory
in a specific location, it has a good reason to do
so, and there's no reason to deny it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-14 14:54     ` Peter Zijlstra
@ 2014-10-14 17:38       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2014-10-14 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 04:54:35PM +0200, Peter Zijlstra wrote:
> > Is there a reason why we should respect cpuset limitation for kernel
> > threads?
> 
> Yes, because we want to allow isolating CPUs from 'random' activity.

Okay, it makes sense for cpus_allowed. But we're talking about
mems_allowed, right?
 
> 
> > Should we bypass cpuset for PF_KTHREAD completely?
> 
> No. That'll break stuff.

Like what?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-14 17:38       ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2014-10-14 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 04:54:35PM +0200, Peter Zijlstra wrote:
> > Is there a reason why we should respect cpuset limitation for kernel
> > threads?
> 
> Yes, because we want to allow isolating CPUs from 'random' activity.

Okay, it makes sense for cpus_allowed. But we're talking about
mems_allowed, right?
 
> 
> > Should we bypass cpuset for PF_KTHREAD completely?
> 
> No. That'll break stuff.

Like what?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-14 17:38       ` Kirill A. Shutemov
@ 2014-10-21 10:17         ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 08:38:37PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 14, 2014 at 04:54:35PM +0200, Peter Zijlstra wrote:
> > > Is there a reason why we should respect cpuset limitation for kernel
> > > threads?
> > 
> > Yes, because we want to allow isolating CPUs from 'random' activity.
> 
> Okay, it makes sense for cpus_allowed. But we're talking about
> mems_allowed, right?
>  
> > 
> > > Should we bypass cpuset for PF_KTHREAD completely?
> > 
> > No. That'll break stuff.
> 
> Like what?

Like using cpusets for what they were designed for? We very much want to
allow moving kernel threads into limited cpusets in order to avoid
perturbing the 'important' work done elsewhere.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-21 10:17         ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 14, 2014 at 08:38:37PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 14, 2014 at 04:54:35PM +0200, Peter Zijlstra wrote:
> > > Is there a reason why we should respect cpuset limitation for kernel
> > > threads?
> > 
> > Yes, because we want to allow isolating CPUs from 'random' activity.
> 
> Okay, it makes sense for cpus_allowed. But we're talking about
> mems_allowed, right?
>  
> > 
> > > Should we bypass cpuset for PF_KTHREAD completely?
> > 
> > No. That'll break stuff.
> 
> Like what?

Like using cpusets for what they were designed for? We very much want to
allow moving kernel threads into limited cpusets in order to avoid
perturbing the 'important' work done elsewhere.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-10 18:56     ` Alex Thorlton
@ 2014-10-21 10:55       ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:55 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 01:56:20PM -0500, Alex Thorlton wrote:
> On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> > So for the numa thing we do everything from the affected tasks context.
> > There was a lot of arguments early on that that could never really work,
> > but here we are.
> >
> > Should we convert khugepaged to the same? Drive the whole thing from
> > task_work? That would make this issue naturally go away.
> 
> That seems like a reasonable idea to me, but that will change the way
> that the compaction scans work right now, by quite a bit.  As I'm sure
> you're aware, the way it works now is we tack our mm onto the
> khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
> other ways to get on there - I can't remember), then when khugepaged
> wakes up it scans through each mm on the list until it hits the maximum
> number of pages to scan on each pass.
> 
> If we move the compaction scan over to a task_work style function, we'll
> only be able to scan the one task's mm at a time.  While the underlying
> compaction infrastructure can function more or less the same, the timing
> of when these scans occur, and exactly what the scans cover, will have
> to change.  If we go for the most rudimentary approach, the scans will
> occur each time a thread is about to return to userland after faulting
> in a THP (we'll just replace the khugepaged_enter call with a
> task_work_add), and will cover the mm for the current task.  A slightly
> more advanced approach would involve a timer to ensure that scans don't
> occur too often, as is currently handled by
> khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> the fact that we'll lose the multi-mm scanning functionality our
> khugepaged_scan list provides, but maybe that's not a huge issue.

Right, can't see that being a problem, in fact its a bonus, because
we'll stop scanning completely if there's no tasks running what so ever,
so we get the power aware thing for free.

Also if you drive it like the numa scanning, you end up scanning tasks
proportional to their runtime, there's no point scanning and collapsing
pages for tasks than hardly ever run anyhow, that only sucks limited
resources (large page allocations) for no win.

> Before I run off and start writing patches, here's a brief summary of
> what I think we could do here:
> 
> 1) Dissolve the khugepaged thread and related structs/timers (I'm
>    expecting some backlash on this one).
> 2) Replace khugepged_enter calls with calls to task_work_add(work,
>    our_new_scan_function) - new scan function will look almost exactly
>    like khugepaged_scan_mm_slot.
> 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
>    checked during/before our_new_scan_function to ensure that we're not
>    scanning more often than necessary.  Also, set up progress markers to
>    limit the number of pages scanned in a single pass.
> 
> By doing this, scans will get triggered each time a thread that has
> faulted THPs is about to return to userland execution, throttled by our
> new timer/progress indicators.  The major benefit here is that scans
> will now occur in the desired task's context.
> 
> Let me know if you anybody sees any major flaws in this approach.

I would suggest you have a look at task_tick_numa(), maybe we can make
something that works for both.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-21 10:55       ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:55 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Mel Gorman, Rik van Riel,
	Ingo Molnar, Kirill A. Shutemov, Hugh Dickins, Bob Liu,
	Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 01:56:20PM -0500, Alex Thorlton wrote:
> On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> > So for the numa thing we do everything from the affected tasks context.
> > There was a lot of arguments early on that that could never really work,
> > but here we are.
> >
> > Should we convert khugepaged to the same? Drive the whole thing from
> > task_work? That would make this issue naturally go away.
> 
> That seems like a reasonable idea to me, but that will change the way
> that the compaction scans work right now, by quite a bit.  As I'm sure
> you're aware, the way it works now is we tack our mm onto the
> khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
> other ways to get on there - I can't remember), then when khugepaged
> wakes up it scans through each mm on the list until it hits the maximum
> number of pages to scan on each pass.
> 
> If we move the compaction scan over to a task_work style function, we'll
> only be able to scan the one task's mm at a time.  While the underlying
> compaction infrastructure can function more or less the same, the timing
> of when these scans occur, and exactly what the scans cover, will have
> to change.  If we go for the most rudimentary approach, the scans will
> occur each time a thread is about to return to userland after faulting
> in a THP (we'll just replace the khugepaged_enter call with a
> task_work_add), and will cover the mm for the current task.  A slightly
> more advanced approach would involve a timer to ensure that scans don't
> occur too often, as is currently handled by
> khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> the fact that we'll lose the multi-mm scanning functionality our
> khugepaged_scan list provides, but maybe that's not a huge issue.

Right, can't see that being a problem, in fact its a bonus, because
we'll stop scanning completely if there's no tasks running what so ever,
so we get the power aware thing for free.

Also if you drive it like the numa scanning, you end up scanning tasks
proportional to their runtime, there's no point scanning and collapsing
pages for tasks than hardly ever run anyhow, that only sucks limited
resources (large page allocations) for no win.

> Before I run off and start writing patches, here's a brief summary of
> what I think we could do here:
> 
> 1) Dissolve the khugepaged thread and related structs/timers (I'm
>    expecting some backlash on this one).
> 2) Replace khugepged_enter calls with calls to task_work_add(work,
>    our_new_scan_function) - new scan function will look almost exactly
>    like khugepaged_scan_mm_slot.
> 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
>    checked during/before our_new_scan_function to ensure that we're not
>    scanning more often than necessary.  Also, set up progress markers to
>    limit the number of pages scanned in a single pass.
> 
> By doing this, scans will get triggered each time a thread that has
> faulted THPs is about to return to userland execution, throttled by our
> new timer/progress indicators.  The major benefit here is that scans
> will now occur in the desired task's context.
> 
> Let me know if you anybody sees any major flaws in this approach.

I would suggest you have a look at task_tick_numa(), maybe we can make
something that works for both.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-10 21:57       ` Vlastimil Babka
@ 2014-10-21 10:59         ` Peter Zijlstra
  -1 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:57:09PM +0200, Vlastimil Babka wrote:
> Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
> similarly enough that a single scanner could handle both the NUMA and THP
> bits to save time?

IIRC the THP thing doesn't need the fault thing, which makes it an
entirely different beast. Then again, we do walk the actual page-tables
through change_protection, but changing that means we have to duplicate
all that code, then again, maybe the current THP stuff already carries
something like that.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-21 10:59         ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Fri, Oct 10, 2014 at 11:57:09PM +0200, Vlastimil Babka wrote:
> Hm I haven't seen the code yet, but is perhaps the NUMA scanning working
> similarly enough that a single scanner could handle both the NUMA and THP
> bits to save time?

IIRC the THP thing doesn't need the fault thing, which makes it an
entirely different beast. Then again, we do walk the actual page-tables
through change_protection, but changing that means we have to duplicate
all that code, then again, maybe the current THP stuff already carries
something like that.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
  2014-10-21 10:55       ` Peter Zijlstra
@ 2014-10-21 16:25         ` Alex Thorlton
  -1 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-21 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 21, 2014 at 12:55:42PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 10, 2014 at 01:56:20PM -0500, Alex Thorlton wrote:
> > On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> > > So for the numa thing we do everything from the affected tasks context.
> > > There was a lot of arguments early on that that could never really work,
> > > but here we are.
> > >
> > > Should we convert khugepaged to the same? Drive the whole thing from
> > > task_work? That would make this issue naturally go away.
> > 
> > That seems like a reasonable idea to me, but that will change the way
> > that the compaction scans work right now, by quite a bit.  As I'm sure
> > you're aware, the way it works now is we tack our mm onto the
> > khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
> > other ways to get on there - I can't remember), then when khugepaged
> > wakes up it scans through each mm on the list until it hits the maximum
> > number of pages to scan on each pass.
> > 
> > If we move the compaction scan over to a task_work style function, we'll
> > only be able to scan the one task's mm at a time.  While the underlying
> > compaction infrastructure can function more or less the same, the timing
> > of when these scans occur, and exactly what the scans cover, will have
> > to change.  If we go for the most rudimentary approach, the scans will
> > occur each time a thread is about to return to userland after faulting
> > in a THP (we'll just replace the khugepaged_enter call with a
> > task_work_add), and will cover the mm for the current task.  A slightly
> > more advanced approach would involve a timer to ensure that scans don't
> > occur too often, as is currently handled by
> > khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> > the fact that we'll lose the multi-mm scanning functionality our
> > khugepaged_scan list provides, but maybe that's not a huge issue.
> 
> Right, can't see that being a problem, in fact its a bonus, because
> we'll stop scanning completely if there's no tasks running what so ever,
> so we get the power aware thing for free.

Good point.  Another plus to runninng the scans from the current task's
context.

> Also if you drive it like the numa scanning, you end up scanning tasks
> proportional to their runtime, there's no point scanning and collapsing
> pages for tasks than hardly ever run anyhow, that only sucks limited
> resources (large page allocations) for no win.

Yep - it's currently driven somewhat similar to numa scanning, but there
might be a bit more overhead to the way I'm doing it now vs. the numa
scans.  What currently happens is __khugepaged_enter makes a call to
task_work_add in the places where it used to add the current tasks mm to
the khugepaged_scan list/mm_slot hash, throttled by a timer similar to
alloc_sleep_millisecs.  The reason I say there may be more overhead is
because we potentially make the call to __khugepaged_enter and check the
timer once per page fault, which could definitely slow things down.

> > Before I run off and start writing patches, here's a brief summary of
> > what I think we could do here:
> > 
> > 1) Dissolve the khugepaged thread and related structs/timers (I'm
> >    expecting some backlash on this one).
> > 2) Replace khugepged_enter calls with calls to task_work_add(work,
> >    our_new_scan_function) - new scan function will look almost exactly
> >    like khugepaged_scan_mm_slot.
> > 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
> >    checked during/before our_new_scan_function to ensure that we're not
> >    scanning more often than necessary.  Also, set up progress markers to
> >    limit the number of pages scanned in a single pass.
> > 
> > By doing this, scans will get triggered each time a thread that has
> > faulted THPs is about to return to userland execution, throttled by our
> > new timer/progress indicators.  The major benefit here is that scans
> > will now occur in the desired task's context.
> > 
> > Let me know if you anybody sees any major flaws in this approach.
> 
> I would suggest you have a look at task_tick_numa(), maybe we can make
> something that works for both.

I originally looked at task_tick_numa when I started working on my new
code, but I think what's happening in there is fundamentally different
from what we're trying to achieve here.  In
task_tick_numa/task_numa_work, we're marking pages as inaccessible so
that they trigger a minor fault later on, i.e. generating some
information that will result in an action later on.  Whereas, from the
page collapse point of view, our task_work function is taking an action
based on some information that was generated during a page fault.  Sort
of an opposite action, right?  Not saying that there isn't a way that
they can work together, but the end goals of the two processes are
different enough that I wonder if it's appropriate to try to merge them
together.

I've got a rough draft of this code up and running now.  I'm gathering
some information on how it performs vs. the standard khugepaged.  Once
I've got all the information put together I'll post the code and the
test results so that we can compare, and start looking into areas that
can be improved.

Thanks, Peter!

- Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
@ 2014-10-21 16:25         ` Alex Thorlton
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Thorlton @ 2014-10-21 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Mel Gorman,
	Rik van Riel, Ingo Molnar, Kirill A. Shutemov, Hugh Dickins,
	Bob Liu, Johannes Weiner, linux-mm

On Tue, Oct 21, 2014 at 12:55:42PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 10, 2014 at 01:56:20PM -0500, Alex Thorlton wrote:
> > On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> > > So for the numa thing we do everything from the affected tasks context.
> > > There was a lot of arguments early on that that could never really work,
> > > but here we are.
> > >
> > > Should we convert khugepaged to the same? Drive the whole thing from
> > > task_work? That would make this issue naturally go away.
> > 
> > That seems like a reasonable idea to me, but that will change the way
> > that the compaction scans work right now, by quite a bit.  As I'm sure
> > you're aware, the way it works now is we tack our mm onto the
> > khugepagd_scan list in do_huge_pmd_anonymous_page (there might be some
> > other ways to get on there - I can't remember), then when khugepaged
> > wakes up it scans through each mm on the list until it hits the maximum
> > number of pages to scan on each pass.
> > 
> > If we move the compaction scan over to a task_work style function, we'll
> > only be able to scan the one task's mm at a time.  While the underlying
> > compaction infrastructure can function more or less the same, the timing
> > of when these scans occur, and exactly what the scans cover, will have
> > to change.  If we go for the most rudimentary approach, the scans will
> > occur each time a thread is about to return to userland after faulting
> > in a THP (we'll just replace the khugepaged_enter call with a
> > task_work_add), and will cover the mm for the current task.  A slightly
> > more advanced approach would involve a timer to ensure that scans don't
> > occur too often, as is currently handled by
> > khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
> > the fact that we'll lose the multi-mm scanning functionality our
> > khugepaged_scan list provides, but maybe that's not a huge issue.
> 
> Right, can't see that being a problem, in fact its a bonus, because
> we'll stop scanning completely if there's no tasks running what so ever,
> so we get the power aware thing for free.

Good point.  Another plus to runninng the scans from the current task's
context.

> Also if you drive it like the numa scanning, you end up scanning tasks
> proportional to their runtime, there's no point scanning and collapsing
> pages for tasks than hardly ever run anyhow, that only sucks limited
> resources (large page allocations) for no win.

Yep - it's currently driven somewhat similar to numa scanning, but there
might be a bit more overhead to the way I'm doing it now vs. the numa
scans.  What currently happens is __khugepaged_enter makes a call to
task_work_add in the places where it used to add the current tasks mm to
the khugepaged_scan list/mm_slot hash, throttled by a timer similar to
alloc_sleep_millisecs.  The reason I say there may be more overhead is
because we potentially make the call to __khugepaged_enter and check the
timer once per page fault, which could definitely slow things down.

> > Before I run off and start writing patches, here's a brief summary of
> > what I think we could do here:
> > 
> > 1) Dissolve the khugepaged thread and related structs/timers (I'm
> >    expecting some backlash on this one).
> > 2) Replace khugepged_enter calls with calls to task_work_add(work,
> >    our_new_scan_function) - new scan function will look almost exactly
> >    like khugepaged_scan_mm_slot.
> > 3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
> >    checked during/before our_new_scan_function to ensure that we're not
> >    scanning more often than necessary.  Also, set up progress markers to
> >    limit the number of pages scanned in a single pass.
> > 
> > By doing this, scans will get triggered each time a thread that has
> > faulted THPs is about to return to userland execution, throttled by our
> > new timer/progress indicators.  The major benefit here is that scans
> > will now occur in the desired task's context.
> > 
> > Let me know if you anybody sees any major flaws in this approach.
> 
> I would suggest you have a look at task_tick_numa(), maybe we can make
> something that works for both.

I originally looked at task_tick_numa when I started working on my new
code, but I think what's happening in there is fundamentally different
from what we're trying to achieve here.  In
task_tick_numa/task_numa_work, we're marking pages as inaccessible so
that they trigger a minor fault later on, i.e. generating some
information that will result in an action later on.  Whereas, from the
page collapse point of view, our task_work function is taking an action
based on some information that was generated during a page fault.  Sort
of an opposite action, right?  Not saying that there isn't a way that
they can work together, but the end goals of the two processes are
different enough that I wonder if it's appropriate to try to merge them
together.

I've got a rough draft of this code up and running now.  I'm gathering
some information on how it performs vs. the standard khugepaged.  Once
I've got all the information put together I'll post the code and the
test results so that we can compare, and start looking into areas that
can be improved.

Thanks, Peter!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-10-21 16:24 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-08 19:10 [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset Alex Thorlton
2014-10-08 19:10 ` Alex Thorlton
2014-10-10  9:20 ` Peter Zijlstra
2014-10-10  9:20   ` Peter Zijlstra
2014-10-10 18:56   ` Alex Thorlton
2014-10-10 18:56     ` Alex Thorlton
2014-10-10 21:57     ` Vlastimil Babka
2014-10-10 21:57       ` Vlastimil Babka
2014-10-14 14:58       ` Alex Thorlton
2014-10-14 14:58         ` Alex Thorlton
2014-10-21 10:59       ` Peter Zijlstra
2014-10-21 10:59         ` Peter Zijlstra
2014-10-21 10:55     ` Peter Zijlstra
2014-10-21 10:55       ` Peter Zijlstra
2014-10-21 16:25       ` Alex Thorlton
2014-10-21 16:25         ` Alex Thorlton
2014-10-14 11:48 ` Kirill A. Shutemov
2014-10-14 11:48   ` Kirill A. Shutemov
2014-10-14 14:54   ` Peter Zijlstra
2014-10-14 14:54     ` Peter Zijlstra
2014-10-14 15:31     ` Rik van Riel
2014-10-14 15:31       ` Rik van Riel
2014-10-14 17:38     ` Kirill A. Shutemov
2014-10-14 17:38       ` Kirill A. Shutemov
2014-10-21 10:17       ` Peter Zijlstra
2014-10-21 10:17         ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.