linux-mm.kvack.org archive mirror
* [LSF/MM TOPIC] wmark based pro-active compaction
@ 2016-12-30 13:14 Michal Hocko
  2016-12-30 14:06 ` Mel Gorman
  2017-03-08 14:56 ` Vlastimil Babka
  0 siblings, 2 replies; 8+ messages in thread
From: Michal Hocko @ 2016-12-30 13:14 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Vlastimil Babka, Mel Gorman, David Rientjes

Hi,
I didn't originally want to send this proposal because Vlastimil is
planning to do some work in this area so I've expected him to send
something similar. But the recent discussion about the THP defrag
options pushed me to send out my thoughts.

So what is the problem? The demand for high-order pages is growing and
that seems to be the general trend. The problem is that while they can
bring a performance benefit, they can be really expensive to allocate,
especially when we enter direct compaction. So we really want to avoid
the expensive path and defer as much as possible to the background. A
huge step forward was kcompactd, introduced by Vlastimil. We are still
not there yet, though, because it might already be quite late when we
wakeup_kcompactd(). The memory might already be fragmented by the time
we get there. Moreover, we do not have any way to actually tell which
orders we care about.

Therefore I believe we need a watermark-based pro-active compaction
which would keep the background compaction busy as long as we have
fewer pages of the configured order than the configured count. kcompactd
should wake up periodically, I think, and check the status so that we
can catch the fragmentation before we get low on memory.
The interface could look something like:
/proc/sys/vm/compact_wmark
time_period order count

There are many details that would have to be solved, of course - e.g. do
not burn cycles pointlessly when we know that no further progress can be
made etc. - but in principle the idea should work.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko
@ 2016-12-30 14:06 ` Mel Gorman
  2017-01-05  9:53   ` Vlastimil Babka
  2017-03-08 14:56 ` Vlastimil Babka
  1 sibling, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2016-12-30 14:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm, Vlastimil Babka, David Rientjes

On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote:
> Hi,
> I didn't originally want to send this proposal because Vlastimil is
> planning to do some work in this area so I've expected him to send
> something similar. But the recent discussion about the THP defrag
> options pushed me to send out my thoughts.
> 
> So what is the problem? The demand for high-order pages is growing and
> that seems to be the general trend. The problem is that while they can
> bring a performance benefit, they can be really expensive to allocate,
> especially when we enter direct compaction. So we really want to avoid
> the expensive path and defer as much as possible to the background. A
> huge step forward was kcompactd, introduced by Vlastimil. We are still
> not there yet, though, because it might already be quite late when we
> wakeup_kcompactd(). The memory might already be fragmented by the time
> we get there. Moreover, we do not have any way to actually tell which
> orders we care about.
> 
> Therefore I believe we need a watermark based pro-active compaction
> which would keep the background compaction busy as long as we have
> less pages of the configured order. kcompactd should wake up
> periodically, I think, and check for the status so that we can catch
> the fragmentation before we get low on memory.
> The interface could look something like:
> /proc/sys/vm/compact_wmark
> time_period order count
> 
> There are many details that would have to be solved of course - e.g. do
> not burn cycles pointlessly when we know that no further progress can be
> made etc... but in principle the idea should work.

I'd be very interested in this. I'd also like to add to the list revisiting
the concept of pre-emptively moving movable pages out of pageblocks stolen
for unmovable pages, to reduce future events that degrade fragmentation.
Before Christmas I was mulling over whether it would be appropriate to have a
workqueue of pageblocks that need "cleaning". This could be either instead
of or in conjunction with wmark-based compaction.

-- 
Mel Gorman
SUSE Labs


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2016-12-30 14:06 ` Mel Gorman
@ 2017-01-05  9:53   ` Vlastimil Babka
  2017-01-05 10:27     ` Michal Hocko
  2017-01-13  7:03     ` Joonsoo Kim
  0 siblings, 2 replies; 8+ messages in thread
From: Vlastimil Babka @ 2017-01-05  9:53 UTC (permalink / raw)
  To: Mel Gorman, Michal Hocko
  Cc: lsf-pc, linux-mm, David Rientjes, Joonsoo Kim, Johannes Weiner

[CC Joonsoo and Johannes]

On 12/30/2016 03:06 PM, Mel Gorman wrote:
> On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote:
>> Hi,
>> I didn't originally want to send this proposal because Vlastimil is
>> planning to do some work in this area so I've expected him to send
>> something similar. But the recent discussion about the THP defrag
>> options pushed me to send out my thoughts.

No problem.

>> So what is the problem? The demand for high-order pages is growing and
>> that seems to be the general trend. The problem is that while they can
>> bring a performance benefit, they can be really expensive to allocate,
>> especially when we enter direct compaction. So we really want to avoid
>> the expensive path and defer as much as possible to the background. A
>> huge step forward was kcompactd, introduced by Vlastimil. We are still
>> not there yet, though, because it might already be quite late when we
>> wakeup_kcompactd(). The memory might already be fragmented by the time
>> we get there.

Right.

>> Moreover we do not have any way to actually tell
>> which orders we do care about.

Who is "we" here? The system admin?

>> Therefore I believe we need a watermark based pro-active compaction
>> which would keep the background compaction busy as long as we have
>> less pages of the configured order.

Again, configured by what, admin? I would rather try to avoid tunables
here, if possible. While THP is a quite well-known example with a stable
order, the pressure for other orders is rather implementation-specific
(drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
mapped stacks, although that example is about a non-costly order). Would
the admin be expected to study the implementation to know which orders
are needed, or to react to page allocation failure reports? Neither sounds
nice.

>> kcompactd should wake up
>> periodically, I think, and check for the status so that we can catch
>> the fragmentation before we get low on memory.
>> The interface could look something like:
>> /proc/sys/vm/compact_wmark
>> time_period order count

IMHO it would be better if the system could auto-tune this, e.g. by
counting high-order alloc failures/needs for direct compaction per order
between wakeups, and trying to bring them to zero.

>> There are many details that would have to be solved of course - e.g. do
>> not burn cycles pointlessly when we know that no further progress can be
>> made etc... but in principle the idea should work.

Yeah, with auto-tuning there are even more inputs to consider and
parameters that would be auto-adjusted based on them. Right now I can
think of:

Inputs
- the per-order "pressure" (e.g. the failures/direct compactions above)
  - ideally somehow including the "importance". That might be the
trickiest part when wanting to avoid tunables. THP failures might be the
least important, allocations with an expensive or no fallback the most
important. Probably not just a simple relation between orders. Hopefully
gfp flags such as __GFP_NORETRY and __GFP_REPEAT can help here? Without
such a metric, everything will easily be dominated by THP pressure.
- recent compaction efficiency (as you mentioned above)

Parameters
- wake up period for kcompactd
- target per-order goals for kcompactd
- lowest efficiency where it's still considered worth to compact?

An important question: how to evaluate this? Finding metrics should be
feasible (improved success rate, % of compaction handled by kcompactd
rather than by direct compaction...), but what are the good testcases?

> I'd be very interested in this. I'd also like to add to the list to revisit
> the concept of pre-emptively moving movable pages from pageblocks stolen for
> unmovable pages to reduce future events that degrade fragmentation. Before
> Christmas I was mulling over whether it would be appropriate to have a
> workqueue of pageblocks that need "cleaning". This could be either instead
> of or in conjunction with wmark-based compaction.

Yes, that could be useful as well.

Ideally I would also revisit the topic of the compaction mechanism (migrate
and free scanners) itself. It's been shown that they usually meet in the
first 1/3 or 1/2 of the zone, which means the rest of the zone is only
defragmented by "plugging free holes" with migrated pages, although it
might actually contain pageblocks more suitable for migrating from than
the first part of the zone. It's also expensive for the free scanner to
actually find free pages, according to the stats.

Some approaches were proposed in recent years, but never got far, as it's
always some kind of a trade-off (this partially goes back to the problem
of evaluation, often limited to stress-highalloc from mmtests):

- a "pivot" based approach where the scanners' starting points change and
aren't always the zone boundaries [1]
- both scanners scan the whole zone moving in the same direction, just
making sure they don't operate on the same pageblock at the same time [2]
- replacing the free scanner by directly taking free pages from the freelist [3]

However, the problem with this subtopic is that it might be too
specialized for the full MM room.

[1] https://lkml.org/lkml/2015/1/19/158
[2] https://lkml.org/lkml/2015/6/24/706
[3] https://lkml.org/lkml/2015/12/3/63


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2017-01-05  9:53   ` Vlastimil Babka
@ 2017-01-05 10:27     ` Michal Hocko
  2017-01-06  8:57       ` Vlastimil Babka
  2017-01-13  7:03     ` Joonsoo Kim
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2017-01-05 10:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, lsf-pc, linux-mm, David Rientjes, Joonsoo Kim,
	Johannes Weiner

On Thu 05-01-17 10:53:59, Vlastimil Babka wrote:
> [CC Joonsoo and Johannes]
> 
> On 12/30/2016 03:06 PM, Mel Gorman wrote:
> > On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote:
> >> Hi,
> >> I didn't originally want to send this proposal because Vlastimil is
> >> planning to do some work in this area so I've expected him to send
> >> something similar. But the recent discussion about the THP defrag
> >> options pushed me to send out my thoughts.
> 
> No problem.
> 
> >> So what is the problem? The demand for high-order pages is growing and
> >> that seems to be the general trend. The problem is that while they can
> >> bring a performance benefit, they can be really expensive to allocate,
> >> especially when we enter direct compaction. So we really want to avoid
> >> the expensive path and defer as much as possible to the background. A
> >> huge step forward was kcompactd, introduced by Vlastimil. We are still
> >> not there yet, though, because it might already be quite late when we
> >> wakeup_kcompactd(). The memory might already be fragmented by the time
> >> we get there.
> 
> Right.
> 
> >> Moreover we do not have any way to actually tell
> >> which orders we do care about.
> 
> Who is "we" here? The system admin?

yes

> >> Therefore I believe we need a watermark based pro-active compaction
> >> which would keep the background compaction busy as long as we have
> >> less pages of the configured order.
> 
> Again, configured by what, admin? I would rather try to avoid tunables
> here, if possible. While THP is quite well known example with stable
> order, the pressure for other orders is rather implementation specific
> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
> mapped stacks, although that example is about non-costly order). Would
> the admin be expected to study the implementation to know which orders
> are needed, or react to page allocation failure reports? Neither sounds
> nice.

That is a good question, but I expect that there are more users than THP
which use stable orders. E.g. the networking stack tends to depend on the
packet size. A tracepoint with some histogram output would tell us what
the distribution of requested orders is.

> >> kcompactd should wake up
> >> periodically, I think, and check for the status so that we can catch
> >> the fragmentation before we get low on memory.
> >> The interface could look something like:
> >> /proc/sys/vm/compact_wmark
> >> time_period order count
> 
> IMHO it would be better if the system could auto-tune this, e.g. by
> counting high-order alloc failures/needs for direct compaction per order
> between wakeups, and trying to bring them to zero.

Auto-tuning is usually preferable; I am just wondering how the admin can
tell what system load price he is still willing to pay. I suspect we will
see a growing number of opportunistic high-order requests over time, and
auto-tuning shouldn't try to accommodate them without any bounds. There
is still some cost/benefit to be evaluated from the system-level point of
view, which I am afraid is hard to achieve from the kcompactd POV.
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2017-01-05 10:27     ` Michal Hocko
@ 2017-01-06  8:57       ` Vlastimil Babka
  0 siblings, 0 replies; 8+ messages in thread
From: Vlastimil Babka @ 2017-01-06  8:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, lsf-pc, linux-mm, David Rientjes, Joonsoo Kim,
	Johannes Weiner

On 01/05/2017 11:27 AM, Michal Hocko wrote:
> On Thu 05-01-17 10:53:59, Vlastimil Babka wrote:
>>>> Therefore I believe we need a watermark based pro-active compaction
>>>> which would keep the background compaction busy as long as we have
>>>> less pages of the configured order.
>>
>> Again, configured by what, admin? I would rather try to avoid tunables
>> here, if possible. While THP is quite well known example with stable
>> order, the pressure for other orders is rather implementation specific
>> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
>> mapped stacks, although that example is about non-costly order). Would
>> the admin be expected to study the implementation to know which orders
>> are needed, or react to page allocation failure reports? Neither sounds
>> nice.
> 
> That is a good question, but I expect that there are more users than THP
> which use stable orders. E.g. the networking stack tends to depend on the
> packet size. A tracepoint with some histogram output would tell us what
> the distribution of requested orders is.

Maybe, but there might also be multiple users of the same order with
different "importance"...

>>>> kcompactd should wake up
>>>> periodically, I think, and check for the status so that we can catch
>>>> the fragmentation before we get low on memory.
>>>> The interface could look something like:
>>>> /proc/sys/vm/compact_wmark
>>>> time_period order count
>>
>> IMHO it would be better if the system could auto-tune this, e.g. by
>> counting high-order alloc failures/needs for direct compaction per order
>> between wakeups, and trying to bring them to zero.
> 
> Auto-tuning is usually preferable; I am just wondering how the admin can
> tell what system load price he is still willing to pay. I suspect we will
> see a growing number of opportunistic high-order requests over time, and
> auto-tuning shouldn't try to accommodate them without any bounds. There
> is still some cost/benefit to be evaluated from the system-level point of
> view, which I am afraid is hard to achieve from the kcompactd POV.

That's why I mentioned that importance should be judged somehow.
Opportunistic requests should be recognizable by their gfp flags, so
hopefully there's a way. I wouldn't mind some general tunable(s) to
express how much effort to give to "important" allocations and
opportunistic ones, but rather not in such implementation-detail form as
"time_period order count".


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2017-01-05  9:53   ` Vlastimil Babka
  2017-01-05 10:27     ` Michal Hocko
@ 2017-01-13  7:03     ` Joonsoo Kim
  2017-01-19 14:18       ` Vlastimil Babka
  1 sibling, 1 reply; 8+ messages in thread
From: Joonsoo Kim @ 2017-01-13  7:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Michal Hocko, lsf-pc, linux-mm, David Rientjes,
	Johannes Weiner

Hello,

I'm also interested in this topic.

lkml.kernel.org/r/1430119421-13536-3-git-send-email-iamjoonsoo.kim@lge.com

On Thu, Jan 05, 2017 at 10:53:59AM +0100, Vlastimil Babka wrote:
> [CC Joonsoo and Johannes]
> 
> On 12/30/2016 03:06 PM, Mel Gorman wrote:
> > On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote:
> >> Hi,
> >> I didn't originally want to send this proposal because Vlastimil is
> >> planning to do some work in this area so I've expected him to send
> >> something similar. But the recent discussion about the THP defrag
> >> options pushed me to send out my thoughts.
> 
> No problem.
> 
> >> So what is the problem? The demand for high-order pages is growing and
> >> that seems to be the general trend. The problem is that while they can
> >> bring a performance benefit, they can be really expensive to allocate,
> >> especially when we enter direct compaction. So we really want to avoid
> >> the expensive path and defer as much as possible to the background. A
> >> huge step forward was kcompactd, introduced by Vlastimil. We are still
> >> not there yet, though, because it might already be quite late when we
> >> wakeup_kcompactd(). The memory might already be fragmented by the time
> >> we get there.
> 
> Right.

Before we talk about pro-active compaction, I'd like to know the
use case that really needs it. For THP, IMHO, it's better not to do
pro-active compaction, because a high-order page made by pro-active
compaction could be broken up before it is used. And a THP page can be
set up later by the THP daemon. The benefit of pro-active compaction
would not compensate for its overhead in this case. I guess that almost
all cases that have a fallback would fall into this category.

For orders lower than the costly order, the system would usually have
such a free page. So, my question is: is pro-active compaction really
needed even if its cost is really high? The reason I ask is that I tested
some patches to do pro-active compaction and found that the cost looks
too high. I heard that someone wants this feature, but I'm not sure they
will use it at this high cost. Anyway, I will post some patches for
pro-active compaction soon.

> 
> >> Moreover we do not have any way to actually tell
> >> which orders we do care about.
> 
> Who is "we" here? The system admin?
> 
> >> Therefore I believe we need a watermark based pro-active compaction
> >> which would keep the background compaction busy as long as we have
> >> less pages of the configured order.
> 
> Again, configured by what, admin? I would rather try to avoid tunables
> here, if possible. While THP is quite well known example with stable
> order, the pressure for other orders is rather implementation specific
> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
> mapped stacks, although that example is about non-costly order). Would
> the admin be expected to study the implementation to know which orders
> are needed, or react to page allocation failure reports? Neither sounds
> nice.

It would be really good if we could auto-tune this. My patches mentioned
above just use the tunables that you don't like. :)

> >> kcompactd should wake up
> >> periodically, I think, and check for the status so that we can catch
> >> the fragmentation before we get low on memory.
> >> The interface could look something like:
> >> /proc/sys/vm/compact_wmark
> >> time_period order count
> 
> IMHO it would be better if the system could auto-tune this, e.g. by
> counting high-order alloc failures/needs for direct compaction per order
> between wakeups, and trying to bring them to zero.
> 
> >> There are many details that would have to be solved of course - e.g. do
> >> not burn cycles pointlessly when we know that no further progress can be
> >> made etc... but in principle the idea should work.
> 
> Yeah with auto-tuning there's even more inputs to consider and
> parameters that would be auto-adjusted based on them. Right now I can
> think of:
> 
> Inputs
> - the per-order "pressure" (e.g. the failures/direct compactions above)
>   - ideally somehow including the "importance". That might be the
> trickiest part when wanting to avoid tunables. THP failures might be
> least important, allocations with expensive or no fallback most
> important. Probably not just simple relation between order. Hopefully
> gfp flags such as __GFP_NORETRY and __GFP_REPEAT can help here? Without
> such metric, everything will easily be dominated by THP pressure.
> - recent compaction efficiency (as you mentioned above)
> 
> Parameters
> - wake up period for kcompactd
> - target per-order goals for kcompactd
> - lowest efficiency where it's still considered worth to compact?
> 
> An important question: how to evaluate this? Metrics should be feasible
> (improved success rate, % of compaction that was handled by kcompactd
> and not direct compaction...), but what are the good testcases?

The use case should be defined first? Anyway, I hope that a new testcase
can be finished in a short time. The stress-highalloc test takes too much
time to test various ideas.

> > I'd be very interested in this. I'd also like to add to the list to revisit
> > the concept of pre-emptively moving movable pages from pageblocks stolen for
> > unmovable pages to reduce future events that degrade fragmentation. Before
> > Christmas I was mulling over whether it would be appropriate to have a
> > workqueue of pageblocks that need "cleaning". This could be either instead
> > of or in conjunction with wmark-based compaction.
> 
> Yes, that could be useful as well.

I tried this one, too. :)

lkml.kernel.org/r/1430119421-13536-3-git-send-email-iamjoonsoo.kim@lge.com

My approach is to maintain a dedicated thread that migrates all movable
pages in a pageblock stolen for an unmovable allocation, to prevent the
future event that an unmovable allocation happens in another movable
pageblock. These patches help to prevent it, but not perfectly, since the
thread sometimes cannot catch up with the allocation speed. With
pro-active compaction, we may prevent it, too.

> 
> Ideally I would also revisit the topic of compaction mechanism (migrate
> and free scanners) itself. It's been shown that they usually meet in the

+1

> 1/3 or 1/2 of zone, which means the rest of the zone is only
> defragmented by "plugging free holes" by migrated pages, although it
> might actually contain pageblocks more suitable for migrating from, than
> the first part of the zone. It's also expensive for the free scanner to
> actually find free pages, according to the stats.

The scalable approach would be [3], since it finds a free page in O(1)
unlike the others, which are O(N).

> 
> Some approaches were proposed in recent years, but never got far as it's
> always some kind of a trade-off (this partially goes back to the problem
> of evaluation, often limited to stress-highalloc from mmtests):
> 
> - "pivot" based approach where scanners' starting point changes and
> isn't always zone boundaries [1]
> - both scanners scan whole zone moving in the same direction, just
> making sure they don't operate on the same pageblock at the same time [2]
> - replacing the free scanner by directly taking free pages from the freelist [3]
> 
> However, the problem with this subtopic is that it might be too much
> specialized for the full MM room.

Right. :)

Thanks.

> 
> [1] https://lkml.org/lkml/2015/1/19/158
> [2] https://lkml.org/lkml/2015/6/24/706
> [3] https://lkml.org/lkml/2015/12/3/63


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2017-01-13  7:03     ` Joonsoo Kim
@ 2017-01-19 14:18       ` Vlastimil Babka
  0 siblings, 0 replies; 8+ messages in thread
From: Vlastimil Babka @ 2017-01-19 14:18 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Michal Hocko, lsf-pc, linux-mm, David Rientjes,
	Johannes Weiner

On 01/13/2017 08:03 AM, Joonsoo Kim wrote:
>>>> So what is the problem? The demand for high-order pages is growing and
>>>> that seems to be the general trend. The problem is that while they can
>>>> bring a performance benefit, they can be really expensive to allocate,
>>>> especially when we enter direct compaction. So we really want to avoid
>>>> the expensive path and defer as much as possible to the background. A
>>>> huge step forward was kcompactd, introduced by Vlastimil. We are still
>>>> not there yet, though, because it might already be quite late when we
>>>> wakeup_kcompactd(). The memory might already be fragmented by the time
>>>> we get there.
>>
>> Right.
> 
> Before we talk about pro-active compaction, I'd like to know the
> use case that really needs it. For THP, IMHO, it's better not to do
> pro-active compaction, because a high-order page made by pro-active
> compaction could be broken up before it is used. And,

I agree that THP should be given lower priority, but wouldn't rule it
out completely.

> a THP page can be set up later by the THP daemon. The benefit of pro-active
> compaction would not compensate for its overhead in this case.

khugepaged can only help in the longer term, but we can still help
shorter-lived processes.

> I guess
> that almost all cases that have a fallback would fall into this category.

Yes, ideally we can derive this info from the GFP flags and prioritize
accordingly.

> For orders lower than the costly order, the system would usually have
> such a free page. So, my question is: is pro-active compaction really
> needed even if its cost is really high? The reason I ask is that I tested
> some patches to do pro-active compaction and found that the cost looks
> too high. I heard that someone wants this feature, but I'm not sure they
> will use it at this high cost. Anyway, I will post some patches for
> pro-active compaction soon.

David Rientjes mentioned their workloads benefit from background
compaction in the discussion about THP's "defrag" setting.

[...]

>> Parameters
>> - wake up period for kcompactd
>> - target per-order goals for kcompactd
>> - lowest efficiency where it's still considered worth to compact?
>>
>> An important question: how to evaluate this? Metrics should be feasible
>> (improved success rate, % of compaction that was handled by kcompactd
>> and not direct compaction...), but what are the good testcases?
> 
> The use case should be defined first? Anyway, I hope that a new testcase
> can be finished in a short time. The stress-highalloc test takes too much
> time to test various ideas.

Yeah, that too. But mainly it's too artificial.

>>
>> Ideally I would also revisit the topic of compaction mechanism (migrate
>> and free scanners) itself. It's been shown that they usually meet in the
> 
> +1
> 
>> 1/3 or 1/2 of zone, which means the rest of the zone is only
>> defragmented by "plugging free holes" by migrated pages, although it
>> might actually contain pageblocks more suitable for migrating from, than
>> the first part of the zone. It's also expensive for the free scanner to
>> actually find free pages, according to the stats.
> 
> The scalable approach would be [3], since it finds a free page in O(1)
> unlike the others, which are O(N).

There is, however, the issue that we need to skip (or potentially isolate
on a private list) free pages that lie in the area we are migrating from,
which is potentially O(N) where N is NR_FREE. This gets worse with
multiple compactors, so we might have to e.g. reuse the pageblock skip
bits to tell others to go away, and rely on too_many_isolated()
or something similar to limit the number of concurrent compactors.

>>
>> Some approaches were proposed in recent years, but never got far as it's
>> always some kind of a trade-off (this partially goes back to the problem
>> of evaluation, often limited to stress-highalloc from mmtests):
>>
>> - "pivot" based approach where scanners' starting point changes and
>> isn't always zone boundaries [1]
>> - both scanners scan whole zone moving in the same direction, just
>> making sure they don't operate on the same pageblock at the same time [2]
>> - replacing the free scanner by directly taking free pages from the freelist [3]
>>
>> However, the problem with this subtopic is that it might be too much
>> specialized for the full MM room.
> 
> Right. :)
> 
> Thanks.
> 
>>
>> [1] https://lkml.org/lkml/2015/1/19/158
>> [2] https://lkml.org/lkml/2015/6/24/706
>> [3] https://lkml.org/lkml/2015/12/3/63
> 


* Re: [LSF/MM TOPIC] wmark based pro-active compaction
  2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko
  2016-12-30 14:06 ` Mel Gorman
@ 2017-03-08 14:56 ` Vlastimil Babka
  1 sibling, 0 replies; 8+ messages in thread
From: Vlastimil Babka @ 2017-03-08 14:56 UTC (permalink / raw)
  To: Michal Hocko, lsf-pc; +Cc: linux-mm, Mel Gorman, David Rientjes, Joonsoo Kim

On 12/30/2016 02:14 PM, Michal Hocko wrote:
> Hi,
> I didn't originally want to send this proposal because Vlastimil is
> planning to do some work in this area, so I expected him to send
> something similar. But the recent discussion about the THP defrag
> options pushed me to send out my thoughts.
> 
> So what is the problem? The demand for high-order pages is growing and
> that seems to be the general trend. The problem is that while they can
> bring a performance benefit, they can be really expensive to allocate,
> especially when we enter direct compaction. So we really want to
> avoid the expensive path and defer as much as possible to the
> background. A huge step forward was kcompactd, introduced by Vlastimil.
> We are still not there yet, though, because it might already be quite
> late when we wakeup_kcompactd(). The memory might already be fragmented
> by the time we get there. Moreover, we do not have any way to actually
> tell which orders we care about.
> 
> Therefore I believe we need a watermark-based pro-active compaction
> which would keep the background compaction busy as long as we have
> fewer pages of the configured order than the configured count.
> kcompactd should wake up periodically, I think, and check the status
> so that we can catch the fragmentation before we get low on memory.
> The interface could look something like:
> /proc/sys/vm/compact_wmark
> time_period order count
> 
> There are many details that would have to be solved, of course - e.g. do
> not burn cycles pointlessly when we know that no further progress can be
> made, etc. - but in principle the idea should work.
 
OK, LSF/MM is near, so I'll post my approach up for discussion. It's in
a very RFC state; I worked on it last year and have now just rebased it
to 4.11-rc1 and updated some comments. Maybe I'll manage to do some
tests before LSF/MM, but no guarantees. Comments welcome.

----8<----
From: Vlastimil Babka <vbabka@suse.cz>
Subject: [RFC] mm: make kcompactd more proactive

Kcompactd activity is currently tied to kswapd - it is woken up when kswapd
goes to sleep, and compacts to make a single high-order page available, of
order that was used to wake up kswapd. This leaves the rest of free pages
fragmented and results in direct compaction when the demand for fresh
high-order pages is higher than a single page per kswapd cycle.

Another extreme would be to let kcompactd compact the whole zone, the same
way as manual compaction from the /proc interface does. This would be
wasteful if the resulting high-order pages would then be split down to base
pages for allocations.

This patch aims to adjust kcompactd's effort through the observed demand for
high-order pages. This is done by hooking into alloc_pages_slowpath() and
counting (for each order > 0) allocation attempts that would pass the order-0
watermarks but don't have the high-order page available. This demand is
(currently) recorded per node and then redistributed per zone in each node
according to their relative sizes.

Kcompactd then uses different termination criteria than direct compaction.
It checks whether, for each order, the recorded number of attempted
allocations would fit within the free pages of that order, or with possible
splitting of higher orders, assuming there would be no allocations of other
orders. This should make kcompactd's effort reflect the high-order demand.

In the worst case, the demand is so high that kcompactd will in fact compact
the whole zone, and it would have to run with a higher frequency than kswapd
to make a larger difference. That possibility can be explored later.
---
 include/linux/compaction.h |   6 ++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            | 165 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c            |  12 ++++
 4 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 0d8415820fc3..b342a80bde17 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -176,6 +176,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
+extern void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac);
 
 #else
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -224,6 +226,10 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 {
 }
 
+static inline void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac)
+{
+}
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..0943849620ae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -478,6 +478,7 @@ struct zone {
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
 	int			compact_order_failed;
+	unsigned int		compact_free_target[MAX_ORDER];
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -635,6 +636,7 @@ typedef struct pglist_data {
 	enum zone_type kcompactd_classzone_idx;
 	wait_queue_head_t kcompactd_wait;
 	struct task_struct *kcompactd;
+	atomic_t compact_free_target[MAX_ORDER];
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
diff --git a/mm/compaction.c b/mm/compaction.c
index 247a7c421014..8c68ca64c670 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -21,6 +21,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -1276,6 +1277,34 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
+static bool kcompactd_zone_balanced(struct zone *zone)
+{
+	unsigned int order;
+	unsigned long sum_nr_free = 0;
+
+	//TODO: we should consider whether kcompactd should give up when
+	//NR_FREE_PAGES drops below some point between low and high wmark,
+	//or somehow scale down the free target
+
+	for (order = MAX_ORDER - 1; order > 0; order--) {
+		unsigned long nr_free;
+
+		nr_free = zone->free_area[order].nr_free;
+		sum_nr_free += nr_free;
+
+		if (sum_nr_free < zone->compact_free_target[order])
+			return false;
+
+		/*
+		 * Each free page of current order can fit two pages of the
+		 * lower order.
+		 */
+		sum_nr_free <<= 1UL;
+	}
+
+	return true;
+}
+
 static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -1315,6 +1344,14 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
 							cc->alloc_flags))
 		return COMPACT_CONTINUE;
 
+	/*
+	 * Compaction that's neither direct nor is_via_compact_memory() has to
+	 * be from kcompactd, which has different criteria.
+	 */
+	if (!cc->direct_compaction)
+		return kcompactd_zone_balanced(zone) ?
+			COMPACT_SUCCESS : COMPACT_CONTINUE;
+
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		struct free_area *area = &zone->free_area[order];
@@ -1869,7 +1906,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	struct zone *zone;
 	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
 
-	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];
 
 		if (!populated_zone(zone))
@@ -1878,11 +1915,130 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
 					classzone_idx) == COMPACT_CONTINUE)
 			return true;
+
+		// TODO: potentially unsuitable due to low free memory
+		if (!kcompactd_zone_balanced(zone))
+			return true;
 	}
 
 	return false;
 }
 
+void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac)
+{
+	struct zone *zone;
+	struct zoneref *zref;
+	// FIXME: too large for stack?
+	nodemask_t nodes_done = NODE_MASK_NONE;
+
+	// FIXME: spread over nodes instead of increasing all?
+	for_each_zone_zonelist_nodemask(zone, zref, ac->zonelist,
+					ac->high_zoneidx, ac->nodemask) {
+		unsigned long mark;
+		int nid = zone_to_nid(zone);
+
+		if (node_isset(nid, nodes_done))
+			continue;
+
+		if (cpusets_enabled() &&
+				(alloc_flags & ALLOC_CPUSET) &&
+				!cpuset_zone_allowed(zone, gfp_mask))
+			continue;
+
+		/* The high-order allocation should succeed on this node */
+		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		if (zone_watermark_ok(zone, order, mark,
+				       ac_classzone_idx(ac), alloc_flags)) {
+			node_set(nid, nodes_done);
+			continue;
+		}
+
+		/*
+		 * High-order allocation wouldn't succeed. If order-0
+		 * allocations of same total size would pass the watermarks,
+		 * we know it's due to fragmentation, and kcompactd trying
+		 * harder could help.
+		 */
+		mark += (1UL << order) - 1;
+		if (zone_watermark_ok(zone, 0, mark, ac_classzone_idx(ac),
+								alloc_flags)) {
+			/*
+			 * TODO: consider prioritizing based on gfp_mask, e.g.
+			 * THP faults are opportunistic and should not result
+			 * in perpetual kcompactd activity. Allocation attempts
+			 * without easy fallback should be more important.
+			 */
+			atomic_inc(&NODE_DATA(nid)->compact_free_target[order]);
+			node_set(nid, nodes_done);
+		}
+	}
+}
+
+static void kcompactd_adjust_free_targets(pg_data_t *pgdat)
+{
+	unsigned long managed_pages = 0;
+	unsigned long high_wmark = 0;
+	int zoneid, order;
+	struct zone *zone;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		zone = &pgdat->node_zones[zoneid];
+
+		if (!populated_zone(zone))
+			continue;
+
+		managed_pages += zone->managed_pages;
+		high_wmark += high_wmark_pages(zone);
+	}
+
+	if (!managed_pages)
+		return;
+
+	for (order = 1; order < MAX_ORDER; order++) {
+		unsigned long target;
+
+		target = atomic_read(&pgdat->compact_free_target[order]);
+
+		/*
+		 * Limit the target by high wmark worth of pages, otherwise
+		 * kcompactd can't achieve it anyway.
+		 */
+		if ((target << order) > high_wmark) {
+			target = high_wmark >> order;
+			atomic_set(&pgdat->compact_free_target[order], target);
+		}
+
+		if (!target)
+			continue;
+
+		/* Distribute the target among zones */
+		for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+			unsigned long zone_target = target;
+
+			zone = &pgdat->node_zones[zoneid];
+
+			if (!populated_zone(zone))
+				continue;
+
+			/* For a single zone on node, take a shortcut */
+			if (managed_pages == zone->managed_pages) {
+				zone->compact_free_target[order] = zone_target;
+				continue;
+			}
+
+			/* Take proportion of zone's page to whole node */
+			zone_target *= zone->managed_pages;
+			/* Round up for remainder of at least 1/2 */
+			zone_target += managed_pages >> 1;
+			zone_target /= managed_pages;
+
+			zone->compact_free_target[order] = zone_target;
+		}
+	}
+}
+
 static void kcompactd_do_work(pg_data_t *pgdat)
 {
 	/*
@@ -1905,7 +2061,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 							cc.classzone_idx);
 	count_compact_event(KCOMPACTD_WAKE);
 
-	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
+	kcompactd_adjust_free_targets(pgdat);
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		int status;
 
 		zone = &pgdat->node_zones[zoneid];
@@ -1915,8 +2073,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (compaction_deferred(zone, cc.order))
 			continue;
 
-		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+		if ((compaction_suitable(zone, cc.order, 0, zoneid) !=
 							COMPACT_CONTINUE)
+					&& kcompactd_zone_balanced(zone))
 			continue;
 
 		cc.nr_freepages = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa64d2ffdc5..740bcb0ac382 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3697,6 +3697,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/*
+	 * If it looks like increased kcompactd effort could have spared
+	 * us from direct compaction (or allocation failure if we cannot
+	 * compact), increase kcompactd's target.
+	 */
+	if (order > 0)
+		kcompactd_inc_free_target(gfp_mask, order, alloc_flags, ac);
+
+	/*
 	 * For costly allocations, try direct compaction first, as it's likely
 	 * that we have enough base pages and don't need to reclaim. Don't try
 	 * that for allocations that are allowed to ignore watermarks, as the
@@ -5946,6 +5954,7 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
  */
 static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
+	int i;
 	enum zone_type j;
 	int nid = pgdat->node_id;
 	int ret;
@@ -5965,6 +5974,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
+	for (i = 0; i < MAX_ORDER; i++)
+		//FIXME: I can't use ATOMIC_INIT, can I?
+		atomic_set(&pgdat->compact_free_target[i], 0);
 #endif
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
-- 
2.12.0




Thread overview: 8+ messages
2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko
2016-12-30 14:06 ` Mel Gorman
2017-01-05  9:53   ` Vlastimil Babka
2017-01-05 10:27     ` Michal Hocko
2017-01-06  8:57       ` Vlastimil Babka
2017-01-13  7:03     ` Joonsoo Kim
2017-01-19 14:18       ` Vlastimil Babka
2017-03-08 14:56 ` Vlastimil Babka
