* [LSF/MM TOPIC] wmark based pro-active compaction @ 2016-12-30 13:14 Michal Hocko 2016-12-30 14:06 ` Mel Gorman 2017-03-08 14:56 ` Vlastimil Babka 0 siblings, 2 replies; 8+ messages in thread
From: Michal Hocko @ 2016-12-30 13:14 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, Vlastimil Babka, Mel Gorman, David Rientjes

Hi,
I didn't originally want to send this proposal because Vlastimil is planning to do some work in this area, so I expected him to send something similar. But the recent discussion about the THP defrag options pushed me to send out my thoughts.

So what is the problem? The demand for high-order pages is growing, and that seems to be the general trend. The problem is that while they can bring a performance benefit, they can be really expensive to allocate, especially when we enter direct compaction. So we really want to avoid the expensive path and defer as much as possible to the background. A huge step forward was kcompactd, introduced by Vlastimil. We are still not there yet, though, because it might already be quite late when we wakeup_kcompactd(). The memory might already be fragmented when we get there. Moreover, we do not have any way to actually tell which orders we care about.

Therefore I believe we need a watermark-based pro-active compaction which would keep the background compaction busy as long as we have fewer pages of the configured order. kcompactd should wake up periodically, I think, and check the status so that we can catch the fragmentation before we get low on memory. The interface could look something like:
/proc/sys/vm/compact_wmark
time_period order count

There are many details that would have to be solved, of course - e.g. do not burn cycles pointlessly when we know that no further progress can be made, etc. - but in principle the idea should work.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko @ 2016-12-30 14:06 ` Mel Gorman 2017-01-05 9:53 ` Vlastimil Babka 2017-03-08 14:56 ` Vlastimil Babka 1 sibling, 1 reply; 8+ messages in thread From: Mel Gorman @ 2016-12-30 14:06 UTC (permalink / raw) To: Michal Hocko; +Cc: lsf-pc, linux-mm, Vlastimil Babka, David Rientjes On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote: > Hi, > I didn't originally want to send this proposal because Vlastimil is > planning to do some work in this area so I've expected him to send > something similar. But the recent discussion about the THP defrag > options pushed me to send out my thoughts. > > So what is the problem? The demand for high order pages is growing and > that seems to be the general trend. The problem is that while they can > bring performance benefit they can get be really expensive to allocate > especially when we enter the direct compaction. So we really want to > prevent from expensive path and defer as much as possible to the > background. A huge step forward was kcompactd introduced by Vlastimil. > We are still not there yet though, because it might be already quite > late when we wakeup_kcompactd(). The memory might be already fragmented > when we hit there. Moreover we do not have any way to actually tell > which orders we do care about. > > Therefore I believe we need a watermark based pro-active compaction > which would keep the background compaction busy as long as we have > less pages of the configured order. kcompactd should wake up > periodically, I think, and check for the status so that we can catch > the fragmentation before we get low on memory. > The interface could look something like: > /proc/sys/vm/compact_wmark > time_period order count > > There are many details that would have to be solved of course - e.g. do > not burn cycles pointlessly when we know that no further progress can be > made etc... 
> but in principle the idea should work.

I'd be very interested in this. I'd also like to add to the list revisiting the concept of pre-emptively moving movable pages from pageblocks stolen for unmovable pages, to reduce future events that degrade fragmentation. Before Christmas I was mulling over whether it would be appropriate to have a workqueue of pageblocks that need "cleaning". This could be either instead of, or in conjunction with, wmark-based compaction.
--
Mel Gorman
SUSE Labs
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2016-12-30 14:06 ` Mel Gorman @ 2017-01-05 9:53 ` Vlastimil Babka 2017-01-05 10:27 ` Michal Hocko 2017-01-13 7:03 ` Joonsoo Kim 0 siblings, 2 replies; 8+ messages in thread From: Vlastimil Babka @ 2017-01-05 9:53 UTC (permalink / raw) To: Mel Gorman, Michal Hocko Cc: lsf-pc, linux-mm, David Rientjes, Joonsoo Kim, Johannes Weiner [CC Joonsoo and Johannes] On 12/30/2016 03:06 PM, Mel Gorman wrote: > On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote: >> Hi, >> I didn't originally want to send this proposal because Vlastimil is >> planning to do some work in this area so I've expected him to send >> something similar. But the recent discussion about the THP defrag >> options pushed me to send out my thoughts. No problem. >> So what is the problem? The demand for high order pages is growing and >> that seems to be the general trend. The problem is that while they can >> bring performance benefit they can get be really expensive to allocate >> especially when we enter the direct compaction. So we really want to >> prevent from expensive path and defer as much as possible to the >> background. A huge step forward was kcompactd introduced by Vlastimil. >> We are still not there yet though, because it might be already quite >> late when we wakeup_kcompactd(). The memory might be already fragmented >> when we hit there. Right. >> Moreover we do not have any way to actually tell >> which orders we do care about. Who is "we" here? The system admin? >> Therefore I believe we need a watermark based pro-active compaction >> which would keep the background compaction busy as long as we have >> less pages of the configured order. Again, configured by what, admin? I would rather try to avoid tunables here, if possible. While THP is quite well known example with stable order, the pressure for other orders is rather implementation specific (drivers, SLAB/SLUB) and may change with kernel versions (e.g. 
virtually mapped stacks, although that example is about non-costly order). Would the admin be expected to study the implementation to know which orders are needed, or react to page allocation failure reports? Neither sounds nice.

>> kcompactd should wake up
>> periodically, I think, and check for the status so that we can catch
>> the fragmentation before we get low on memory.
>> The interface could look something like:
>> /proc/sys/vm/compact_wmark
>> time_period order count

IMHO it would be better if the system could auto-tune this, e.g. by counting high-order alloc failures/needs for direct compaction per order between wakeups, and trying to bring them to zero.

>> There are many details that would have to be solved of course - e.g. do
>> not burn cycles pointlessly when we know that no further progress can be
>> made etc... but in principle the idea should work.

Yeah, with auto-tuning there are even more inputs to consider and parameters that would be auto-adjusted based on them. Right now I can think of:

Inputs
- the per-order "pressure" (e.g. the failures/direct compactions above)
- ideally somehow including the "importance". That might be the trickiest part when wanting to avoid tunables. THP failures might be least important, allocations with expensive or no fallback most important. Probably not just a simple relation between orders. Hopefully gfp flags such as __GFP_NORETRY and __GFP_REPEAT can help here? Without such a metric, everything will easily be dominated by THP pressure.
- recent compaction efficiency (as you mentioned above)

Parameters
- wake up period for kcompactd
- target per-order goals for kcompactd
- lowest efficiency where it's still considered worth compacting?

An important question: how to evaluate this? Metrics should be feasible (improved success rate, % of compaction that was handled by kcompactd and not direct compaction...), but what are the good testcases?

> I'd be very interested in this.
> I'd also like to add to the list to revisit
> the concept of pre-emptively moving movable pages from pageblocks stolen for
> unmovable pages to reduce future events that degrade fragmentation. Before
> the Christmas I was mulling over whether it would be appropriate to have a
> workqueue of pageblocks that need "cleaning". This could be either instead
> of or in conjunction with wmark-based compaction.

Yes, that could be useful as well.

Ideally I would also revisit the topic of the compaction mechanism (migrate and free scanners) itself. It's been shown that they usually meet in the 1/3 or 1/2 of the zone, which means the rest of the zone is only defragmented by "plugging free holes" with migrated pages, although it might actually contain pageblocks more suitable for migrating from than the first part of the zone. It's also expensive for the free scanner to actually find free pages, according to the stats.

Some approaches were proposed in recent years, but never got far as it's always some kind of a trade-off (this partially goes back to the problem of evaluation, often limited to stress-highalloc from mmtests):
- "pivot" based approach where the scanners' starting point changes and isn't always the zone boundaries [1]
- both scanners scan the whole zone moving in the same direction, just making sure they don't operate on the same pageblock at the same time [2]
- replacing the free scanner by directly taking free pages from the freelist [3]

However, the problem with this subtopic is that it might be too specialized for the full MM room.

[1] https://lkml.org/lkml/2015/1/19/158
[2] https://lkml.org/lkml/2015/6/24/706
[3] https://lkml.org/lkml/2015/12/3/63
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2017-01-05 9:53 ` Vlastimil Babka @ 2017-01-05 10:27 ` Michal Hocko 2017-01-06 8:57 ` Vlastimil Babka 2017-01-13 7:03 ` Joonsoo Kim 1 sibling, 1 reply; 8+ messages in thread From: Michal Hocko @ 2017-01-05 10:27 UTC (permalink / raw) To: Vlastimil Babka Cc: Mel Gorman, lsf-pc, linux-mm, David Rientjes, Joonsoo Kim, Johannes Weiner On Thu 05-01-17 10:53:59, Vlastimil Babka wrote: > [CC Joonsoo and Johannes] > > On 12/30/2016 03:06 PM, Mel Gorman wrote: > > On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote: > >> Hi, > >> I didn't originally want to send this proposal because Vlastimil is > >> planning to do some work in this area so I've expected him to send > >> something similar. But the recent discussion about the THP defrag > >> options pushed me to send out my thoughts. > > No problem. > > >> So what is the problem? The demand for high order pages is growing and > >> that seems to be the general trend. The problem is that while they can > >> bring performance benefit they can get be really expensive to allocate > >> especially when we enter the direct compaction. So we really want to > >> prevent from expensive path and defer as much as possible to the > >> background. A huge step forward was kcompactd introduced by Vlastimil. > >> We are still not there yet though, because it might be already quite > >> late when we wakeup_kcompactd(). The memory might be already fragmented > >> when we hit there. > > Right. > > >> Moreover we do not have any way to actually tell > >> which orders we do care about. > > Who is "we" here? The system admin? yes > >> Therefore I believe we need a watermark based pro-active compaction > >> which would keep the background compaction busy as long as we have > >> less pages of the configured order. > > Again, configured by what, admin? I would rather try to avoid tunables > here, if possible. 
> While THP is quite well known example with stable
> order, the pressure for other orders is rather implementation specific
> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
> mapped stacks, although that example is about non-costly order). Would
> the admin be expected to study the implementation to know which orders
> are needed, or react to page allocation failure reports? Neither sounds
> nice.

That is a good question, but I expect that there are more users than THP which use stable orders. E.g. the networking stack tends to depend on the packet size. A tracepoint with some histogram output would tell us what the distribution of requested orders is.

> >> kcompactd should wake up
> >> periodically, I think, and check for the status so that we can catch
> >> the fragmentation before we get low on memory.
> >> The interface could look something like:
> >> /proc/sys/vm/compact_wmark
> >> time_period order count
>
> IMHO it would be better if the system could auto-tune this, e.g. by
> counting high-order alloc failures/needs for direct compaction per order
> between wakeups, and trying to bring them to zero.

Auto-tuning is usually preferable; I am just wondering how the admin can tell what system load price he is still willing to pay. I suspect we will see a growing number of opportunistic high-order requests over time, and auto-tuning shouldn't try to accommodate them without any bounds. There is still some cost/benefit to be evaluated from the system-level point of view, which I am afraid is hard to achieve from the kcompactd POV.
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2017-01-05 10:27 ` Michal Hocko @ 2017-01-06 8:57 ` Vlastimil Babka 0 siblings, 0 replies; 8+ messages in thread From: Vlastimil Babka @ 2017-01-06 8:57 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, lsf-pc, linux-mm, David Rientjes, Joonsoo Kim, Johannes Weiner On 01/05/2017 11:27 AM, Michal Hocko wrote: > On Thu 05-01-17 10:53:59, Vlastimil Babka wrote: >>>> Therefore I believe we need a watermark based pro-active compaction >>>> which would keep the background compaction busy as long as we have >>>> less pages of the configured order. >> >> Again, configured by what, admin? I would rather try to avoid tunables >> here, if possible. While THP is quite well known example with stable >> order, the pressure for other orders is rather implementation specific >> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually >> mapped stacks, although that example is about non-costly order). Would >> the admin be expected to study the implementation to know which orders >> are needed, or react to page allocation failure reports? Neither sounds >> nice. > > That is a good question but I expect that there are more users than THP > which use stable orders. E.g. networking stack tends to depend on the > packet size. A tracepoint with some histogram output would tell us what > is the requested orders distribution. Maybe, but there might be also multiple users of the same order but different "importance"... >>>> kcompactd should wake up >>>> periodically, I think, and check for the status so that we can catch >>>> the fragmentation before we get low on memory. >>>> The interface could look something like: >>>> /proc/sys/vm/compact_wmark >>>> time_period order count >> >> IMHO it would be better if the system could auto-tune this, e.g. by >> counting high-order alloc failures/needs for direct compaction per order >> between wakeups, and trying to bring them to zero. 
> Auto-tuning is usually preferable; I am just wondering how the admin can
> tell what system load price he is still willing to pay. I suspect
> we will see a growing number of opportunistic high-order requests over
> time, and auto-tuning shouldn't try to accommodate them without
> any bounds. There is still some cost/benefit to be evaluated from the
> system-level point of view, which I am afraid is hard to achieve from the
> kcompactd POV.

That's why I mentioned that importance should be judged somehow. Opportunistic requests should be recognizable by their gfp flags, so hopefully there's a way. I wouldn't mind some general tunable(s) to express how much effort to give to "important" allocations and opportunistic ones, but rather not in such an implementation-detail form as "time_period order count".
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2017-01-05 9:53 ` Vlastimil Babka 2017-01-05 10:27 ` Michal Hocko @ 2017-01-13 7:03 ` Joonsoo Kim 2017-01-19 14:18 ` Vlastimil Babka 1 sibling, 1 reply; 8+ messages in thread
From: Joonsoo Kim @ 2017-01-13 7:03 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Mel Gorman, Michal Hocko, lsf-pc, linux-mm, David Rientjes, Johannes Weiner

Hello, I'm also interested in this topic.
lkml.kernel.org/r/1430119421-13536-3-git-send-email-iamjoonsoo.kim@lge.com

On Thu, Jan 05, 2017 at 10:53:59AM +0100, Vlastimil Babka wrote:
> [CC Joonsoo and Johannes]
>
> On 12/30/2016 03:06 PM, Mel Gorman wrote:
> > On Fri, Dec 30, 2016 at 02:14:12PM +0100, Michal Hocko wrote:
> >> Hi,
> >> I didn't originally want to send this proposal because Vlastimil is
> >> planning to do some work in this area so I've expected him to send
> >> something similar. But the recent discussion about the THP defrag
> >> options pushed me to send out my thoughts.
>
> No problem.
>
> >> So what is the problem? The demand for high order pages is growing and
> >> that seems to be the general trend. The problem is that while they can
> >> bring performance benefit they can be really expensive to allocate
> >> especially when we enter the direct compaction. So we really want to
> >> avoid the expensive path and defer as much as possible to the
> >> background. A huge step forward was kcompactd introduced by Vlastimil.
> >> We are still not there yet though, because it might be already quite
> >> late when we wakeup_kcompactd(). The memory might be already fragmented
> >> when we hit there.
>
> Right.

Before we talk about pro-active compaction, I'd like to know the use case that really needs it. For THP, IMHO, it's better not to do pro-active compaction, because a high-order page made by pro-active compaction could be broken before it is used. And a THP page can be set up later by the THP daemon.
The benefit of pro-active compaction would not compensate for its overhead in this case. I guess that almost all cases that have a fallback would fall into this category. For orders lower than the costly order, the system would usually have such free pages anyway. So, my question is: is pro-active compaction really needed even if its cost is really high? The reason I ask is that I tested some patches to do pro-active compaction and found that the cost looks too high. I heard that someone wants this feature, but I'm not sure they will use it at this high cost. Anyway, I will post some patches for pro-active compaction soon.

> >> Moreover we do not have any way to actually tell
> >> which orders we do care about.
>
> Who is "we" here? The system admin?
>
> >> Therefore I believe we need a watermark based pro-active compaction
> >> which would keep the background compaction busy as long as we have
> >> fewer pages of the configured order.
>
> Again, configured by what, admin? I would rather try to avoid tunables
> here, if possible. While THP is quite well known example with stable
> order, the pressure for other orders is rather implementation specific
> (drivers, SLAB/SLUB) and may change with kernel versions (e.g. virtually
> mapped stacks, although that example is about non-costly order). Would
> the admin be expected to study the implementation to know which orders
> are needed, or react to page allocation failure reports? Neither sounds
> nice.

It would be really good if we could auto-tune this. My patches mentioned above just use tunables that you don't like. :)

> >> kcompactd should wake up
> >> periodically, I think, and check for the status so that we can catch
> >> the fragmentation before we get low on memory.
> >> The interface could look something like:
> >> /proc/sys/vm/compact_wmark
> >> time_period order count
>
> IMHO it would be better if the system could auto-tune this, e.g.
by > counting high-order alloc failures/needs for direct compaction per order > between wakeups, and trying to bring them to zero. > > >> There are many details that would have to be solved of course - e.g. do > >> not burn cycles pointlessly when we know that no further progress can be > >> made etc... but in principle the idea show work. > > Yeah with auto-tuning there's even more inputs to consider and > parameters that would be auto-adjusted based on them. Right now I can > think of: > > Inputs > - the per-order "pressure" (e.g. the failures/direct compactions above) > - ideally somehow including the "importance". That might be the > trickiest part when wanting to avoid tunables. THP failures might be > least important, allocations with expensive or no fallback most > important. Probably not just simple relation between order. Hopefully > gfp flags such as __GFP_NORETRY and __GFP_REPEAT can help here? Without > such metric, everything will easily be dominated by THP pressure. > - recent compaction efficiency (as you mentioned above) > > Parameters > - wake up period for kcompactd > - target per-order goals for kcompactd > - lowest efficiency where it's still considered worth to compact? > > An important question: how to evaluate this? Metrics should be feasible > (improved success rate, % of compaction that was handled by kcompactd > and not direct compaction...), but what are the good testcases? Usecase should be defined first? Anyway, I hope that new testcase would be finished in short time. stress-highalloc test takes too much time to test various ideas. > > I'd be very interested in this. I'd also like to add to the list to revisit > > the concept of pre-emptively moving movable pages from pageblocks stolen for > > unmovable pages to reduce future events that degrade fragmentation. Before > > the Christmas I was mulling over whether it would be appropriate to have a > > workqueue of pageblocks that need "cleaning". 
This could be either instead
> > of or in conjunction with wmark-based compaction.
>
> Yes, that could be useful as well.

I tried this one, too. :)
lkml.kernel.org/r/1430119421-13536-3-git-send-email-iamjoonsoo.kim@lge.com

My approach is to maintain a dedicated thread which migrates all movable pages in a pageblock stolen for an unmovable allocation, to prevent the future event of an unmovable allocation happening in yet another movable pageblock. These patches help to prevent it, but not perfectly, since the thread sometimes cannot catch up with the allocation speed. With pro-active compaction, we may prevent it, too.

> Ideally I would also revisit the topic of compaction mechanism (migrate
> and free scanners) itself. It's been shown that they usually meet in the

+1

> 1/3 or 1/2 of zone, which means the rest of the zone is only
> defragmented by "plugging free holes" by migrated pages, although it
> might actually contain pageblocks more suitable for migrating from, than
> the first part of the zone. It's also expensive for the free scanner to
> actually find free pages, according to the stats.

The scalable approach would be [3], since it finds free pages in O(1) unlike the others, which are O(N).

> Some approaches were proposed in recent years, but never got far as it's
> always some kind of a trade-off (this partially goes back to the problem
> of evaluation, often limited to stress-highalloc from mmtests):
>
> - "pivot" based approach where scanners' starting point changes and
> isn't always zone boundaries [1]
> - both scanners scan whole zone moving in the same direction, just
> making sure they don't operate on the same pageblock at the same time [2]
> - replacing free scanner by directly taking free pages from freelist [3]
>
> However, the problem with this subtopic is that it might be too much
> specialized for the full MM room.

Right. :)

Thanks.
>
> [1] https://lkml.org/lkml/2015/1/19/158
> [2] https://lkml.org/lkml/2015/6/24/706
> [3] https://lkml.org/lkml/2015/12/3/63
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2017-01-13 7:03 ` Joonsoo Kim @ 2017-01-19 14:18 ` Vlastimil Babka 0 siblings, 0 replies; 8+ messages in thread From: Vlastimil Babka @ 2017-01-19 14:18 UTC (permalink / raw) To: Joonsoo Kim Cc: Mel Gorman, Michal Hocko, lsf-pc, linux-mm, David Rientjes, Johannes Weiner On 01/13/2017 08:03 AM, Joonsoo Kim wrote: >>>> So what is the problem? The demand for high order pages is growing and >>>> that seems to be the general trend. The problem is that while they can >>>> bring performance benefit they can get be really expensive to allocate >>>> especially when we enter the direct compaction. So we really want to >>>> prevent from expensive path and defer as much as possible to the >>>> background. A huge step forward was kcompactd introduced by Vlastimil. >>>> We are still not there yet though, because it might be already quite >>>> late when we wakeup_kcompactd(). The memory might be already fragmented >>>> when we hit there. >> >> Right. > > Before we talk about pro-active compaction, I'd like to know the > usecase that really needs pro-active compaction. For THP, IMHO, it's > better not to do pro-active compaction, because high-order page made > by pro-active compaction could be broken before it is used. And, I agree that THP should be given lower priority, but wouldn't rule it out completely. > THP page can be setup lately by THP daemon. Benefit of pro-active > compaction would not compensate overhead of it in this case. khugepaged can only help in the longer term, but we can still help shorter-lived processes > I guess > that almost cases that have a fallback would hit this category. Yes, ideally we can derive this info from the GFP flags and prioritize accordingly. > For the order lower than costly order, system would have such a > freepage usually. So, my question is pro-active compaction is really > needed even if it's cost is really high? 
Reason I ask this question is > that I tested some patches to do pro-active compaction and found that > cost looks too much high. I heard that someone want this feature but > I'm not sure they will use it with this high cost. Anyway, I will post > some patches for pro-active compaction, soon. David Rientjes mentioned their workloads benefit from background compaction in the discussion about THP's "defrag" setting. [...] >> Parameters >> - wake up period for kcompactd >> - target per-order goals for kcompactd >> - lowest efficiency where it's still considered worth to compact? >> >> An important question: how to evaluate this? Metrics should be feasible >> (improved success rate, % of compaction that was handled by kcompactd >> and not direct compaction...), but what are the good testcases? > > Usecase should be defined first? Anyway, I hope that new testcase would > be finished in short time. stress-highalloc test takes too much time > to test various ideas. Yeah, that too. But mainly it's too artificial. >> >> Ideally I would also revisit the topic of compaction mechanism (migrate >> and free scanners) itself. It's been shown that they usually meet in the > > +1 > >> 1/3 or 1/2 of zone, which means the rest of the zone is only >> defragmented by "plugging free holes" by migrated pages, although it >> might actually contain pageblocks more suitable for migrating from, than >> the first part of the zone. It's also expensive for the free scanner to >> actually find free pages, according to the stats. > > Scalable approach would be [3] since it finds freepage by O(1) unlike > others that are O(N). There's however the issue that we need to skip (or potentially isolate on a private list) freepages that lie in the area we are migrating from, which is potentially O(N) where N is NR_FREE. This gets worse with multiple compactors so we might have to e.g. 
reuse the pageblock skip bits to indicate to others to go away, and rely on too_many_isolated() or something similar to limit the number of concurrent compactors.

>> Some approaches were proposed in recent years, but never got far as it's
>> always some kind of a trade-off (this partially goes back to the problem
>> of evaluation, often limited to stress-highalloc from mmtests):
>>
>> - "pivot" based approach where scanners' starting point changes and
>> isn't always zone boundaries [1]
>> - both scanners scan whole zone moving in the same direction, just
>> making sure they don't operate on the same pageblock at the same time [2]
>> - replacing free scanner by directly taking free pages from freelist [3]
>>
>> However, the problem with this subtopic is that it might be too much
>> specialized for the full MM room.
>
> Right. :)
>
> Thanks.

>> [1] https://lkml.org/lkml/2015/1/19/158
>> [2] https://lkml.org/lkml/2015/6/24/706
>> [3] https://lkml.org/lkml/2015/12/3/63
* Re: [LSF/MM TOPIC] wmark based pro-active compaction 2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko 2016-12-30 14:06 ` Mel Gorman @ 2017-03-08 14:56 ` Vlastimil Babka 1 sibling, 0 replies; 8+ messages in thread From: Vlastimil Babka @ 2017-03-08 14:56 UTC (permalink / raw) To: Michal Hocko, lsf-pc; +Cc: linux-mm, Mel Gorman, David Rientjes, Joonsoo Kim On 12/30/2016 02:14 PM, Michal Hocko wrote: > Hi, > I didn't originally want to send this proposal because Vlastimil is > planning to do some work in this area so I've expected him to send > something similar. But the recent discussion about the THP defrag > options pushed me to send out my thoughts. > > So what is the problem? The demand for high order pages is growing and > that seems to be the general trend. The problem is that while they can > bring performance benefit they can get be really expensive to allocate > especially when we enter the direct compaction. So we really want to > prevent from expensive path and defer as much as possible to the > background. A huge step forward was kcompactd introduced by Vlastimil. > We are still not there yet though, because it might be already quite > late when we wakeup_kcompactd(). The memory might be already fragmented > when we hit there. Moreover we do not have any way to actually tell > which orders we do care about. > > Therefore I believe we need a watermark based pro-active compaction > which would keep the background compaction busy as long as we have > less pages of the configured order. kcompactd should wake up > periodically, I think, and check for the status so that we can catch > the fragmentation before we get low on memory. > The interface could look something like: > /proc/sys/vm/compact_wmark > time_period order count > > There are many details that would have to be solved of course - e.g. do > not burn cycles pointlessly when we know that no further progress can be > made etc... but in principle the idea show work. 
OK, LSF/MM is near, so I'll post my approach up for discussion. It's in
a very RFC state - I worked on it last year, and now I've just rebased
it to 4.11-rc1 and updated some comments. Maybe I'll manage to do some
tests before LSF/MM, but no guarantees. Comments welcome.

----8<----
From: Vlastimil Babka <vbabka@suse.cz>
Subject: [RFC] mm: make kcompactd more proactive

Kcompactd activity is currently tied to kswapd - it is woken up when
kswapd goes to sleep, and compacts to make a single high-order page
available, of the order that was used to wake up kswapd. This leaves the
rest of free pages fragmented and results in direct compaction when the
demand for fresh high-order pages is higher than a single page per
kswapd cycle.

Another extreme would be to let kcompactd compact the whole zone the
same way as manual compaction from the /proc interface. This would be
wasteful if the resulting high-order pages were immediately split down
to base pages for allocations.

This patch aims to adjust the kcompactd effort through observed demand
for high-order pages. This is done by hooking into alloc_pages_slowpath()
and counting (for each order > 0) allocation attempts that would pass
the order-0 watermarks, but don't have the high-order page available.
This demand is (currently) recorded per node and then redistributed per
zone in each node according to their relative sizes.

Kcompactd then uses different termination criteria than direct
compaction. It checks whether, for each order, the recorded number of
attempted allocations would fit within the free pages of that order, or
with possible splitting of higher orders, assuming there would be no
allocations of other orders. This should make kcompactd's effort reflect
the high-order demand. In the worst case, the demand is so high that
kcompactd will in fact compact the whole zone, and would have to run
with higher frequency than kswapd to make a larger difference. That
possibility can be explored later.
---
 include/linux/compaction.h |   6 ++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            | 165 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c            |  12 ++++
 4 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 0d8415820fc3..b342a80bde17 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -176,6 +176,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
+extern void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+		int alloc_flags, struct alloc_context *ac);

 #else
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -224,6 +226,10 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 {
 }

+static inline void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+		int alloc_flags, struct alloc_context *ac)
+{
+}
 #endif /* CONFIG_COMPACTION */

 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..0943849620ae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -478,6 +478,7 @@ struct zone {
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
 	int			compact_order_failed;
+	unsigned int		compact_free_target[MAX_ORDER];
 #endif

 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -635,6 +636,7 @@ typedef struct pglist_data {
 	enum zone_type kcompactd_classzone_idx;
 	wait_queue_head_t kcompactd_wait;
 	struct task_struct *kcompactd;
+	atomic_t compact_free_target[MAX_ORDER];
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
diff --git a/mm/compaction.c b/mm/compaction.c
index 247a7c421014..8c68ca64c670 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -21,6 +21,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
+#include <linux/cpuset.h>
 #include "internal.h"

 #ifdef CONFIG_COMPACTION
@@ -1276,6 +1277,34 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }

+static bool kcompactd_zone_balanced(struct zone *zone)
+{
+	unsigned int order;
+	unsigned long sum_nr_free = 0;
+
+	//TODO: we should consider whether kcompactd should give up when
+	//NR_FREE_PAGES drops below some point between low and high wmark,
+	//or somehow scale down the free target
+
+	for (order = MAX_ORDER - 1; order > 0; order--) {
+		unsigned long nr_free;
+
+		nr_free = zone->free_area[order].nr_free;
+		sum_nr_free += nr_free;
+
+		if (sum_nr_free < zone->compact_free_target[order])
+			return false;
+
+		/*
+		 * Each free page of current order can fit two pages of the
+		 * lower order.
+		 */
+		sum_nr_free <<= 1UL;
+	}
+
+	return true;
+}
+
 static enum compact_result __compact_finished(struct zone *zone,
 			struct compact_control *cc,
 			const int migratetype)
 {
@@ -1315,6 +1344,14 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
 							cc->alloc_flags))
 		return COMPACT_CONTINUE;

+	/*
+	 * Compaction that's neither direct nor is_via_compact_memory() has to
+	 * be from kcompactd, which has different criteria.
+	 */
+	if (!cc->direct_compaction)
+		return kcompactd_zone_balanced(zone) ?
+				COMPACT_SUCCESS : COMPACT_CONTINUE;
+
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		struct free_area *area = &zone->free_area[order];
@@ -1869,7 +1906,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	struct zone *zone;
 	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;

-	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];

 		if (!populated_zone(zone))
@@ -1878,11 +1915,130 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
 					classzone_idx) == COMPACT_CONTINUE)
 			return true;
+
+		// TODO: potentially unsuitable due to low free memory
+		if (!kcompactd_zone_balanced(zone))
+			return true;
 	}

 	return false;
 }

+void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+		int alloc_flags, struct alloc_context *ac)
+{
+	struct zone *zone;
+	struct zoneref *zref;
+	// FIXME: too large for stack?
+	nodemask_t nodes_done = NODE_MASK_NONE;
+
+	// FIXME: spread over nodes instead of increasing all?
+	for_each_zone_zonelist_nodemask(zone, zref, ac->zonelist,
+					ac->high_zoneidx, ac->nodemask) {
+		unsigned long mark;
+		int nid = zone_to_nid(zone);
+
+		if (node_isset(nid, nodes_done))
+			continue;
+
+		if (cpusets_enabled() &&
+			(alloc_flags & ALLOC_CPUSET) &&
+			!cpuset_zone_allowed(zone, gfp_mask))
+				continue;
+
+		/* The high-order allocation should succeed on this node */
+		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		if (zone_watermark_ok(zone, order, mark,
+					ac_classzone_idx(ac), alloc_flags)) {
+			node_set(nid, nodes_done);
+			continue;
+		}
+
+		/*
+		 * High-order allocation wouldn't succeed. If order-0
+		 * allocations of same total size would pass the watermarks,
+		 * we know it's due to fragmentation, and kcompactd trying
+		 * harder could help.
+		 */
+		mark += (1UL << order) - 1;
+		if (zone_watermark_ok(zone, 0, mark, ac_classzone_idx(ac),
+					alloc_flags)) {
+			/*
+			 * TODO: consider prioritizing based on gfp_mask, e.g.
+			 * THP faults are opportunistic and should not result
+			 * in perpetual kcompactd activity. Allocation attempts
+			 * without easy fallback should be more important.
+			 */
+			atomic_inc(&NODE_DATA(nid)->compact_free_target[order]);
+			node_set(nid, nodes_done);
+		}
+	}
+}
+
+static void kcompactd_adjust_free_targets(pg_data_t *pgdat)
+{
+	unsigned long managed_pages = 0;
+	unsigned long high_wmark = 0;
+	int zoneid, order;
+	struct zone *zone;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		zone = &pgdat->node_zones[zoneid];
+
+		if (!populated_zone(zone))
+			continue;
+
+		managed_pages += zone->managed_pages;
+		high_wmark += high_wmark_pages(zone);
+	}
+
+	if (!managed_pages)
+		return;
+
+	for (order = 1; order < MAX_ORDER; order++) {
+		unsigned long target;
+
+		target = atomic_read(&pgdat->compact_free_target[order]);
+
+		/*
+		 * Limit the target by high wmark worth of pages, otherwise
+		 * kcompactd can't achieve it anyway.
+		 */
+		if ((target << order) > high_wmark) {
+			target = high_wmark >> order;
+			atomic_set(&pgdat->compact_free_target[order], target);
+		}
+
+		if (!target)
+			continue;
+
+		/* Distribute the target among zones */
+		for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+			unsigned long zone_target = target;
+
+			zone = &pgdat->node_zones[zoneid];
+
+			if (!populated_zone(zone))
+				continue;
+
+			/* For a single zone on node, take a shortcut */
+			if (managed_pages == zone->managed_pages) {
+				zone->compact_free_target[order] = zone_target;
+				continue;
+			}

+			/* Take proportion of zone's page to whole node */
+			zone_target *= zone->managed_pages;
+			/* Round up for remainder of at least 1/2 */
+			zone_target += managed_pages >> 1;
+			zone_target /= managed_pages;
+
+			zone->compact_free_target[order] = zone_target;
+		}
+	}
+}
+
 static void kcompactd_do_work(pg_data_t *pgdat)
 {
 	/*
@@ -1905,7 +2061,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 							cc.classzone_idx);
 	count_compact_event(KCOMPACTD_WAKE);

-	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
+	kcompactd_adjust_free_targets(pgdat);
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		int status;

 		zone = &pgdat->node_zones[zoneid];
@@ -1915,8 +2073,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (compaction_deferred(zone, cc.order))
 			continue;

-		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+		if ((compaction_suitable(zone, cc.order, 0, zoneid) !=
 							COMPACT_CONTINUE)
+				&& kcompactd_zone_balanced(zone))
 			continue;

 		cc.nr_freepages = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa64d2ffdc5..740bcb0ac382 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3697,6 +3697,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;

 	/*
+	 * If it looks like increased kcompactd effort could have spared
+	 * us from direct compaction (or allocation failure if we cannot
+	 * compact), increase kcompactd's target.
+	 */
+	if (order > 0)
+		kcompactd_inc_free_target(gfp_mask, order, alloc_flags, ac);
+
+	/*
 	 * For costly allocations, try direct compaction first, as it's likely
 	 * that we have enough base pages and don't need to reclaim. Don't try
 	 * that for allocations that are allowed to ignore watermarks, as the
@@ -5946,6 +5954,7 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
 */
 static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
+	int i;
 	enum zone_type j;
 	int nid = pgdat->node_id;
 	int ret;
@@ -5965,6 +5974,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
+	for (i = 0; i < MAX_ORDER; i++)
+		//FIXME: I can't use ATOMIC_INIT, can I?
+		atomic_set(&pgdat->compact_free_target[i], 0);
 #endif
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
-- 
2.12.0
Thread overview: 8+ messages
2016-12-30 13:14 [LSF/MM TOPIC] wmark based pro-active compaction Michal Hocko
2016-12-30 14:06 ` Mel Gorman
2017-01-05  9:53   ` Vlastimil Babka
2017-01-05 10:27     ` Michal Hocko
2017-01-06  8:57       ` Vlastimil Babka
2017-01-13  7:03         ` Joonsoo Kim
2017-01-19 14:18           ` Vlastimil Babka
2017-03-08 14:56 ` Vlastimil Babka