* [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: David Rientjes @ 2018-03-01 11:42 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

It's possible for buddy pages to become stranded on pcps that, if drained,
could be merged with other buddy pages on the zone's free area to form
large order pages, including up to MAX_ORDER.
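
As a toy illustration (userspace C, not kernel code, and only a sketch of
the effect): a single order-0 page held on a pcp list per aligned 512-page
block is enough to keep any order-9 block from forming until pcps are
drained.

#include <stdbool.h>
#include <stdio.h>

#define SPAN            1024
#define HPAGE_PAGES     512     /* pages per order-9 block */

/* pages free in the buddy allocator; pcp pages are free but unmergeable */
static bool free_in_buddy[SPAN];

static int count_order9_blocks(void)
{
        int n = 0;

        for (int base = 0; base + HPAGE_PAGES <= SPAN; base += HPAGE_PAGES) {
                bool all_free = true;

                for (int i = 0; i < HPAGE_PAGES; i++) {
                        if (!free_in_buddy[base + i]) {
                                all_free = false;
                                break;
                        }
                }
                n += all_free;
        }
        return n;
}

int main(void)
{
        for (int i = 0; i < SPAN; i++)
                free_in_buddy[i] = true;
        /* one page per 512-page block is stranded on a pcp */
        free_in_buddy[100] = false;
        free_in_buddy[700] = false;
        printf("order-9 blocks before drain: %d\n", count_order9_blocks());
        /* a drain returns the stranded pages to the buddy allocator */
        free_in_buddy[100] = free_in_buddy[700] = true;
        printf("order-9 blocks after drain:  %d\n", count_order9_blocks());
        return 0;
}

This prints 0 before the drain and 2 after.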

Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL, where 'B' indicates a buddy page and 'S'
indicates a slab page, which the migration scanner is attempting to
defragment (and doing it well, absent coalescing up to cc.order):

109954  1       _______S________________________________________________________
109955  2       __________B_____________________________________________________
109957  1       ________________________________________________________________
109958  1       __________B_____________________________________________________
109959  7       ________________________________________________________________
109960  1       __________B_____________________________________________________
109961  9       ________________________________________________________________
10996a  1       __________B_____________________________________________________
10996b  3       ________________________________________________________________
10996e  1       __________B_____________________________________________________
10996f  1       ________________________________________________________________
109970  1       __________B_____________________________________________________
109971  f       ________________________________________________________________
...
109f88  1       __________B_____________________________________________________
109f89  3       ________________________________________________________________
109f8c  1       __________B_____________________________________________________
109f8d  2       ________________________________________________________________
109f8f  2       __________B_____________________________________________________
109f91  f       ________________________________________________________________
109fa0  1       __________B_____________________________________________________
109fa1  7       ________________________________________________________________
109fa8  1       __________B_____________________________________________________
109fa9  1       ________________________________________________________________
109faa  1       __________B_____________________________________________________
109fab  1       _______S________________________________________________________

These buddy pages, spanning 1,621 pages, could be coalesced and allow for
three transparent hugepages to be dynamically allocated.  Totaling all
hugepage length spans that could be coalesced, this could yield over 400
hugepages on the zone's free area, yet at the time this /proc/kpageflags
was collected, there were _no_ order-9 or order-10 pages available for
allocation even after triggering compaction through procfs.

When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur.  Compaction for that order will subsequently
be deferred, which acts as a ratelimit on this drain.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1987,6 +1987,14 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (status == COMPACT_SUCCESS) {
 			compaction_defer_reset(zone, cc.order, false);
 		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
+			/*
+			 * Buddy pages may become stranded on pcps that could
+			 * otherwise coalesce on the zone's free area for
+			 * order >= cc.order.  This is ratelimited by the
+			 * upcoming deferral.
+			 */
+			drain_all_pages(zone);
+
 			/*
 			 * We use sync migration mode here, so we defer like
 			 * sync direct compaction does.


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: Vlastimil Babka @ 2018-03-01 12:23 UTC
  To: David Rientjes, Andrew Morton
  Cc: Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On 03/01/2018 12:42 PM, David Rientjes wrote:
> It's possible for buddy pages to become stranded on pcps that, if drained,
> could be merged with other buddy pages on the zone's free area to form
> large order pages, including up to MAX_ORDER.
> 
> Consider a verbose example using the tools/vm/page-types tool at the
> beginning of a ZONE_NORMAL, where 'B' indicates a buddy page and 'S'
> indicates a slab page, which the migration scanner is attempting to
> defragment (and doing it well, absent coalescing up to cc.order):

How can the migration scanner defragment a slab page?

> 109954  1       _______S________________________________________________________
> 109955  2       __________B_____________________________________________________
> 109957  1       ________________________________________________________________
> 109958  1       __________B_____________________________________________________
> 109959  7       ________________________________________________________________
> 109960  1       __________B_____________________________________________________
> 109961  9       ________________________________________________________________
> 10996a  1       __________B_____________________________________________________
> 10996b  3       ________________________________________________________________
> 10996e  1       __________B_____________________________________________________
> 10996f  1       ________________________________________________________________
> 109970  1       __________B_____________________________________________________
> 109971  f       ________________________________________________________________
> ...
> 109f88  1       __________B_____________________________________________________
> 109f89  3       ________________________________________________________________
> 109f8c  1       __________B_____________________________________________________
> 109f8d  2       ________________________________________________________________
> 109f8f  2       __________B_____________________________________________________
> 109f91  f       ________________________________________________________________
> 109fa0  1       __________B_____________________________________________________
> 109fa1  7       ________________________________________________________________
> 109fa8  1       __________B_____________________________________________________
> 109fa9  1       ________________________________________________________________
> 109faa  1       __________B_____________________________________________________
> 109fab  1       _______S________________________________________________________
> 
> These buddy pages, spanning 1,621 pages, could be coalesced and allow for
> three transparent hugepages to be dynamically allocated.  Totaling all
> hugepage length spans that could be coalesced, this could yield over 400
> hugepages on the zone's free area, yet at the time this /proc/kpageflags

I don't understand the numbers here. With order-9 hugepages it's 512
pages per hugepage. If the buddy pages span 1621 pages, how can they
yield 400 hugepages?

> was collected, there were _no_ order-9 or order-10 pages available for
> allocation even after triggering compaction through procfs.
> 
> When kcompactd fails to defragment memory such that a cc.order page can
> be allocated, drain all pcps for the zone back to the buddy allocator so
> this stranding cannot occur.  Compaction for that order will subsequently
> be deferred, which acts as a ratelimit on this drain.

I don't mind the change given the ratelimit, but what difference was
observed in practice?

BTW I wonder if we could be smarter and quicker about the drains. Let a
pcp struct page be easily recognized as such, and store the cpu number
in there. Migration scanner could then maintain a cpumask, and recognize
if the only missing pages for coalescing a cc->order block are on the
pcplists, and then do a targeted drain.
But that only makes sense to implement if it can make a noticeable
difference to offset the additional overhead, of course.

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1987,6 +1987,14 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  		if (status == COMPACT_SUCCESS) {
>  			compaction_defer_reset(zone, cc.order, false);
>  		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
> +			/*
> +			 * Buddy pages may become stranded on pcps that could
> +			 * otherwise coalesce on the zone's free area for
> +			 * order >= cc.order.  This is ratelimited by the
> +			 * upcoming deferral.
> +			 */
> +			drain_all_pages(zone);
> +
>  			/*
>  			 * We use sync migration mode here, so we defer like
>  			 * sync direct compaction does.
> 


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: David Rientjes @ 2018-03-01 13:05 UTC
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On Thu, 1 Mar 2018, Vlastimil Babka wrote:

> On 03/01/2018 12:42 PM, David Rientjes wrote:
> > It's possible for buddy pages to become stranded on pcps that, if drained,
> > could be merged with other buddy pages on the zone's free area to form
> > large order pages, including up to MAX_ORDER.
> > 
> > Consider a verbose example using the tools/vm/page-types tool at the
> > beginning of a ZONE_NORMAL, where 'B' indicates a buddy page and 'S'
> > indicates a slab page, which the migration scanner is attempting to
> > defragment (and doing it well, absent coalescing up to cc.order):
> 
> How can the migration scanner defragment a slab page?
> 

Hi Vlastimil,

It doesn't, I'm showing an entire span of buddy pages that could be 
coalesced into order >= 9 pages, so I thought to include the border pages.  
This was simply the first lengthy span I saw; it's by no means the
longest.

> > 109954  1       _______S________________________________________________________
> > 109955  2       __________B_____________________________________________________
> > 109957  1       ________________________________________________________________
> > 109958  1       __________B_____________________________________________________
> > 109959  7       ________________________________________________________________
> > 109960  1       __________B_____________________________________________________
> > 109961  9       ________________________________________________________________
> > 10996a  1       __________B_____________________________________________________
> > 10996b  3       ________________________________________________________________
> > 10996e  1       __________B_____________________________________________________
> > 10996f  1       ________________________________________________________________
> > 109970  1       __________B_____________________________________________________
> > 109971  f       ________________________________________________________________
> > ...
> > 109f88  1       __________B_____________________________________________________
> > 109f89  3       ________________________________________________________________
> > 109f8c  1       __________B_____________________________________________________
> > 109f8d  2       ________________________________________________________________
> > 109f8f  2       __________B_____________________________________________________
> > 109f91  f       ________________________________________________________________
> > 109fa0  1       __________B_____________________________________________________
> > 109fa1  7       ________________________________________________________________
> > 109fa8  1       __________B_____________________________________________________
> > 109fa9  1       ________________________________________________________________
> > 109faa  1       __________B_____________________________________________________
> > 109fab  1       _______S________________________________________________________
> > 
> > These buddy pages, spanning 1,621 pages, could be coalesced and allow for
> > three transparent hugepages to be dynamically allocated.  Totaling all
> > hugepage length spans that could be coalesced, this could yield over 400
> > hugepages on the zone's free area, yet at the time this /proc/kpageflags
> 
> I don't understand the numbers here. With order-9 hugepages it's 512
> pages per hugepage. If the buddy pages span 1621 pages, how can they
> yield 400 hugepages?
> 

The above span is 0x109faa - 0x109955 = 1,621 pages as an example which 
could be coalesced into three transparent hugepages, as stated, if the 
pages sat on the zone's free area rather than being stranded on pcps.  For 
this system, 
running the numbers on the extremely large /proc/kpageflags, I identified 
spans >512 pages which could be coalesced into >400 hugepages if pcps were 
drained.
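
For reference, a count along these lines can be approximated with a short
reader of /proc/kpageflags (a sketch, not the exact tooling used here):
KPF_BUDDY is bit 10, and pages sitting on pcps have no flags set at all,
so scan for 512-page aligned spans that are entirely buddy or flagless.
Needs root, and the flagless test can overcount pages that were handed
out by alloc_pages() and used directly, so treat the result as an upper
bound.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define KPF_BUDDY       (1ULL << 10)
#define HPAGE_PAGES     512

int main(void)
{
        uint64_t flags[HPAGE_PAGES];    /* one 64-bit flags word per pfn */
        unsigned long pfn = 0, spans = 0;
        int fd = open("/proc/kpageflags", O_RDONLY);

        if (fd < 0)
                return 1;
        while (pread(fd, flags, sizeof(flags),
                     pfn * sizeof(uint64_t)) == (ssize_t)sizeof(flags)) {
                int ok = 1;

                for (int i = 0; i < HPAGE_PAGES; i++) {
                        if (flags[i] && !(flags[i] & KPF_BUDDY))
                                ok = 0;
                }
                spans += ok;
                pfn += HPAGE_PAGES;
        }
        printf("order-9 aligned spans, all buddy or flagless: %lu\n", spans);
        close(fd);
        return 0;
}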

Check this out:

Node 1 MemTotal:       132115772 kB
Node 1 MemFree:        125468300 kB

Free pages count per migrate type at order         0      1      2      3      4      5      6      7      8      9     10 
Node    1, zone   Normal, type      Unmovable  18418  24325  10190   5545   1893    976    487    259     20      0      0 
Node    1, zone   Normal, type        Movable 172691 177791 145558 125810 101482  82792  67745  58527  49923      0      0 
Node    1, zone   Normal, type    Reclaimable   3909   4828   3505   2543   1246    410     47      5      0      0      0 
Node    1, zone   Normal, type  Memcg_Reserve      0      0      0      0      0      0      0      0      0      0      0 
Node    1, zone   Normal, type        Reserve      0      0      0      0      0      0      0      0      0      0      0 

I can't avoid cringing at that.  There is fragmentation as a result of 
slab page allocations, for which we have additional patches that I'll 
propose soon after I gather more data; but for this system I gathered 
/proc/kpageflags and found >400 hugepages that could be coalesced from 
this zone if pcps were drained.

> > was collected, there were _no_ order-9 or order-10 pages available for
> > allocation even after triggering compaction through procfs.
> > 
> > When kcompactd fails to defragment memory such that a cc.order page can
> > be allocated, drain all pcps for the zone back to the buddy allocator so
> > this stranding cannot occur.  Compaction for that order will subsequently
> > be deferred, which acts as a ratelimit on this drain.
> 
> I don't mind the change given the ratelimit, but what difference was
> observed in practice?
> 

It's hard to make a direct correlation given the workloads that are 
scheduled over this set of machines; it takes more than two weeks to get a 
system this fragmented (the uptime from the above examples is ~34 days) so 
any comparison between unpatched and patched kernels depends very heavily 
on what happened over those 34 days and doesn't yield useful results.  The 
significant data is that from collecting /proc/kpageflags at this moment 
in time, I can identify 400 spans of >=512 buddy pages.  The reason we 
have no order-9 and order-10 memory is that these buddy pages cannot be 
coalesced because of stranding on pcps.

I wouldn't consider this a separate type of fragmentation such as 
allocating one slab page from a pageblock that doesn't allow a hugepage to 
be allocated.  Rather, it's a byproduct of a fast page allocator that 
utilizes pcps for super fast allocation and freeing, which results in 
stranding as a side effect.

The change here is to drain all pcp pages so they have a chance to be 
coalesced into high-order pages when compaction fails on a 
fragmented system, such as in the above examples.  I'm most interested in 
the kcompactd failure case because that's where it would be most useful 
for our configurations, but I would understand if we'd want a similar 
change for direct compaction (it would reasonably be done any time we 
choose to defer).

> BTW I wonder if we could be smarter and quicker about the drains. Let a
> pcp struct page be easily recognized as such, and store the cpu number
> in there. Migration scanner could then maintain a cpumask, and recognize
> if the only missing pages for coalescing a cc->order block are on the
> pcplists, and then do a targeted drain.
> But that only makes sense to implement if it can make a noticeable
> difference to offset the additional overhead, of course.
> 

Right, that sounds doable, although I'm not sure the extra overhead is 
warranted in this case.  We could certainly have the migration scanner 
look at every buddy page on the pageblock and set the cpu in a cpumask if 
we store it as part of the pcp, and flag it if anything is non-buddy in 
the case of cc.order >= 9.  It's more complicated for smaller orders.  
Then isolate_migratepages() would store the cpumask for such blocks and 
eventually drain pcps from those cpus only if compaction fails.  Most of 
the time it won't fail until the extreme presented above, in which case 
all that work wouldn't be valuable.  In the cases that it is useful, I 
found that doing drain_all_pages(zone) is most beneficial for the 
side-effect of also freeing pcp pages back to MIGRATE_UNMOVABLE pageblocks 
on the zone's free area to avoid falling back to MIGRATE_MOVABLE 
pageblocks when pages from MIGRATE_UNMOVABLE pageblocks are similarly 
stranded on pcps.


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: Andrew Morton @ 2018-03-01 23:27 UTC
  To: David Rientjes
  Cc: Vlastimil Babka, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On Thu, 1 Mar 2018 03:42:04 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> It's possible for buddy pages to become stranded on pcps that, if drained,
> could be merged with other buddy pages on the zone's free area to form
> large order pages, including up to MAX_ORDER.

I grabbed this as-is.  Perhaps you could send along a new changelog so
that others won't be asking the same questions as Vlastimil?

The patch has no reviews or acks at this time...


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: David Rientjes @ 2018-03-01 23:38 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On Thu, 1 Mar 2018, Andrew Morton wrote:

> On Thu, 1 Mar 2018 03:42:04 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> 
> > It's possible for buddy pages to become stranded on pcps that, if drained,
> > could be merged with other buddy pages on the zone's free area to form
> > large order pages, including up to MAX_ORDER.
> 
> I grabbed this as-is.  Perhaps you could send along a new changelog so
> that others won't be asking the same questions as Vlastimil?
> 
> The patch has no reviews or acks at this time...
> 

Thanks.

As mentioned in my response to Vlastimil, I think the case could also be 
made that we should do drain_all_pages(zone) in try_to_compact_pages() 
when we defer for direct compactors.  It would be great to have feedback 
from those on the cc on that point and on the patch in general, and then I can 
send an update.


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: Vlastimil Babka @ 2018-03-02 10:28 UTC
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On 03/01/2018 02:05 PM, David Rientjes wrote:
> On Thu, 1 Mar 2018, Vlastimil Babka wrote:
> 
>> On 03/01/2018 12:42 PM, David Rientjes wrote:
>>> Consider a verbose example using the tools/vm/page-types tool at the
>>> beginning of a ZONE_NORMAL, where 'B' indicates a buddy page and 'S'
>>> indicates a slab page, which the migration scanner is attempting to
>>> defragment (and doing it well, absent coalescing up to cc.order):
>>
>> How can the migration scanner defragment a slab page?
>>
> 
> Hi Vlastimil,

Hi David,

> It doesn't, I'm showing an entire span of buddy pages that could be 
> coalesced into order >= 9 pages, so I thought to include the border pages.

Sure. But "slab page, which the migration scanner is attempting to
defragment (and doing it well..." sounds rather confusing, so please
reword it :)

...

>>> These buddy pages, spanning 1,621 pages, could be coalesced and allow for
>>> three transparent hugepages to be dynamically allocated.  Totaling all
>>> hugepage length spans that could be coalesced, this could yield over 400
>>> hugepages on the zone's free area, yet at the time this /proc/kpageflags
>>
>> I don't understand the numbers here. With order-9 hugepages it's 512
>> pages per hugepage. If the buddy pages span 1621 pages, how can they
>> yield 400 hugepages?
>>
> 
> The above span is 0x109faa - 0x109955 = 1,621 pages as an example which 
> could be coalesced into three transparent hugepages, as stated, if the 
> pages sat on the zone's free area rather than being stranded on pcps.  For 
> this system, 
> running the numbers on the extremely large /proc/kpageflags, I identified 
> spans >512 pages which could be coalesced into >400 hugepages if pcps were 
> drained.

Looks like I just didn't read that part properly; it looks clear to me now.

> Check this out:
> 
> Node 1 MemTotal:       132115772 kB
> Node 1 MemFree:        125468300 kB
> 
> Free pages count per migrate type at order         0      1      2      3      4      5      6      7      8      9     10 
> Node    1, zone   Normal, type      Unmovable  18418  24325  10190   5545   1893    976    487    259     20      0      0 
> Node    1, zone   Normal, type        Movable 172691 177791 145558 125810 101482  82792  67745  58527  49923      0      0 
> Node    1, zone   Normal, type    Reclaimable   3909   4828   3505   2543   1246    410     47      5      0      0      0 
> Node    1, zone   Normal, type  Memcg_Reserve      0      0      0      0      0      0      0      0      0      0      0 
> Node    1, zone   Normal, type        Reserve      0      0      0      0      0      0      0      0      0      0      0 
> 
> I can't avoid cringing at that.  There is fragmentation as a result of 
> slab page allocations, for which we have additional patches that I'll 
> propose soon after I gather more data; but for this system I gathered 
> /proc/kpageflags and found >400 hugepages that could be coalesced from 
> this zone if pcps were drained.

Yeah that doesn't look very nice :(

>>> was collected, there were _no_ order-9 or order-10 pages available for
>>> allocation even after triggering compaction through procfs.
>>>
>>> When kcompactd fails to defragment memory such that a cc.order page can
>>> be allocated, drain all pcps for the zone back to the buddy allocator so
>>> this stranding cannot occur.  Compaction for that order will subsequently
>>> be deferred, which acts as a ratelimit on this drain.
>>
>> I don't mind the change given the ratelimit, but what difference was
>> observed in practice?
>>
> 
> It's hard to make a direct correlation given the workloads that are 
> scheduled over this set of machines; it takes more than two weeks to get a 
> system this fragmented (the uptime from the above examples is ~34 days) so 
> any comparison between unpatched and patched kernels depends very heavily 
> on what happened over those 34 days and doesn't yield useful results.  The 
> significant data is that from collecting /proc/kpageflags at this moment 
> in time, I can identify 400 spans of >=512 buddy pages.  The reason we 
> have no order-9 and order-10 memory is that these buddy pages cannot be 
> coalesced because of stranding on pcps.
> 
> I wouldn't consider this a separate type of fragmentation such as 
> allocating one slab page from a pageblock that doesn't allow a hugepage to 
> be allocated.  Rather, it's a byproduct of a fast page allocator that 
> utilizes pcps for super fast allocation and freeing, which results in 
> stranding as a side effect.
> 
> The change here is to drain all pcp pages so they have a chance to be 
> coalesced into high-order pages when compaction fails on a 
> fragmented system, such as in the above examples.  I'm most interested in 
> the kcompactd failure case because that's where it would be most useful 

Right, I'm fine with the patch for kcompactd. You can add
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> for our configurations, but I would understand if we'd want a similar 
> change for direct compaction (it would reasonably be done any time we 
> choose to defer).

I don't think that's needed. kcompactd should be woken up for any such
direct compaction anyway. Also with multiple parallel direct compaction
attempts, the drain might be too frequent even with deferring, without
any gain from it?

>> BTW I wonder if we could be smarter and quicker about the drains. Let a
>> pcp struct page be easily recognized as such, and store the cpu number
>> in there. Migration scanner could then maintain a cpumask, and recognize
>> if the only missing pages for coalescing a cc->order block are on the
>> pcplists, and then do a targeted drain.
>> But that only makes sense to implement if it can make a noticeable
>> difference to offset the additional overhead, of course.
>>
> 
> Right, that sounds doable, although I'm not sure the extra overhead is 
> warranted in this case.  We could certainly have the migration scanner 
> look at every buddy page on the pageblock

IIRC the pages on pcplists don't have PageBuddy, just page_count of 0
and migratetype in page->index.

> and set the cpu in a cpumask if 
> we store it as part of the pcp, and flag it if anything is non-buddy in 
> the case of cc.order >= 9.  It's more complicated for smaller orders.  
> Then isolate_migratepages() would store the cpumask for such blocks and 
> eventually drain pcps from those cpus only if compaction fails. 

My idea was that compaction would maintain an initially empty cpumask in
compact_control; when detecting a page that sits on some cpu's pcplist,
add it to the mask.  Also a do_drain flag, initially true, that would be
set to false if e.g. a slab page is encountered (cannot be migrated and
is not already free). If the flag is still true and mask non-empty as we
finish a block, drain the cpus in the mask.
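
Something along these lines, purely illustrative (page_on_pcplist(),
page_pcp_cpu() and drain_pages_cpumask() are hypothetical helpers, not
existing kernel APIs):

        /* hypothetical fields added to compact_control */
        cpumask_t pcp_cpus;     /* cpus whose pcplists hold pages here */
        bool do_drain;          /* starts true; false once block can't merge */

static void note_page(struct compact_control *cc, struct page *page)
{
        if (page_on_pcplist(page)) {            /* hypothetical test */
                cpumask_set_cpu(page_pcp_cpu(page), &cc->pcp_cpus);
        } else if (!PageBuddy(page) && !PageLRU(page)) {
                /* e.g. a slab page: this block can never fully coalesce */
                cc->do_drain = false;
        }
}

        /* after the migration scanner finishes a cc->order aligned block: */
        if (cc->do_drain && !cpumask_empty(&cc->pcp_cpus))
                drain_pages_cpumask(cc->zone, &cc->pcp_cpus);   /* hypothetical */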

> Most of 
> the time it won't fail until the extreme presented above, in which case 
> all that work wouldn't be valuable.

Yeah, maybe the kcompactd change will take care of most such cases already.

> In the cases that it is useful, I 
> found that doing drain_all_pages(zone) is most beneficial for the 
> side-effect of also freeing pcp pages back to MIGRATE_UNMOVABLE pageblocks 
> on the zone's free area to avoid falling back to MIGRATE_MOVABLE 
> pageblocks when pages from MIGRATE_UNMOVABLE pageblocks are similarly 
> stranded on pcps.

Interesting, there might be some value in trying to drain those before
falling back (independently of this patch).


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: Matthew Wilcox @ 2018-03-02 17:28 UTC
  To: Vlastimil Babka
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Joonsoo Kim,
	linux-kernel, linux-mm

On Thu, Mar 01, 2018 at 01:23:34PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 12:42 PM, David Rientjes wrote:
> > It's possible for buddy pages to become stranded on pcps that, if drained,
> > could be merged with other buddy pages on the zone's free area to form
> > large order pages, including up to MAX_ORDER.
> 
> BTW I wonder if we could be smarter and quicker about the drains. Let a
> pcp struct page be easily recognized as such, and store the cpu number
> in there. Migration scanner could then maintain a cpumask, and recognize
> if the only missing pages for coalescing a cc->order block are on the
> pcplists, and then do a targeted drain.
> But that only makes sense to implement if it can make a noticeable
> difference to offset the additional overhead, of course.

Perhaps we should turn this around ... rather than waiting for the
coalescer to come along, when we're about to put a page on the pcp list,
check whether its buddy is PageBuddy().  If so, send it to the buddy
allocator so it can get merged instead of putting it on the pcp list.
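
A minimal sketch of that check for order-0 frees (placement and locking
elided; free_one_page() and free_unref_page_commit() used loosely):

static bool buddy_is_free(struct page *page, unsigned long pfn)
{
        unsigned long buddy_pfn = pfn ^ 1;      /* order-0 buddy */

        return PageBuddy(page + (buddy_pfn - pfn));
}

        /* in the free path, instead of always queueing on the pcp list: */
        if (buddy_is_free(page, pfn))
                free_one_page(zone, page, pfn, 0, migratetype); /* merge now */
        else
                free_unref_page_commit(page, pfn);      /* pcp fast path */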

I can see the negatives of that; if you're in a situation where you've
got a 2^12 block free and allocate one page, that's 12 splits.  Then you
free the page and it does 12 joins.  Then you allocate again and do 12
splits ...

That seems like a relatively rare scenario; we're generally going to
have a lot of pages in motion on any workload we care about, and there are
always going to be pages on the pcp list.

It's not an alternative to David's patch; having page A and page A+1 on
the pcp list will prevent the pages from getting merged.  But it should
delay the time until his bigger hammer kicks in.


* Re: [patch] mm, compaction: drain pcps for zone when kcompactd fails
From: David Rientjes @ 2018-03-06 23:57 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Joonsoo Kim, linux-kernel, linux-mm

On Thu, 1 Mar 2018, David Rientjes wrote:

> On Thu, 1 Mar 2018, Andrew Morton wrote:
> 
> > On Thu, 1 Mar 2018 03:42:04 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> > 
> > > It's possible for buddy pages to become stranded on pcps that, if drained,
> > > could be merged with other buddy pages on the zone's free area to form
> > > large order pages, including up to MAX_ORDER.
> > 
> > I grabbed this as-is.  Perhaps you could send along a new changelog so
> > that others won't be asking the same questions as Vlastimil?
> > 
> > The patch has no reviews or acks at this time...
> > 
> 
> Thanks.
> 
> As mentioned in my response to Vlastimil, I think the case could also be 
> made that we should do drain_all_pages(zone) in try_to_compact_pages() 
> when we defer for direct compactors.  It would be great to have feedback 
> from those on the cc on that point and on the patch in general, and then I can 
> send an update.
> 

Andrew, here's a new changelog that should address the questions asked 
about the patch.


It's possible for free pages to become stranded on per-cpu pagesets (pcps) 
that, if drained, could be merged with buddy pages on the zone's free area 
to form large order pages, including up to MAX_ORDER.

Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates a
slab page).  Pages on pcps do not have any page flags set.

109954  1       _______S________________________________________________________
109955  2       __________B_____________________________________________________
109957  1       ________________________________________________________________
109958  1       __________B_____________________________________________________
109959  7       ________________________________________________________________
109960  1       __________B_____________________________________________________
109961  9       ________________________________________________________________
10996a  1       __________B_____________________________________________________
10996b  3       ________________________________________________________________
10996e  1       __________B_____________________________________________________
10996f  1       ________________________________________________________________
...
109f8c  1       __________B_____________________________________________________
109f8d  2       ________________________________________________________________
109f8f  2       __________B_____________________________________________________
109f91  f       ________________________________________________________________
109fa0  1       __________B_____________________________________________________
109fa1  7       ________________________________________________________________
109fa8  1       __________B_____________________________________________________
109fa9  1       ________________________________________________________________
109faa  1       __________B_____________________________________________________
109fab  1       _______S________________________________________________________

The compaction migration scanner is attempting to defragment this memory 
since it is at the beginning of the zone.  It has done so quite well: all 
movable pages have been migrated.  From pfn [0x109955, 0x109fab), there
are only buddy pages and pages without flags set.

These pages may be stranded on pcps that could otherwise allow this memory 
to be coalesced if freed back to the zone free area.  It is possible that 
some of these pages may not be on pcps and that something has called 
alloc_pages() and used the memory directly, but we rely on the absence of
__GFP_MOVABLE in these cases to allocate from MIGRATE_UNMOVABLE pageblocks 
to try to keep these MIGRATE_MOVABLE pageblocks as free as possible.

These buddy and pcp pages, spanning 1,621 pages, could be coalesced and 
allow for three transparent hugepages to be dynamically allocated.  
Running the numbers for all such spans on the system, it was found that 
there were over 400 such spans of only buddy pages and pages without flags 
set at the time this /proc/kpageflags sample was collected.  Without this 
support, there were _no_ order-9 or order-10 pages free.

When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur.  Compaction for that order will subsequently
be deferred, which acts as a ratelimit on this drain.

Signed-off-by: David Rientjes <rientjes@google.com>

