linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
@ 2016-11-05  7:57 Xishi Qiu
  2016-11-05 12:29 ` Anshuman Khandual
  2016-11-07 23:45 ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: Xishi Qiu @ 2016-11-05  7:57 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko,
	Johannes Weiner, Joonsoo Kim, 'Kirill A . Shutemov',
	Taku Izumi
  Cc: Yisheng Xie, Linux MM, LKML

Android phones usually have very little memory, so after running for a
long time, memory becomes heavily fragmented. The kernel stack allocated
by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
that allocation fails frequently.

We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
cause some regression because of vmalloc: it needs to find an area and
create page tables dynamically, which takes a short time.

I think we can merge pages as soon as possible on pcp alloc/free to
reduce fragmentation. A pcp page is a hot page, so freeing it to the
buddy allocator will cause cache misses. I used perf to test this, and
the regression does not seem large, but it may need more testing. Any
reply is welcome.
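
For readers unfamiliar with the buddy arithmetic, here is a minimal
userspace sketch of the merge the patch below tries to trigger early. It
is a toy model, not kernel code: buddy_pfn(), toy_init(), toy_free_page()
and the free_order[] array are hypothetical names used only for
illustration. For an order-0 page, choosing "page - 1" or "page + 1"
based on bit 0 of the pfn is the same as flipping that bit.

```c
#include <stddef.h>

#define TOY_MAX_ORDER 3UL                  /* largest block: 2^3 = 8 pages */
#define TOY_NPAGES    (1UL << TOY_MAX_ORDER)

/* free_order[pfn] is the order of a free block starting at pfn, or -1. */
static int free_order[TOY_NPAGES];

/* The buddy of a block differs from it only in bit 'order' of the pfn. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

static void toy_init(void)
{
	for (size_t i = 0; i < TOY_NPAGES; i++)
		free_order[i] = -1;
}

/*
 * Free one order-0 page and merge upward while the buddy is free at the
 * same order. Returns the order of the block the page ends up in.
 */
static unsigned int toy_free_page(unsigned long pfn)
{
	unsigned int order = 0;

	while (order < TOY_MAX_ORDER) {
		unsigned long buddy = buddy_pfn(pfn, order);

		if (free_order[buddy] != (int)order)
			break;			/* buddy busy or split: stop */
		free_order[buddy] = -1;		/* take buddy off its free list */
		pfn &= ~(1UL << order);		/* merged block starts at lower pfn */
		order++;
	}
	free_order[pfn] = (int)order;
	return order;
}
```

In this model a page parked on a pcp list would simply look "not free"
to its buddy, which is why pages sitting on pcp lists can block merging.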

no patch:
perf stat -e cache-misses make -j50

Kernel: arch/x86/boot/bzImage is ready  (#10)

 Performance counter stats for 'make -j50':

    17,845,292,704      cache-misses

     157.605906725 seconds time elapsed

patched:
perf stat -e cache-misses make -j50

Kernel: arch/x86/boot/bzImage is ready  (#8)

 Performance counter stats for 'make -j50':

    17,876,726,774      cache-misses

     156.293720662 seconds time elapsed
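
To put the two runs side by side, the relative change can be worked out
with a small helper (rel_change_pct() is a hypothetical name, not part
of perf). With the numbers above, cache misses rise by roughly 0.18% and
elapsed time drops by roughly 0.83%. Only one run per configuration is
shown, so this is a rough comparison rather than a statistically sound
one.

```c
/* Relative change from 'before' to 'after', in percent. */
static double rel_change_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, rel_change_pct(17845292704.0, 17876726774.0) gives the
cache-miss change, and rel_change_pct(157.605906725, 156.293720662) the
elapsed-time change.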

no patch:
make clean, dropcache, then make -j50, CONFIG_VMAP_STACK is off
[root@localhost ~]# cat /proc/buddyinfo
Node 0, zone      DMA      3      0      2      1      3      2      2      1      0      1      3
Node 0, zone    DMA32      4      4      1      5      2      4      2      2      3      1    447
Node 0, zone   Normal   2389    418    668    707    738    451    246     93     42     21  15147
Node 1, zone   Normal   1137    386    583    631    878    311     80     12      2      8  15640
Node 2, zone   Normal   1875    230    323    462    729    453    177     67     12      9  15749
Node 3, zone   Normal   1675    452    503    898    928    628    256     70     25     14  11688
Node 4, zone   Normal   1917    407    306   2706   1722    909    477    218     54     34  15682
Node 5, zone   Normal   4330   9785   6265   2612   1404    703    276    113     33      7  15730
Node 6, zone   Normal    754    211   1093   1023    748    599    352    193    107     43  15672
Node 7, zone   Normal   1092    133    819    807    729    549    254    120     52     28  15500
[root@localhost ~]# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.002 0.004 0.016 0.032 0.065 0.097 0.097 0.226
Node 0, zone    DMA32 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.004
Node 0, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.003 0.004 0.004 0.005
Node 1, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.002 0.002 0.002 0.002
Node 2, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.002 0.003 0.003 0.003
Node 3, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.002 0.003 0.005 0.005 0.006 0.007
Node 4, zone   Normal 0.000 0.000 0.000 0.000 0.001 0.003 0.005 0.006 0.008 0.009 0.010
Node 5, zone   Normal 0.000 0.000 0.001 0.003 0.004 0.005 0.007 0.008 0.009 0.009 0.009
Node 6, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.004 0.005 0.007 0.008
Node 7, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.003 0.004 0.005 0.006

patched:
make clean, dropcache, then make -j50, CONFIG_VMAP_STACK is off
[root@localhost ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      1      2      1      3      2      2      1      0      1      3
Node 0, zone    DMA32      3      3      0      2      2      4      2      2      3      1    447
Node 0, zone   Normal   1293   1097    159    564    620    392    242     89     49     21  15154
Node 1, zone   Normal   1195    369    155     73    295    260     92     32      8     10  15769
Node 2, zone   Normal   1478    434    160    846   1397    590    274    118     39     25  15753
Node 3, zone   Normal    892    285    176    625    691    450    226     78     33     14  11596
Node 4, zone   Normal    604    217     28    468   1560    690    292    126     46     31  15741
Node 5, zone   Normal    888    225    101    263    483    319    196     97     30     24  15726
Node 6, zone   Normal   1908   9294   7075   3373   1765    759    243    128     21     20  15591
Node 7, zone   Normal   1362   1126   1271    646    558    377    170     84     37     35  15602
[root@localhost ~]# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.002 0.004 0.016 0.032 0.065 0.097 0.097 0.226
Node 0, zone    DMA32 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.004
Node 0, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.002 0.003 0.004 0.005
Node 1, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.001 0.001
Node 2, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.002 0.003 0.004 0.005 0.005 0.006
Node 3, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.003 0.004 0.005 0.005
Node 4, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.003 0.004 0.005 0.006 0.007
Node 5, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.002 0.003 0.004
Node 6, zone   Normal 0.000 0.000 0.001 0.003 0.004 0.006 0.007 0.008 0.009 0.010 0.010
Node 7, zone   Normal 0.000 0.000 0.000 0.000 0.000 0.001 0.002 0.002 0.003 0.004 0.005
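
As a reading aid for the unusable_index columns: to my understanding,
the value for order j is the fraction of free pages that sit in blocks
smaller than order j, so they cannot satisfy an order-j request. The
sketch below recomputes it from one /proc/buddyinfo row. It is a
userspace approximation, and unusable_index_milli() is a hypothetical
name, not the kernel function. Returning 1000 when nothing is free is an
assumption about how "no free memory" should be treated.

```c
#include <stdint.h>

#define NR_ORDERS 11	/* orders 0..10, one per buddyinfo column */

/*
 * Unusable free space index for a request of the given order, in
 * thousandths (226 means 0.226). nr_free[i] is the number of free
 * blocks of order i.
 */
static unsigned int unusable_index_milli(const unsigned long nr_free[NR_ORDERS],
					 unsigned int order)
{
	uint64_t free_pages = 0, usable_pages = 0;

	for (unsigned int i = 0; i < NR_ORDERS; i++) {
		uint64_t pages = (uint64_t)nr_free[i] << i;

		free_pages += pages;
		if (i >= order)
			usable_pages += pages;	/* block is large enough */
	}
	if (free_pages == 0)
		return 1000;	/* assumption: nothing free counts as all unusable */
	return (unsigned int)((free_pages - usable_pages) * 1000 / free_pages);
}
```

Feeding in the Node 0 DMA row from the unpatched run
(3 0 2 1 3 2 2 1 0 1 3) gives 226 for order 10, matching the 0.226
printed above. Busier zones may not match exactly because the two files
were read at different moments.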


Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
---
 mm/page_alloc.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa..82257e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2413,6 +2413,8 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
+	unsigned long page_idx = pfn & 1UL;
+	struct page *buddy;
 
 	if (!free_pcp_prepare(page))
 		return;
@@ -2437,6 +2439,16 @@ void free_hot_cold_page(struct page *page, bool cold)
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	if (page_idx)
+		buddy = page - 1;
+	else
+		buddy = page + 1;
+	/* merge immediately if buddy is free */
+	if (PageBuddy(buddy)) {
+		free_one_page(zone, page, pfn, 0, migratetype);
+		goto out;
+	}
+
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (!cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
@@ -2591,8 +2603,12 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
+		unsigned long page_idx;
+		struct page *buddy;
+		int retry = 0;
 
 		local_irq_save(flags);
+retry:
 		do {
 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
 			list = &pcp->lists[migratetype];
@@ -2612,6 +2628,19 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 			list_del(&page->lru);
 			pcp->count--;
 
+			page_idx = page_to_pfn(page) & 1UL;
+			if (page_idx)
+				buddy = page - 1;
+			else
+				buddy = page + 1;
+			/* merge immediately if buddy is free */
+			if (PageBuddy(buddy) && retry < 3) {
+				free_one_page(page_zone(page), page,
+						page_to_pfn(page), 0, migratetype);
+				retry++;
+				goto retry;
+			}
+
 		} while (check_new_pcp(page));
 	} else {
 		/*
-- 
1.8.3.1


* Re: [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
  2016-11-05  7:57 [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free Xishi Qiu
@ 2016-11-05 12:29 ` Anshuman Khandual
  2016-11-07  1:48   ` Xishi Qiu
  2016-11-07 23:45 ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Anshuman Khandual @ 2016-11-05 12:29 UTC (permalink / raw)
  To: Xishi Qiu, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Michal Hocko, Johannes Weiner, Joonsoo Kim,
	'Kirill A . Shutemov',
	Taku Izumi
  Cc: Yisheng Xie, Linux MM, LKML

On 11/05/2016 01:27 PM, Xishi Qiu wrote:
> Android phones usually have very little memory, so after running for a
> long time, memory becomes heavily fragmented. The kernel stack allocated
> by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
> that allocation fails frequently.
> 
> We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
> cause some regression because of vmalloc: it needs to find an area and
> create page tables dynamically, which takes a short time.
> 
> I think we can merge pages as soon as possible on pcp alloc/free to
> reduce fragmentation. A pcp page is a hot page, so freeing it to the
> buddy allocator will cause cache misses. I used perf to test this, and
> the regression does not seem large, but it may need more testing. Any
> reply is welcome.

The idea of PCP is to have a fast allocation mechanism which does not depend
on an interrupt-safe spin lock for every allocation. I am not very familiar
with this part of the code, but the following documentation from Mel Gorman
explains that the type of fragmentation problem you might be observing is
one of the limitations of the PCP mechanism.

https://www.kernel.org/doc/gorman/html/understand/understand009.html
"Per CPU page list" sub header.


* Re: [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
  2016-11-05 12:29 ` Anshuman Khandual
@ 2016-11-07  1:48   ` Xishi Qiu
  2016-11-07  4:50     ` Anshuman Khandual
  0 siblings, 1 reply; 6+ messages in thread
From: Xishi Qiu @ 2016-11-07  1:48 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko,
	Johannes Weiner, Joonsoo Kim, 'Kirill A . Shutemov',
	Taku Izumi, Yisheng Xie, Linux MM, LKML

On 2016/11/5 20:29, Anshuman Khandual wrote:

> On 11/05/2016 01:27 PM, Xishi Qiu wrote:
>> Android phones usually have very little memory, so after running for a
>> long time, memory becomes heavily fragmented. The kernel stack allocated
>> by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
>> that allocation fails frequently.
>>
>> We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
>> cause some regression because of vmalloc: it needs to find an area and
>> create page tables dynamically, which takes a short time.
>>
>> I think we can merge pages as soon as possible on pcp alloc/free to
>> reduce fragmentation. A pcp page is a hot page, so freeing it to the
>> buddy allocator will cause cache misses. I used perf to test this, and
>> the regression does not seem large, but it may need more testing. Any
>> reply is welcome.
> 
> The idea of PCP is to have a fast allocation mechanism which does not depend
> on an interrupt-safe spin lock for every allocation. I am not very familiar
> with this part of the code, but the following documentation from Mel Gorman
> explains that the type of fragmentation problem you might be observing is
> one of the limitations of the PCP mechanism.
> 
> https://www.kernel.org/doc/gorman/html/understand/understand009.html
> "Per CPU page list" sub header.
> 

"The last potential problem is that buddies of newly freed pages could exist
in other pagesets leading to possible fragmentation problems."
So we should not change it, and this is a known issue, right?

Thanks,
Xishi Qiu



* Re: [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
  2016-11-07  1:48   ` Xishi Qiu
@ 2016-11-07  4:50     ` Anshuman Khandual
  0 siblings, 0 replies; 6+ messages in thread
From: Anshuman Khandual @ 2016-11-07  4:50 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko,
	Johannes Weiner, Joonsoo Kim, 'Kirill A . Shutemov',
	Taku Izumi, Yisheng Xie, Linux MM, LKML

On 11/07/2016 07:18 AM, Xishi Qiu wrote:
> On 2016/11/5 20:29, Anshuman Khandual wrote:
> 
>> On 11/05/2016 01:27 PM, Xishi Qiu wrote:
>>> Android phones usually have very little memory, so after running for a
>>> long time, memory becomes heavily fragmented. The kernel stack allocated
>>> by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
>>> that allocation fails frequently.
>>>
>>> We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
>>> cause some regression because of vmalloc: it needs to find an area and
>>> create page tables dynamically, which takes a short time.
>>>
>>> I think we can merge pages as soon as possible on pcp alloc/free to
>>> reduce fragmentation. A pcp page is a hot page, so freeing it to the
>>> buddy allocator will cause cache misses. I used perf to test this, and
>>> the regression does not seem large, but it may need more testing. Any
>>> reply is welcome.
>>
>> The idea of PCP is to have a fast allocation mechanism which does not depend
>> on an interrupt-safe spin lock for every allocation. I am not very familiar
>> with this part of the code, but the following documentation from Mel Gorman
>> explains that the type of fragmentation problem you might be observing is
>> one of the limitations of the PCP mechanism.
>>
>> https://www.kernel.org/doc/gorman/html/understand/understand009.html
>> "Per CPU page list" sub header.
>>
> 
> "The last potential problem is that buddies of newly freed pages could exist
> in other pagesets leading to possible fragmentation problems."
> So we should not change it, and this is a known issue, right?

Seems like that.


* Re: [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
  2016-11-05  7:57 [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free Xishi Qiu
  2016-11-05 12:29 ` Anshuman Khandual
@ 2016-11-07 23:45 ` Andrew Morton
  2016-11-08 11:03   ` Mel Gorman
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2016-11-07 23:45 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Vlastimil Babka, Mel Gorman, Michal Hocko, Johannes Weiner,
	Joonsoo Kim, 'Kirill A . Shutemov',
	Taku Izumi, Yisheng Xie, Linux MM, LKML

On Sat, 5 Nov 2016 15:57:55 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:

> Android phones usually have very little memory, so after running for a
> long time, memory becomes heavily fragmented. The kernel stack allocated
> by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
> that allocation fails frequently.
> 
> We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
> cause some regression because of vmalloc: it needs to find an area and
> create page tables dynamically, which takes a short time.
> 
> I think we can merge pages as soon as possible on pcp alloc/free to
> reduce fragmentation. A pcp page is a hot page, so freeing it to the
> buddy allocator will cause cache misses. I used perf to test this, and
> the regression does not seem large, but it may need more testing. Any
> reply is welcome.

per-cpu pages may not be worth the effort on such systems - probably
benefit is small.  I discussed this with Mel a few years ago and I
think he did some testing, but I forget the results?

Anyway, if per-cpu pages are causing problems then perhaps we should
have a Kconfig option which simply eliminates them: free these pages
direct into the buddy.  If the resulting code is clean-looking and the
performance testing on small systems shows decent results then that
should address the issues you're seeing.


* Re: [RFC][PATCH] mm: merge as soon as possible when pcp alloc/free
  2016-11-07 23:45 ` Andrew Morton
@ 2016-11-08 11:03   ` Mel Gorman
  0 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2016-11-08 11:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Xishi Qiu, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Joonsoo Kim, 'Kirill A . Shutemov',
	Taku Izumi, Yisheng Xie, Linux MM, LKML

On Mon, Nov 07, 2016 at 03:45:32PM -0800, Andrew Morton wrote:
> On Sat, 5 Nov 2016 15:57:55 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
> 
> > Android phones usually have very little memory, so after running for a
> > long time, memory becomes heavily fragmented. The kernel stack allocated
> > by alloc_thread_stack_node() usually needs 16K of contiguous memory, and
> > that allocation fails frequently.
> > 
> > We have CONFIG_VMAP_STACK now, but it does not support arm64, and it may
> > cause some regression because of vmalloc: it needs to find an area and
> > create page tables dynamically, which takes a short time.
> > 
> > I think we can merge pages as soon as possible on pcp alloc/free to
> > reduce fragmentation. A pcp page is a hot page, so freeing it to the
> > buddy allocator will cause cache misses. I used perf to test this, and
> > the regression does not seem large, but it may need more testing. Any
> > reply is welcome.
> 
> per-cpu pages may not be worth the effort on such systems - probably
> benefit is small.  I discussed this with Mel a few years ago and I
> think he did some testing, but I forget the results?
> 

I'm still on holidays so not in the position to review closely but in
general, aggressively merging per-cpu pages is expected to be a bust and
offset heavily by increased contention on zone lock. A batch free early in
the lifetime of the system is going to hit such a heuristic aggressively
even if fragmentation overall is fine.

> Anyway, if per-cpu pages are causing problems then perhaps we should
> have a Kconfig option which simply eliminates them: free these pages
> direct into the buddy.  If the resulting code is clean-looking and the
> performance testing on small systems shows decent results then that
> should address the issues you're seeing.

I know for a fact that deleting the per-cpu allocator works but overall
performance fell down a hole when there were multiple parallel allocation
requests (multiple processes faulting for example). There were prototype
patches that used per-socket locks to minimise costs of cache misses but it
never improved the performance of the page allocator while having similar
properties in terms of fragmentation.

In general, my view is that the latency reduction of the page allocator
went too far since 3.0 which had the strongest protection against
fragmentation. It's now too willing to mix pageblocks together in the
name of latency, mostly done in the name of THP and fragmentation simply
degrades far faster than it used to. Tackling it from the per-cpu
allocator is the wrong direction IMO.

Overall, there is a definite lack of workloads that routinely create
fragmentation in a manner that is reproducible, representative and
measurable. A lot of the patches are vague hand waving and it's not a good
enough basis for merging patches. I know the stress high-alloc workload
exists which was fine in 3.0, but not fine today with larger memory sizes,
the existence of SLUB high-order allocations and a much more aggressive
mix of THP allocations. Ideally that point would be addressed first as a
basis for further work.

After that, one approach would be to review control of pageblocks and be more
willing to protect pageblocks by migrating movable pages out of pageblocks
that MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE steals even if it's deferred
to kswapd with the view to avoiding further fragmentation. Back in 3.0, an
unreleased prototype existed for that but the fragmentation protection was so
strong, it had no benefit. I don't have the prototype any more unfortunately.

-- 
Mel Gorman
SUSE Labs

