linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
@ 2018-01-24  2:30 Aaron Lu
  2018-01-24  2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Aaron Lu @ 2018-01-24  2:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
zone->lock is held and then pages are chosen from the PCP's migratetype
list. There is actually no need to do this 'choose' part under the
lock: since these are PCP pages, the only CPU that can touch them is
the current one and IRQs are also disabled.

Moving this part outside the lock reduces lock hold time and improves
performance. Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   9037332        8000124       13642741       15728686
this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%

What the test does is: start $nr_cpu processes, each of which repeatedly
does the following for 5 minutes:
1 mmap 128M of anonymous space;
2 write to that space;
3 munmap.
The score is the aggregated number of iterations.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
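
For reference, a minimal sketch of the per-process loop described above
(modelled on the linked page_fault1 test; the 128M mapping and the
write-then-munmap steps come from the description, everything else is
illustrative and not the actual test code):

/* page_fault1-like loop: fault in 128M of anon memory, then unmap it. */
#include <stdio.h>
#include <sys/mman.h>

#define MEMSIZE	(128UL * 1024 * 1024)	/* 128M anonymous mapping */
#define PAGESZ	4096UL			/* assumed page size */

static int one_iteration(void)
{
	unsigned long off;
	char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 0;

	/* Write one byte per page so every page gets faulted in. */
	for (off = 0; off < MEMSIZE; off += PAGESZ)
		p[off] = 1;

	/* Unmapping frees the pages, eventually via free_pcppages_bulk(). */
	munmap(p, MEMSIZE);
	return 1;
}

int main(void)
{
	unsigned long iters = 0;
	int i;

	/* The real test runs $nr_cpu of these loops for 5 minutes; a fixed
	 * count keeps this sketch self-contained. */
	for (i = 0; i < 1000; i++)
		iters += one_iteration();

	printf("iterations: %lu\n", iters);
	return 0;
}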

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..a076f754dac1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	bool isolated_pageblocks;
+	struct list_head head;
+	struct page *page, *tmp;
 
-	spin_lock(&zone->lock);
-	isolated_pageblocks = has_isolate_pageblock(zone);
+	INIT_LIST_HEAD(&head);
 
 	while (count) {
-		struct page *page;
 		struct list_head *list;
 
 		/*
@@ -1140,26 +1140,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
-			int mt;	/* migratetype of the to-be-freed page */
-
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 
-			mt = get_pcppage_migratetype(page);
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
-
 			if (bulkfree_pcp_prepare(page))
 				continue;
 
-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
+			list_add_tail(&page->lru, &head);
 		} while (--count && --batch_free && !list_empty(list));
 	}
+
+	spin_lock(&zone->lock);
+	isolated_pageblocks = has_isolate_pageblock(zone);
+
+	list_for_each_entry_safe(page, tmp, &head, lru) {
+		int mt = get_pcppage_migratetype(page);
+		/* MIGRATE_ISOLATE page should not go to pcplists */
+		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+		/* Pageblock could have been isolated meanwhile */
+		if (unlikely(isolated_pageblocks))
+			mt = get_pageblock_migratetype(page);
+
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+	}
 	spin_unlock(&zone->lock);
 }
 
-- 
2.14.3


* [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24  2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu
@ 2018-01-24  2:30 ` Aaron Lu
  2018-01-24 16:43   ` Mel Gorman
  2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman
  2018-02-05  5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu
  2 siblings, 1 reply; 19+ messages in thread
From: Aaron Lu @ 2018-01-24  2:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing the buddy's
page structure, and that access can take a long time if it's cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster.
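
For context, __find_buddy_pfn() is just an XOR of the pfn with the order
bit, so for order-0 the buddy is simply the neighbouring page. A sketch of
the helper (it lives in mm/internal.h in this kernel), shown here only for
illustration:

/* The buddy of a page at 'order' differs from it only in bit 'order'. */
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	return page_pfn ^ (1 << order);
}

/*
 * For order-0 this means buddy_pfn = pfn ^ 1. The prefetch added below
 * only touches the buddy's struct page cacheline, not the page contents.
 */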

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   9037332        8000124       13642741       15728686
patch1/2    9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
this patch 10462292 +8.9%  8602889 +2.8% 14802073 +5.4% 17624575 +1.1%

Note: this patch's performance improvement percentage is computed against patch 1/2.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a076f754dac1..9ef084d41708 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1140,6 +1140,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1148,6 +1151,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to
+			 * the global pool, prefetch its buddy to
+			 * speed up later access under zone->lock.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
-- 
2.14.3


* Re: [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-01-24  2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu
  2018-01-24  2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu
@ 2018-01-24 16:40 ` Mel Gorman
  2018-01-25  7:21   ` [PATCH v2 " Aaron Lu
  2018-02-05  5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu
  2 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2018-01-24 16:40 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 10:30:49AM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> zone->lock is held and then pages are chosen from the PCP's migratetype
> list. There is actually no need to do this 'choose' part under the
> lock: since these are PCP pages, the only CPU that can touch them is
> the current one and IRQs are also disabled.
> 
> Moving this part outside the lock reduces lock hold time and improves
> performance. Test with will-it-scale/page_fault1 full load:
> 
> kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.15-rc4   9037332        8000124       13642741       15728686
> this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
> 
> What the test does is: start $nr_cpu processes, each of which repeatedly
> does the following for 5 minutes:
> 1 mmap 128M of anonymous space;
> 2 write to that space;
> 3 munmap.
> The score is the aggregated number of iterations.
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  mm/page_alloc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4093728f292e..a076f754dac1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1113,12 +1113,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	int migratetype = 0;
>  	int batch_free = 0;
>  	bool isolated_pageblocks;
> +	struct list_head head;
> +	struct page *page, *tmp;
>  
> -	spin_lock(&zone->lock);
> -	isolated_pageblocks = has_isolate_pageblock(zone);
> +	INIT_LIST_HEAD(&head);
>  

Declare head as LIST_HEAD(head) and avoid INIT_LIST_HEAD. Otherwise I
think this is safe
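
For readers unfamiliar with the macro, the two forms are equivalent;
LIST_HEAD() simply combines the declaration and the initialisation (both
macros come from include/linux/list.h):

	/* Open-coded form used in the patch: */
	struct list_head head;
	INIT_LIST_HEAD(&head);

	/* Equivalent declaration Mel suggests: */
	LIST_HEAD(head);  /* struct list_head head = LIST_HEAD_INIT(head); */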

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24  2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu
@ 2018-01-24 16:43   ` Mel Gorman
  2018-01-24 16:57     ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2018-01-24 16:43 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 10:30:50AM +0800, Aaron Lu wrote:
> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing the buddy's
> page structure, and that access can take a long time if it's cache cold.
> 
> This patch adds a prefetch of the to-be-freed page's buddy outside of
> zone->lock, in the hope that accessing the buddy's page structure later
> under zone->lock will be faster.
> 
> Test with will-it-scale/page_fault1 full load:
> 
> kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.15-rc4   9037332        8000124       13642741       15728686
> patch1/2    9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
> this patch 10462292 +8.9%  8602889 +2.8% 14802073 +5.4% 17624575 +1.1%
> 
> Note: this patch's performance improvement percentage is computed against patch 1/2.
> 

I'm less convinced by this for a microbenchmark. Prefetch has not been a
universal win in the past and we cannot be sure that it's a good idea on
all architectures or doesn't have other side-effects such as consuming
memory bandwidth for data we don't need or evicting cache hot data for
buddy information that is not used. Furthermore, we end up doing some
calculations twice without any guarantee that the prefetch can offset
the cost.

It's not strong enough of an opinion to outright NAK it but I'm not
ACKing it either.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24 16:43   ` Mel Gorman
@ 2018-01-24 16:57     ` Dave Hansen
  2018-01-24 18:19       ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-01-24 16:57 UTC (permalink / raw)
  To: Mel Gorman, Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Kemi Wang,
	Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On 01/24/2018 08:43 AM, Mel Gorman wrote:
> I'm less convinced by this for a microbenchmark. Prefetch has not been a
> universal win in the past and we cannot be sure that it's a good idea on
> all architectures or doesn't have other side-effects such as consuming
> memory bandwidth for data we don't need or evicting cache hot data for
> buddy information that is not used.

I had the same reaction.

But, I think this case is special.  We *always* do buddy merging (well,
before the next patch in the series is applied) and check an order-0
page's buddy to try to merge it when it goes into the main allocator.
So, the cacheline will always come in.

IOW, I don't think this has the same downsides normally associated with
prefetch() since the data is always used.


* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24 16:57     ` Dave Hansen
@ 2018-01-24 18:19       ` Mel Gorman
  2018-01-24 19:23         ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2018-01-24 18:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 08:57:43AM -0800, Dave Hansen wrote:
> On 01/24/2018 08:43 AM, Mel Gorman wrote:
> > I'm less convinced by this for a microbenchmark. Prefetch has not been a
> > universal win in the past and we cannot be sure that it's a good idea on
> > all architectures or doesn't have other side-effects such as consuming
> > memory bandwidth for data we don't need or evicting cache hot data for
> > buddy information that is not used.
> 
> I had the same reaction.
> 
> But, I think this case is special.  We *always* do buddy merging (well,
> before the next patch in the series is applied) and check an order-0
> page's buddy to try to merge it when it goes into the main allocator.
> So, the cacheline will always come in.
> 
> IOW, I don't think this has the same downsides normally associated with
> prefetch() since the data is always used.

That doesn't side-step the fact that the calculations are done twice in
the free_pcppages_bulk path, and there is no guarantee that one prefetch
in the list of pages being freed will not evict a previous prefetch
due to collisions. At least on the machine I'm writing this from, the
prefetches necessary for a standard drain are 1/16th of the L1D cache so
some collisions/evictions are possible. We're doing definite work in one
path on the chance it'll still be cache-resident when it's recalculated.
I suspect that only a microbenchmark that is doing very large amounts of
frees (or a large munmap or exit) will notice and the costs of a large
munmap/exit are so high that the prefetch will be negligible savings.
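
(For scale: assuming the typical pcp->batch of 31 mentioned later in this
thread and one 64-byte struct page cacheline per buddy, a standard drain
prefetches roughly 31 * 64B ~= 2KB, i.e. about 1/16th of a 32KB L1D, which
matches the figure above.)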

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24 18:19       ` Mel Gorman
@ 2018-01-24 19:23         ` Dave Hansen
  2018-01-24 21:12           ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-01-24 19:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On 01/24/2018 10:19 AM, Mel Gorman wrote:
>> IOW, I don't think this has the same downsides normally associated with
>> prefetch() since the data is always used.
> That doesn't side-step the fact that the calculations are done twice in
> the free_pcppages_bulk path, and there is no guarantee that one prefetch
> in the list of pages being freed will not evict a previous prefetch
> due to collisions.

Fair enough.  The description here could probably use some touchups to
explicitly spell out the downsides.

I do agree with you that there is no guarantee that this will be
resident in the cache before use.  In fact, it might be entertaining to
see if we can show the extra conflicts in the L1 given from this change
given a large enough PCP batch size.

But, this isn't just about the L1.  If the results of the prefetch()
stay in *ANY* cache, then the memory bandwidth impact of this change is
still zero.  You'll have a lot harder time arguing that we're likely to
see L2/L3 evictions in this path for our typical PCP batch sizes.

Do you want to see some analysis for less-frequent PCP frees?  We could
pretty easily instrument the latency doing normal-ish things to see if
we can measure a benefit from this rather than a tight-loop micro.


* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24 19:23         ` Dave Hansen
@ 2018-01-24 21:12           ` Mel Gorman
  2018-01-25  7:25             ` [PATCH v2 " Aaron Lu
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2018-01-24 21:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 11:23:49AM -0800, Dave Hansen wrote:
> On 01/24/2018 10:19 AM, Mel Gorman wrote:
> >> IOW, I don't think this has the same downsides normally associated with
> >> prefetch() since the data is always used.
> > That doesn't side-step the fact that the calculations are done twice in
> > the free_pcppages_bulk path, and there is no guarantee that one prefetch
> > in the list of pages being freed will not evict a previous prefetch
> > due to collisions.
> 
> Fair enough.  The description here could probably use some touchups to
> explicitly spell out the downsides.
> 

It would be preferable. As I said, I'm not explicitly NAKing this but it
might push someone else over the edge into an outright ACK. I think patch
1 should go ahead as-is unconditionally as I see no reason to hold that
one back.

I would suggest adding the detail in the changelog that a prefetch will
potentially evict an earlier prefetch from the L1 cache but it is expected
the data would still be preserved in an L2 or L3 cache. Further note that
while there is some additional instruction overhead, it is required that
the data be fetched eventually and it's expected in many cases that cycles
spent early will be offset by reduced memory latency later. Finally note
that the actual benefit will be workload/CPU dependent.

Also consider adding a comment above the actual prefetch because it deserves
one; otherwise it looks like a fast path is being sprinkled with magic dust
from the prefetch fairy.

> I do agree with you that there is no guarantee that this will be
> resident in the cache before use.  In fact, it might be entertaining to
> see if we can show the extra conflicts in the L1 from this change, given
> a large enough PCP batch size.
> 

Well, I wouldn't bother worrying about different PCP batch sizes.  In typical
operations, it's going to be the pcp->batch size. Even if you were dumping
the entire PCP due to a drain, it's still going to be less than many L1
sizes on x86 at least and those drains are usually in the context of a
much larger operation where the overhead of the buddy calculations will
be negligible in comparison.

> But, this isn't just about the L1.  If the results of the prefetch()
> stay in *ANY* cache, then the memory bandwidth impact of this change is
> still zero.  You'll have a lot harder time arguing that we're likely to
> see L2/L3 evictions in this path for our typical PCP batch sizes.
> 

s/lot harder time/impossible without making crap up/

> Do you want to see some analysis for less-frequent PCP frees? 

Actually, no I don't. While it would be interesting to see, it's a waste
of your time. Less-frequent PCP frees mean that the impact of the extra
cycles is *also* marginal and a PCP drain must fetch the data eventually.

> We could
> pretty easily instrument the latency doing normal-ish things to see if
> we can measure a benefit from this rather than a tight-loop micro.

I believe the only realistic scenario where it's going to be detectable
is a network-intensive application on very high-speed networks where the
cost of the alloc/free paths tends to be more noticeable. I suspect anything
else will be so far into the noise that it'll be unnoticeable.

-- 
Mel Gorman
SUSE Labs


* [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman
@ 2018-01-25  7:21   ` Aaron Lu
  2018-02-15 12:06     ` Mel Gorman
  2018-02-15 12:46     ` Matthew Wilcox
  0 siblings, 2 replies; 19+ messages in thread
From: Aaron Lu @ 2018-01-25  7:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
zone->lock is held and then pages are chosen from the PCP's migratetype
list. There is actually no need to do this 'choose' part under the
lock: since these are PCP pages, the only CPU that can touch them is
the current one and IRQs are also disabled.

Moving this part outside the lock reduces lock hold time and improves
performance. Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   9037332        8000124       13642741       15728686
this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%

What the test does is: start $nr_cpu processes, each of which repeatedly
does the following for 5 minutes:
1 mmap 128M of anonymous space;
2 write to that space;
3 munmap.
The score is the aggregated number of iterations.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: use LIST_HEAD(head) as suggested by Mel Gorman.

 mm/page_alloc.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..c9e5ded39b16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	bool isolated_pageblocks;
-
-	spin_lock(&zone->lock);
-	isolated_pageblocks = has_isolate_pageblock(zone);
+	struct page *page, *tmp;
+	LIST_HEAD(head);
 
 	while (count) {
-		struct page *page;
 		struct list_head *list;
 
 		/*
@@ -1140,26 +1138,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
-			int mt;	/* migratetype of the to-be-freed page */
-
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 
-			mt = get_pcppage_migratetype(page);
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
-
 			if (bulkfree_pcp_prepare(page))
 				continue;
 
-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
+			list_add_tail(&page->lru, &head);
 		} while (--count && --batch_free && !list_empty(list));
 	}
+
+	spin_lock(&zone->lock);
+	isolated_pageblocks = has_isolate_pageblock(zone);
+
+	list_for_each_entry_safe(page, tmp, &head, lru) {
+		int mt = get_pcppage_migratetype(page);
+		/* MIGRATE_ISOLATE page should not go to pcplists */
+		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+		/* Pageblock could have been isolated meanwhile */
+		if (unlikely(isolated_pageblocks))
+			mt = get_pageblock_migratetype(page);
+
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+	}
 	spin_unlock(&zone->lock);
 }
 
-- 
2.14.3


* [PATCH v2 2/2] free_pcppages_bulk: prefetch buddy while not holding lock
  2018-01-24 21:12           ` Mel Gorman
@ 2018-01-25  7:25             ` Aaron Lu
  0 siblings, 0 replies; 19+ messages in thread
From: Aaron Lu @ 2018-01-25  7:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Hansen, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing the buddy's
page structure, and that access can take a long time if it's cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster. Since we *always* do buddy merging and
check an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

In the meantime, there is the concern of a prefetch potentially evicting
existing cachelines. This can be true for the L1D cache since it is not huge.
Considering that the prefetch instruction used is prefetchnta, which will only
store the data in L2 for "Pentium 4 and Intel Xeon processors" according
to Intel's instruction set reference manual, it is not likely to cause
cache pollution. Other architectures may have this cache pollution problem
though.

There is also some additional instruction overhead, namely calculating the
buddy pfn twice. Since the calculation is an XOR on two local variables,
it's expected in many cases that the cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time-consuming part
under zone->lock is the wait for the 'struct page' cachelines of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   9037332        8000124       13642741       15728686
patch1/2    9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
this patch 10462292 +8.9%  8602889 +2.8% 14802073 +5.4% 17624575 +1.1%

Note: this patch's performance improvement percentage is computed against patch 1/2.

Please also note that the actual benefit of this patch will be workload/CPU
dependent.

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551
Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2:
update the changelog according to Dave Hansen and Mel Gorman's comments.
Add more comments in the code to explain why the prefetch is done, as
requested by Mel Gorman.

 mm/page_alloc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9e5ded39b16..6566a4b5b124 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1138,6 +1138,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1146,6 +1149,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock. It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
-- 
2.14.3


* RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server
  2018-01-24  2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu
  2018-01-24  2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu
  2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman
@ 2018-02-05  5:30 ` Aaron Lu
  2018-02-05  5:31   ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu
  2018-02-05  5:32   ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu
  2 siblings, 2 replies; 19+ messages in thread
From: Aaron Lu @ 2018-02-05  5:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Daniel Jordan

In addition to the two patches, there are two more patches on which I
would like to get some feedback.

The two patches are more radical: the 3rd deals with free path
zone->lock contention by avoiding doing any merging for order-0 pages,
while the 4th deals with allocation path zone->lock contention by taking
pcp->batch pages off the free_area order-0 list without needing to
iterate the list.

Both patches are developed based on the observation that "the most time
consuming part of operations under zone->lock is cache misses on struct page".

The 3rd patch may be controversial but doesn't have a correctness problem;
the 4th is in an early stage and serves only as a proof-of-concept.

Your comments are appreciated, thanks.


* [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress
  2018-02-05  5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu
@ 2018-02-05  5:31   ` Aaron Lu
  2018-02-05 22:17     ` Dave Hansen
  2018-02-05  5:32   ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu
  1 sibling, 1 reply; 19+ messages in thread
From: Aaron Lu @ 2018-02-05  5:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Daniel Jordan

Running the will-it-scale/page_fault1 process mode workload on a 2-socket
Intel Skylake server showed severe contention on zone->lock: as much as
about 80% of CPU cycles (43% on the allocation path and 38% on the free
path) are burnt spinning. According to perf, the most time-consuming part
inside that lock on the free path is cache misses on page structures,
mostly on the to-be-freed page's buddy due to merging.

One way to avoid this overhead is to not do any merging at all for order-0
pages and leave the need for high-order pages to compaction. With this
approach, the lock contention for zone->lock on the free path dropped to 4%
but the allocation side still has as high as 43% lock contention. In the
meantime, the dropped lock contention on the free side doesn't translate
into a performance increase; instead, it's consumed by increased contention
on the per-node lru_lock (which rose from 2% to 33%).

One concern with this approach is: how much impact does it have on high-order
page allocation, e.g. for order-9 pages? I ran the stress-highalloc
workload on a Haswell desktop (8 CPUs/4G memory) some time ago and it showed
a similar success rate for the vanilla kernel and the patched kernel, both at 74%.
Note that it indeed took longer to finish the test: 244s vs 218s.

1 vanilla
  Attempted allocations:          1917
  Failed allocs:                   494
  Success allocs:                 1423
  % Success:                        74
  Duration alloctest pass:        218s

2 no_merge_in_buddy
  Attempted allocations:         1917
  Failed allocs:                  497
  Success allocs:                1420
  % Success:                       74
  Duration alloctest pass:       244s

The above test was done with the default --ms-delay=100, which means there
is a delay of 100ms between page allocations. If I set the delay to 1ms,
the success rate with this patch drops to 36% while vanilla maintains
about 70%. Though a 1ms delay may not be practical, it does show the
possible impact of this patch on high-order page allocation.

The next patch deals with allocation path zone->lock contention.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/compaction.c |  28 ++++++++++++++
 mm/internal.h   |  15 +++++++-
 mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++--------------------
 3 files changed, 117 insertions(+), 42 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 10cd757f1006..b53c4d420533 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -669,6 +669,28 @@ static bool too_many_isolated(struct zone *zone)
 	return isolated > (inactive + active) / 2;
 }
 
+static int merge_page(struct zone *zone, struct page *page, unsigned long pfn)
+{
+	int order = 0;
+	unsigned long buddy_pfn = __find_buddy_pfn(pfn, order);
+	struct page *buddy = page + (buddy_pfn - pfn);
+
+	/* Only do merging if the merge skipped page's buddy is also free */
+	if (PageBuddy(buddy)) {
+		int mt = get_pageblock_migratetype(page);
+		unsigned long flags;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		if (likely(page_merge_skipped(page))) {
+			do_merge(zone, page, mt);
+			order = page_order(page);
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return order;
+}
+
 /**
  * isolate_migratepages_block() - isolate all migrate-able pages within
  *				  a single pageblock
@@ -777,6 +799,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (PageBuddy(page)) {
 			unsigned long freepage_order = page_order_unsafe(page);
+			/*
+			 * If the page didn't do merging on free time, now do
+			 * it since we are doing compaction.
+			 */
+			if (page_merge_skipped(page))
+				freepage_order = merge_page(zone, page, low_pfn);
 
 			/*
 			 * Without lock, we cannot be sure that what we got is
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..d2b0ac02d459 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -228,9 +228,22 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
-	return page_private(page);
+	return page_private(page) & ~(1 << 16);
 }
 
+/*
+ * This function returns if the page is in buddy but didn't do any merging
+ * for performance reason. This function only makes sense if PageBuddy(page)
+ * is also true. The caller should hold zone->lock for this function to return
+ * correct value, or it can handle invalid values gracefully.
+ */
+static inline bool page_merge_skipped(struct page *page)
+{
+	return PageBuddy(page) && (page->private & (1 << 16));
+}
+
+void do_merge(struct zone *zone, struct page *page, int migratetype);
+
 /*
  * Like page_order(), but for callers who cannot afford to hold the zone lock.
  * PageBuddy() should be checked first by the caller to minimize race window,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ac7fa97dd55..9497c8c5f808 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -781,49 +781,14 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
-/*
- * Freeing function for a buddy system allocator.
- *
- * The concept of a buddy system is to maintain direct-mapped table
- * (containing bit values) for memory blocks of various "orders".
- * The bottom level table contains the map for the smallest allocatable
- * units of memory (here, pages), and each level above it describes
- * pairs of units from the levels below, hence, "buddies".
- * At a high level, all that happens here is marking the table entry
- * at the bottom level available, and propagating the changes upward
- * as necessary, plus some accounting needed to play nicely with other
- * parts of the VM system.
- * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with _mapcount
- * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
- * field.
- * So when we are allocating or freeing one, we can derive the state of the
- * other.  That is, if we allocate a small block, and both were
- * free, the remainder of the region must be split into blocks.
- * If a block is freed, and its buddy is also free, then this
- * triggers coalescing into a block of larger size.
- *
- * -- nyc
- */
-
-static inline void __free_one_page(struct page *page,
-		unsigned long pfn,
-		struct zone *zone, unsigned int order,
-		int migratetype)
+static void inline __do_merge(struct page *page, unsigned int order,
+				struct zone *zone, int migratetype)
 {
+	unsigned int max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
+	unsigned long pfn = page_to_pfn(page);
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
 	struct page *buddy;
-	unsigned int max_order;
-
-	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
-
-	VM_BUG_ON(!zone_is_initialized(zone));
-	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
-	VM_BUG_ON(migratetype == -1);
-	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -879,8 +844,6 @@ static inline void __free_one_page(struct page *page,
 	}
 
 done_merging:
-	set_page_order(page, order);
-
 	/*
 	 * If this is not the largest possible page, check if the buddy
 	 * of the next-highest order is free. If it is, it's possible
@@ -905,9 +868,80 @@ static inline void __free_one_page(struct page *page,
 
 	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 out:
+	set_page_order(page, order);
 	zone->free_area[order].nr_free++;
 }
 
+void do_merge(struct zone *zone, struct page *page, int migratetype)
+{
+	VM_BUG_ON(page_order(page) != 0);
+
+	list_del(&page->lru);
+	zone->free_area[0].nr_free--;
+	rmv_page_order(page);
+
+	__do_merge(page, 0, zone, migratetype);
+}
+
+static inline bool should_skip_merge(struct zone *zone, unsigned int order)
+{
+#ifdef CONFIG_COMPACTION
+	return !zone->compact_considered && !order;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Freeing function for a buddy system allocator.
+ *
+ * The concept of a buddy system is to maintain direct-mapped table
+ * (containing bit values) for memory blocks of various "orders".
+ * The bottom level table contains the map for the smallest allocatable
+ * units of memory (here, pages), and each level above it describes
+ * pairs of units from the levels below, hence, "buddies".
+ * At a high level, all that happens here is marking the table entry
+ * at the bottom level available, and propagating the changes upward
+ * as necessary, plus some accounting needed to play nicely with other
+ * parts of the VM system.
+ * At each level, we keep a list of pages, which are heads of continuous
+ * free pages of length of (1 << order) and marked with _mapcount
+ * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
+ * field.
+ * So when we are allocating or freeing one, we can derive the state of the
+ * other.  That is, if we allocate a small block, and both were
+ * free, the remainder of the region must be split into blocks.
+ * If a block is freed, and its buddy is also free, then this
+ * triggers coalescing into a block of larger size.
+ *
+ * -- nyc
+ */
+static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
+		struct zone *zone, unsigned int order,
+		int migratetype)
+{
+	VM_BUG_ON(!zone_is_initialized(zone));
+	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+	VM_BUG_ON(migratetype == -1);
+	if (likely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+	if (should_skip_merge(zone, order)) {
+		list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+		/*
+		 * 1 << 16 set on page->private to indicate this order0
+		 * page skipped merging during free time
+		 */
+		set_page_order(page, order | (1 << 16));
+		zone->free_area[order].nr_free++;
+		return;
+	}
+
+	__do_merge(page, order, zone, migratetype);
+}
+
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
-- 
2.14.3


* [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock
  2018-02-05  5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu
  2018-02-05  5:31   ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu
@ 2018-02-05  5:32   ` Aaron Lu
  1 sibling, 0 replies; 19+ messages in thread
From: Aaron Lu @ 2018-02-05  5:32 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Daniel Jordan

A profile on an Intel Skylake server shows that the most time-consuming part
under zone->lock on the allocation path is accessing the to-be-returned
pages' struct page in rmqueue_bulk() and its child functions.

We do not really need to touch all those to-be-returned pages under
zone->lock; we only need to move them off the order-0 free_list and
adjust area->nr_free under zone->lock. Other operations on the page
structures, like rmv_page_order(page) etc., can be done outside zone->lock.

So if it's possible to know the first and the last page structure of the
pcp->batch pages in the free_list, we can achieve the above without
needing to touch any of the page structures in between. The problem is
that free pages are linked in a doubly linked list, so we only know where
the head and tail are, but not the Nth element in the list.

Assume the order-0 mt=Movable free_list has 7 pages available:
    head <-> p7 <-> p6 <-> p5 <-> p4 <-> p3 <-> p2 <-> p1

One experiment I have done here is to add a new list for it, say a
cluster list, which links every pcp->batch-th element of the free_list.

Taking pcp->batch=3 as an example, we have:

free_list:     head <-> p7 <-> p6 <-> p5 <-> p4 <-> p3 <-> p2 <-> p1
cluster_list:  head <--------> p6 <---------------> p3

Let's call p6-p4 a cluster; similarly, p3-p1 is another cluster.

Then every time rmqueue_bulk() is called to get 3 pages, we iterate the
cluster_list first. If the cluster list is not empty, we can quickly locate
the first and last page, p6 and p4 in this case (p4 is retrieved by checking
p6's next on the cluster_list and then checking p3's prev on the free_list).
This way, we avoid touching all those page structures in between under
zone->lock.

Note: a common pcp->batch value is 31 since that is the default PCP batch number.
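
The list surgery itself is only a handful of pointer updates; below is a
minimal stand-alone sketch of cutting a known [first..last] span out of a
circular doubly linked list (a simplified node type, not the kernel's
list_head or struct page, shown only to illustrate the O(1) cost):

struct node {
	struct node *prev, *next;
};

/*
 * Detach the span first..last (contiguous via ->next) from whatever list
 * it is on and make 'out' the head of a circular list containing exactly
 * that span. No node between first and last is touched, which is the
 * whole point: the cost is O(1) no matter how many pages sit in between.
 */
static void cut_span(struct node *first, struct node *last, struct node *out)
{
	/* Unlink the span from its current list. */
	first->prev->next = last->next;
	last->next->prev = first->prev;

	/* Attach the span to 'out'. */
	out->next = first;
	first->prev = out;
	out->prev = last;
	last->next = out;
}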

With this change, on a 2-socket Skylake server running the
will-it-scale/page_fault1 full load test, the zone->lock contention is gone,
lru_lock contention rose to 70% and performance increased by 16.7% compared
to vanilla.

There are some fundamental problems with this patch though:
1 When compaction occurs, the number of pages in a cluster could be less than
  predefined; this makes "1 cluster can satisfy the request" no longer true.
  For this reason, the patch currently requires that no compaction happen;
2 When new pages are freed to the order-0 free_list, they could merge with
  their buddies and that would also leave fewer pages in a cluster. Thus, no
  merging for order-0 is required for this patch to work;
3 Similarly, when fallback allocation happens, the same problem could occur again.

Considering the problems listed above, this patch can only serve as a
proof-of-concept that cache misses are the most time-consuming operation
under zone->lock on big servers. Your comments on possible ways to overcome
them are greatly appreciated.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/linux/mm_types.h |  43 ++++++++++--------
 include/linux/mmzone.h   |   7 +++
 mm/page_alloc.c          | 114 ++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 141 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cfd0ac4e5e0e..e7aee48a224a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -43,26 +43,33 @@ struct page {
 	/* First double word block */
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
-	union {
-		struct address_space *mapping;	/* If low bit clear, points to
-						 * inode address_space, or NULL.
-						 * If page mapped as anonymous
-						 * memory, low bit is set, and
-						 * it points to anon_vma object
-						 * or KSM private structure. See
-						 * PAGE_MAPPING_ANON and
-						 * PAGE_MAPPING_KSM.
-						 */
-		void *s_mem;			/* slab first object */
-		atomic_t compound_mapcount;	/* first tail page */
-		/* page_deferred_list().next	 -- second tail page */
-	};
 
-	/* Second double word */
 	union {
-		pgoff_t index;		/* Our offset within mapping. */
-		void *freelist;		/* sl[aou]b first free object */
-		/* page_deferred_list().prev	-- second tail page */
+		struct {
+			union {
+				struct address_space *mapping;	/* If low bit clear, points to
+								 * inode address_space, or NULL.
+								 * If page mapped as anonymous
+								 * memory, low bit is set, and
+								 * it points to anon_vma object
+								 * or KSM private structure. See
+								 * PAGE_MAPPING_ANON and
+								 * PAGE_MAPPING_KSM.
+								 */
+				void *s_mem;			/* slab first object */
+				atomic_t compound_mapcount;	/* first tail page */
+				/* page_deferred_list().next	 -- second tail page */
+			};
+
+			/* Second double word */
+			union {
+				pgoff_t index;		/* Our offset within mapping. */
+				void *freelist;		/* sl[aou]b first free object */
+				/* page_deferred_list().prev	-- second tail page */
+			};
+		};
+
+		struct list_head cluster;
 	};
 
 	union {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..3f1451213184 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,6 +355,12 @@ enum zone_type {
 
 #ifndef __GENERATING_BOUNDS_H
 
+struct order0_cluster {
+	struct list_head list[MIGRATE_PCPTYPES];
+	unsigned long offset[MIGRATE_PCPTYPES];
+	int batch;
+};
+
 struct zone {
 	/* Read-mostly fields */
 
@@ -459,6 +465,7 @@ struct zone {
 
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
+	struct order0_cluster   order0_cluster;
 
 	/* zone flags, see below */
 	unsigned long		flags;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9497c8c5f808..3eaafe597a66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -736,6 +736,7 @@ static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
+	BUG_ON(page->cluster.next);
 }
 
 /*
@@ -793,6 +794,9 @@ static void inline __do_merge(struct page *page, unsigned int order,
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+	/* order0 merge doesn't work yet */
+	BUG_ON(!order);
+
 continue_merging:
 	while (order < max_order - 1) {
 		buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -883,6 +887,19 @@ void do_merge(struct zone *zone, struct page *page, int migratetype)
 	__do_merge(page, 0, zone, migratetype);
 }
 
+static inline void add_to_order0_free_list(struct page *page, struct zone *zone, int mt)
+{
+	struct order0_cluster *cluster = &zone->order0_cluster;
+
+	list_add(&page->lru, &zone->free_area[0].free_list[mt]);
+
+	/* If this is the pcp->batch(th) page, link it to the cluster list */
+	if (mt < MIGRATE_PCPTYPES && !(++cluster->offset[mt] % cluster->batch)) {
+		list_add(&page->cluster, &cluster->list[mt]);
+		cluster->offset[mt] = 0;
+	}
+}
+
 static inline bool should_skip_merge(struct zone *zone, unsigned int order)
 {
 #ifdef CONFIG_COMPACTION
@@ -929,7 +946,7 @@ static inline void __free_one_page(struct page *page,
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 
 	if (should_skip_merge(zone, order)) {
-		list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+		add_to_order0_free_list(page, zone, migratetype);
 		/*
 		 * 1 << 16 set on page->private to indicate this order0
 		 * page skipped merging during free time
@@ -1732,7 +1749,10 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		if (high)
+			list_add(&page[size].lru, &area->free_list[migratetype]);
+		else
+			add_to_order0_free_list(&page[size], zone, migratetype);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1881,6 +1901,11 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
+		if (!current_order && migratetype < MIGRATE_PCPTYPES) {
+			BUG_ON(!zone->order0_cluster.offset[migratetype]);
+			BUG_ON(page->cluster.next);
+			zone->order0_cluster.offset[migratetype]--;
+		}
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
@@ -1968,8 +1993,13 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
+		if (order) {
+			list_move(&page->lru,
 			  &zone->free_area[order].free_list[migratetype]);
+		} else {
+			__list_del_entry(&page->lru);
+			add_to_order0_free_list(page, zone, migratetype);
+		}
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2118,7 +2148,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	if (current_order)
+		list_move(&page->lru, &area->free_list[start_type]);
+	else {
+		__list_del_entry(&page->lru);
+		add_to_order0_free_list(page, zone, start_type);
+	}
 }
 
 /*
@@ -2379,6 +2414,45 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 	return page;
 }
 
+static noinline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
+			unsigned long count, struct list_head *list,
+			int migratetype)
+{
+	struct list_head *cluster_head;
+	struct page *head, *tail;
+
+	cluster_head = &zone->order0_cluster.list[migratetype];
+	head = list_first_entry_or_null(cluster_head, struct page, cluster);
+	if (!head)
+		return 0;
+
+	if (head->cluster.next == cluster_head)
+		tail = list_last_entry(&zone->free_area[0].free_list[migratetype], struct page, lru);
+	else {
+		struct page *tmp = list_entry(head->cluster.next, struct page, cluster);
+		tail = list_entry(tmp->lru.prev, struct page, lru);
+	}
+
+	zone->free_area[0].nr_free -= count;
+
+	/* Remove the page from the cluster list */
+	list_del(&head->cluster);
+	/* Restore the two page fields */
+	head->cluster.next = head->cluster.prev = NULL;
+
+	/* Take the pcp->batch pages off free_area list */
+	tail->lru.next->prev = head->lru.prev;
+	head->lru.prev->next = tail->lru.next;
+
+	/* Attach them to list */
+	head->lru.prev = list;
+	list->next = &head->lru;
+	tail->lru.next = list;
+	list->prev = &tail->lru;
+
+	return 1;
+}
+
 /*
  * Obtain a specified number of elements from the buddy allocator, all under
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
@@ -2391,6 +2465,28 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	int i, alloced = 0;
 
 	spin_lock(&zone->lock);
+	if (count == zone->order0_cluster.batch &&
+	    rmqueue_bulk_cluster(zone, order, count, list, migratetype)) {
+		struct page *page, *tmp;
+		spin_unlock(&zone->lock);
+
+		i = alloced = count;
+		list_for_each_entry_safe(page, tmp, list, lru) {
+			rmv_page_order(page);
+			set_pcppage_migratetype(page, migratetype);
+
+			if (unlikely(check_pcp_refill(page))) {
+				list_del(&page->lru);
+				alloced--;
+				continue;
+			}
+			if (is_migrate_cma(get_pcppage_migratetype(page)))
+				__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
+						-(1 << order));
+		}
+		goto done_alloc;
+	}
+
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
@@ -2415,7 +2511,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
+	spin_unlock(&zone->lock);
 
+done_alloc:
 	/*
 	 * i pages were removed from the buddy list even if some leak due
 	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
@@ -2423,7 +2521,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
 	return alloced;
 }
 
@@ -5451,6 +5548,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		if (!order && t < MIGRATE_PCPTYPES) {
+			INIT_LIST_HEAD(&zone->order0_cluster.list[t]);
+			zone->order0_cluster.offset[t] = 0;
+		}
 	}
 }
 
@@ -5488,6 +5589,9 @@ static int zone_batchsize(struct zone *zone)
 	 * and the other with pages of the other colors.
 	 */
 	batch = rounddown_pow_of_two(batch + batch/2) - 1;
+	if (batch < 1)
+		batch = 1;
+	zone->order0_cluster.batch = batch;
 
 	return batch;
 
-- 
2.14.3


* Re: [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress
  2018-02-05  5:31   ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu
@ 2018-02-05 22:17     ` Dave Hansen
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2018-02-05 22:17 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan

On 02/04/2018 09:31 PM, Aaron Lu wrote:
> Running the will-it-scale/page_fault1 process mode workload on a 2-socket
> Intel Skylake server showed severe contention on zone->lock: as much as
> about 80% of CPU cycles (43% on the allocation path and 38% on the free
> path) are burnt spinning. According to perf, the most time-consuming part
> inside that lock on the free path is cache misses on page structures,
> mostly on the to-be-freed page's buddy due to merging.
> 
> One way to avoid this overhead is to not do any merging at all for order-0
> pages and leave the need for high-order pages to compaction.

I think the RFC here is: we *know* this hurts high-order allocations and
Aaron demonstrated that it does make the latency worse.  But,
unexpectedly, it didn't totally crater them.

So, is the harm to large allocations worth the performance benefit
afforded to smaller ones by this patch?  How would we make a decision on
something like that?

If nothing else, this would make a nice companion topic to Daniel
Jordan's "lru_lock scalability" proposal for LSF/MM.


* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-01-25  7:21   ` [PATCH v2 " Aaron Lu
@ 2018-02-15 12:06     ` Mel Gorman
  2018-02-23  1:37       ` Aaron Lu
  2018-02-15 12:46     ` Matthew Wilcox
  1 sibling, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2018-02-15 12:06 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> zone->lock is held and then pages are chosen from the PCP's migratetype
> list. There is actually no need to do this 'choose' part under the
> lock: since these are PCP pages, the only CPU that can touch them is
> the current one and IRQs are also disabled.
> 
> Moving this part outside the lock reduces lock hold time and improves
> performance. Test with will-it-scale/page_fault1 full load:
> 
> kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.15-rc4   9037332        8000124       13642741       15728686
> this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
> 
> What the test does is: start $nr_cpu processes, each of which repeatedly
> does the following for 5 minutes:
> 1 mmap 128M of anonymous space;
> 2 write to that space;
> 3 munmap.
> The score is the aggregated number of iterations.
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

It looks like this series may have gotten lost because it was embedded
within an existing thread, or else it was the proximity to the merge
window. I suggest a rebase, retest and resubmit unless there was some
major objection that I missed. Patch 1 is fine by me at least. I never
explicitly acked patch 2 but I've no major objection to it; I'm just a tad
uncomfortable with prefetch magic sauce in general.

-- 
Mel Gorman
SUSE Labs
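
The "prefetch" in patch 2 refers to touching the to-be-freed page's buddy
before taking zone->lock. A sketch of that kind of helper, whose details
are assumptions here rather than the submitted code:

/*
 * While zone->lock is not yet held, compute where the order-0 buddy's
 * struct page lives and prefetch it, so the merge done later under the
 * lock takes fewer cache misses.
 */
static inline void prefetch_buddy(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
	struct page *buddy = page + (buddy_pfn - pfn);

	prefetch(buddy);
}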

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-01-25  7:21   ` [PATCH v2 " Aaron Lu
  2018-02-15 12:06     ` Mel Gorman
@ 2018-02-15 12:46     ` Matthew Wilcox
  2018-02-15 14:55       ` Mel Gorman
  2018-02-23  1:42       ` Aaron Lu
  1 sibling, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2018-02-15 12:46 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko,
	Vlastimil Babka

On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.

I have no objection to this patch.  If you're looking for ideas for
future improvement though, I wonder whether using a LIST_HEAD is the
best way to store these pages temporarily.  If you batch them into a
pagevec and then free the entire pagevec, the CPU should be a little
faster scanning a short array than walking a linked list.

It would also put a hard boundary on how long zone->lock is held, as
you'd drop it and go back for another batch after 15 pages.  That might
be bad, of course.

Another minor change I'd like to see is free_pcppages_bulk updating
pcp->count itself; all of the callers do it currently.
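
A rough sketch of the array-based alternative Matthew describes, with a
fixed batch of 15 entries and pageblock-isolation handling omitted for
brevity (both assumptions of this sketch, not a proposed patch):

#define PCP_FREE_BATCH	15

/* Free a small batch of order-0 PCP pages under one lock acquisition. */
static void free_pcp_array(struct zone *zone, struct page **pages, int nr)
{
	int i;

	spin_lock(&zone->lock);
	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];
		int mt = get_pcppage_migratetype(page);

		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
		trace_mm_page_pcpu_drain(page, 0, mt);
	}
	spin_unlock(&zone->lock);
}

The caller would fill pages[] from the PCP lists with irqs disabled and
call this repeatedly; that is where the extra lock round-trips per batch
come from.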

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-02-15 12:46     ` Matthew Wilcox
@ 2018-02-15 14:55       ` Mel Gorman
  2018-02-23  1:42       ` Aaron Lu
  1 sibling, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2018-02-15 14:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko,
	Vlastimil Babka

On Thu, Feb 15, 2018 at 04:46:44AM -0800, Matthew Wilcox wrote:
> On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> 
> I have no objection to this patch.  If you're looking for ideas for
> future improvement though, I wonder whether using a LIST_HEAD is the
> best way to store these pages temporarily.  If you batch them into a
> pagevec and then free the entire pagevec, the CPU should be a little
> faster scanning a short array than walking a linked list.
> 
> It would also put a hard boundary on how long zone->lock is held, as
> you'd drop it and go back for another batch after 15 pages.  That might
> be bad, of course.
> 

It's not a guaranteed win. You're trading a list traversal for increased
stack usage of 128 bytes (unless you stick a static one in the pgdat and
incur a cache-miss penalty instead, or use a per-cpu pagevec and increase
memory consumption) and two spin-lock acquire/releases in the common case,
which may or may not be contended. It might make more sense if the PCPs
themselves were statically sized, but that would limit tuning options and
increase the per-cpu footprint of the pcp structures.

Maybe I'm missing something obvious and it really would be a universal
win but right now I find it hard to imagine that it's a win.
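
(For reference: 15 page pointers at 8 bytes each plus an 8-byte count come
to 15 * 8 + 8 = 128 bytes on a 64-bit kernel, which is where the figure
above comes from, assuming a pagevec-like layout.)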

> Another minor change I'd like to see is free_pcppages_bulk updating
> pcp->count itself; all of the callers do it currently.
> 

That should be reasonable, it's not even particularly difficult.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-02-15 12:06     ` Mel Gorman
@ 2018-02-23  1:37       ` Aaron Lu
  0 siblings, 0 replies; 19+ messages in thread
From: Aaron Lu @ 2018-02-23  1:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Thu, Feb 15, 2018 at 12:06:08PM +0000, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> > 
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> > 
> > kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.15-rc4   9037332        8000124       13642741       15728686
> > this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
> > 
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymous space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration count.
> > 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> > 
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> 
> It looks like this series may have gotten lost because it was embedded
> within an existing thread or else it was the proximity to the merge
> window. I suggest a rebase, retest and resubmit unless there was some
> major objection that I missed. Patch 1 is fine by me at least. I never
> explicitly acked patch 2 but I've no major objection to it, just am a tad
> uncomfortable with prefetch magic sauce in general.

Thanks for the suggestion.
I just got back from vacation and will send it out once I have collected
all the required data.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free
  2018-02-15 12:46     ` Matthew Wilcox
  2018-02-15 14:55       ` Mel Gorman
@ 2018-02-23  1:42       ` Aaron Lu
  1 sibling, 0 replies; 19+ messages in thread
From: Aaron Lu @ 2018-02-23  1:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko,
	Vlastimil Babka

On Thu, Feb 15, 2018 at 04:46:44AM -0800, Matthew Wilcox wrote:
> On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> 
> I have no objection to this patch.  If you're looking for ideas for
> future improvement though, I wonder whether using a LIST_HEAD is the
> best way to store these pages temporarily.  If you batch them into a
> pagevec and then free the entire pagevec, the CPU should be a little
> faster scanning a short array than walking a linked list.

Thanks for the suggestion.

> It would also put a hard boundary on how long zone->lock is held, as
> you'd drop it and go back for another batch after 15 pages.  That might
> be bad, of course.

Yes, that's a concern.
As Mel responded in another email, I think I'll just keep using a list
here.

> 
> Another minor change I'd like to see is free_pcppages_bulk updating
> pcp->count itself; all of the callers do it currently.

Sounds good, I'll prepare a separate patch for this, thanks!
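
A sketch of that follow-up (illustrative; the eventual patch may differ):
have free_pcppages_bulk() adjust pcp->count itself so callers no longer
need to do it after the call.

static void free_pcppages_bulk(struct zone *zone, int count,
			       struct per_cpu_pages *pcp)
{
	/* ... pick 'count' pages off pcp->lists and free them ... */

	/* Done here so callers can drop their own "pcp->count -= count". */
	pcp->count -= count;
}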

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-02-23  1:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-24  2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu
2018-01-24  2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu
2018-01-24 16:43   ` Mel Gorman
2018-01-24 16:57     ` Dave Hansen
2018-01-24 18:19       ` Mel Gorman
2018-01-24 19:23         ` Dave Hansen
2018-01-24 21:12           ` Mel Gorman
2018-01-25  7:25             ` [PATCH v2 " Aaron Lu
2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman
2018-01-25  7:21   ` [PATCH v2 " Aaron Lu
2018-02-15 12:06     ` Mel Gorman
2018-02-23  1:37       ` Aaron Lu
2018-02-15 12:46     ` Matthew Wilcox
2018-02-15 14:55       ` Mel Gorman
2018-02-23  1:42       ` Aaron Lu
2018-02-05  5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu
2018-02-05  5:31   ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu
2018-02-05 22:17     ` Dave Hansen
2018-02-05  5:32   ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu
