* [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
@ 2018-03-20  8:54 Aaron Lu
  2018-03-20  8:54 ` [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy Aaron Lu
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20  8:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

This series is meant to improve zone->lock scalability for order-0 pages.
With the will-it-scale/page_fault1 workload on a 2-socket Intel Skylake
server with 112 CPUs, CPUs spend about 80% of their time spinning on
zone->lock. The perf profile shows the most time-consuming part under
zone->lock is the cache misses on "struct page", so here I'm trying to
avoid those cache misses.

Patch 1/4 adds wrapper functions for adding/removing a page to/from
buddy and has no functional changes.

Patch 2/4 skips merging for order-0 pages to avoid cache misses on the
buddy's "struct page". On a 2-socket Intel Skylake server, this has a
very good effect on the free path for will-it-scale/page_fault1 at full
load: it reduced zone->lock contention on the free path from 35% to
1.1%. It also shows good results for the parallel free(*) workload,
reducing zone->lock contention from 90% to almost zero (lru_lock
contention rose from almost 0 to 90% though).
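
A minimal sketch of the changed free path (simplified from patch 2/4
below; the VM_BUG_ONs and freepage accounting are omitted):

	static inline void __free_one_page(struct page *page, unsigned long pfn,
			struct zone *zone, unsigned int order, int migratetype)
	{
		if (can_skip_merge(zone, order)) {
			/* no buddy lookup, so no cache miss on the buddy's struct page */
			add_to_buddy_head(page, zone, 0, migratetype);
			set_page_merge_skipped(page);
			return;
		}

		/* higher orders, or compaction failed: merge as before */
		do_merge(page, pfn, zone, order, migratetype);
	}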

Patch 3/4 deals with allocation-path zone->lock contention by not
touching pages on the free_list one by one inside zone->lock. Together
with patch 2/4, zone->lock contention is entirely eliminated for
will-it-scale/page_fault1 at full load. However, this patch adds some
overhead to manage clusters on the free path, which hurts the parallel
free workload: its zone->lock contention increased from almost 0 to
25%.
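
The core of the clustered allocation looks roughly like this (heavily
simplified from patch 3/4 below; cluster-array bookkeeping and the
in_progress synchronization with compaction are omitted):

	/* inside zone->lock: detach a whole cluster, no per-page access */
	head_page = list_first_entry_or_null(head, struct page, lru);
	c = head_page->cluster;
	list_cut_position(&tmp_list, head, &c->tail->lru);
	list_splice_tail(&tmp_list, list);
	zone->free_area[0].nr_free -= c->nr;
	spin_unlock(&zone->lock);

	/* outside zone->lock: now touch the individual struct pages */
	list_for_each_entry(page, list, lru) {
		rmv_page_order(page);
		page->cluster = NULL;
		set_pcppage_migratetype(page, migratetype);
	}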

Patch 4/4 is a free-path optimization of the cluster operation. It
reduces the number of times add_to_cluster() has to be called and
restores performance for the parallel free workload by bringing
zone->lock contention back down to almost 0%.

The good thing about this patchset is that it eliminates zone->lock
contention for will-it-scale/page_fault1 and parallel free on big
servers (the contention shifts to lru_lock). The bad things are:
 - it adds some overhead to the compaction path, which now has to do
   the merging for those merge-skipped order-0 pages;
 - it is unfriendly to high-order page allocation since we no longer
   merge order-0 pages.

To see how much effect it has on compaction, mmtests/stress-highalloc
was run on a desktop machine with 8 CPUs and 4G memory.
(mmtests/stress-highalloc: make N copies of the kernel tree and start
building them to consume almost all memory with reclaimable file page
cache. This page cache will not be returned to buddy, which effectively
makes it a worst case for a high-order page workload. Then, after 5
minutes, start allocating X order-9 pages to see how well compaction
works.)

With a delay of 100ms between allocations:
kernel   success_rate  average_time_of_alloc_one_hugepage
base           58%       3.95927e+06 ns
patch2/4       58%       5.45935e+06 ns
patch4/4       57%       6.59174e+06 ns

With a delay of 1ms between allocations:
kernel   success_rate  average_time_of_alloc_one_hugepage
base           53%       3.17362e+06 ns
patch2/4       44%       2.31637e+06 ns
patch4/4       59%       2.73029e+06 ns

Comparing patch4/4's result with base, I think it performed OK.
This is probably because compaction is a heavy job, so the added
overhead doesn't matter much.

To see how much effect it has on workloads that use hugepages, I did
the following test on a 2-socket Intel Skylake with 112 CPUs/64G memory:
1 Break all high-order pages by starting a program that consumes almost
  all memory with anonymous pages and then exits. This creates an
  extremely bad case for this patchset compared to vanilla, which
  always does merging;
2 Start 56 processes of will-it-scale/page_fault1 that use hugepages
  through madvise(MADV_HUGEPAGE) (a minimal sketch of that call follows
  this list). To make things worse for this patchset, start another 56
  processes of will-it-scale/page_fault1 that use order-0 pages to
  continually cause trouble for the 56 THP users. Let them run for
  5 minutes.
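
The madvise(MADV_HUGEPAGE) usage boils down to something like this (a
minimal user-space sketch, not the actual will-it-scale source; the
mapping size and error handling are illustrative):

	#include <sys/mman.h>
	#include <string.h>

	#define LEN	(512UL << 20)	/* 512M of anonymous memory */

	int main(void)
	{
		char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, LEN, MADV_HUGEPAGE);	/* ask for THP backing */
		memset(p, 1, LEN);		/* fault the pages in */
		return 0;
	}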

Score results (higher is better):

kernel      order0           THP
base        1522246        10540254
patch2/4    5266247 +246%   3309816 -69%
patch4/4    2234073 +47%    9610295 -8.8%

TBH, I'm not sure if the tests above are good enough to expose the
problems of this patchset. So if you have any thoughts on this patchset,
please feel free to let me know, thanks.

(*) Parallel free is a workload I use to see how fast a large VMA can
be freed in parallel. I tested this on a 4-socket Intel Skylake machine
with 768G memory. The test program does a 512G anonymous memory
allocation with mmap() and then exits, to see how fast the exit can be.
The parallel freeing is implemented inside the kernel and has been
posted before:
http://lkml.kernel.org/r/1489568404-7817-1-git-send-email-aaron.lu@intel.com
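
The test program itself is roughly the following (a sketch based on the
description above; the parallelism is in the kernel's exit/unmap path,
not in this program):

	#include <sys/mman.h>
	#include <string.h>

	#define SIZE	(512UL << 30)	/* 512G anonymous mapping */

	int main(void)
	{
		char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, SIZE);	/* populate the whole VMA */
		return 0;		/* exit; this is the part being timed */
	}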

A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux zone_lock_rfc_v2

v2:
Rewrote the allocation-path optimization; compaction can happen now and
there are no crashes that I'm aware of.

v1 is here:
https://lkml.kernel.org/r/20180205053013.GB16980@intel.com

Aaron Lu (4):
  mm/page_alloc: use helper functions to add/remove a page to/from buddy
  mm/__free_one_page: skip merge for order-0 page unless compaction
    failed
  mm/rmqueue_bulk: alloc without touching individual page structure
  mm/free_pcppages_bulk: reduce overhead of cluster operation on free
    path

 Documentation/vm/struct_page_field |  10 +
 include/linux/mm_types.h           |   3 +
 include/linux/mmzone.h             |  35 +++
 mm/compaction.c                    |  17 +-
 mm/internal.h                      |  61 +++++
 mm/page_alloc.c                    | 488 +++++++++++++++++++++++++++++++++----
 6 files changed, 563 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/vm/struct_page_field

-- 
2.14.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
@ 2018-03-20  8:54 ` Aaron Lu
  2018-03-20 11:35   ` Vlastimil Babka
  2018-03-20  8:54 ` [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed Aaron Lu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-20  8:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

There are multiple places that add/remove a page to/from buddy;
introduce helper functions for them.

This also makes it easier to add code when a page is added to or
removed from buddy.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3db9cfb2265b..3cdf1e10d412 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -736,12 +736,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
 	__SetPageBuddy(page);
 }
 
+static inline void add_to_buddy_common(struct page *page, struct zone *zone,
+					unsigned int order, int mt)
+{
+	set_page_order(page, order);
+	zone->free_area[order].nr_free++;
+}
+
+static inline void add_to_buddy_head(struct page *page, struct zone *zone,
+					unsigned int order, int mt)
+{
+	add_to_buddy_common(page, zone, order, mt);
+	list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
+static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
+					unsigned int order, int mt)
+{
+	add_to_buddy_common(page, zone, order, mt);
+	list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 }
 
+static inline void remove_from_buddy(struct page *page, struct zone *zone,
+					unsigned int order)
+{
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+}
+
 /*
  * This function checks whether a page is free && is the buddy
  * we can do coalesce a page and its buddy if
@@ -845,13 +874,10 @@ static inline void __free_one_page(struct page *page,
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
 		 */
-		if (page_is_guard(buddy)) {
+		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
-		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
-			rmv_page_order(buddy);
-		}
+		else
+			remove_from_buddy(buddy, zone, order);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -883,8 +909,6 @@ static inline void __free_one_page(struct page *page,
 	}
 
 done_merging:
-	set_page_order(page, order);
-
 	/*
 	 * If this is not the largest possible page, check if the buddy
 	 * of the next-highest order is free. If it is, it's possible
@@ -901,15 +925,12 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
-			goto out;
+			add_to_buddy_tail(page, zone, order, migratetype);
+			return;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
-	zone->free_area[order].nr_free++;
+	add_to_buddy_head(page, zone, order, migratetype);
 }
 
 /*
@@ -1731,9 +1752,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
-		area->nr_free++;
-		set_page_order(&page[size], high);
+		add_to_buddy_head(&page[size], zone, high, migratetype);
 	}
 }
 
@@ -1877,9 +1896,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 							struct page, lru);
 		if (!page)
 			continue;
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		remove_from_buddy(page, zone, current_order);
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
@@ -2795,9 +2812,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
-	rmv_page_order(page);
+	remove_from_buddy(page, zone, order);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -7886,9 +7901,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
+		remove_from_buddy(page, zone, order);
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
  2018-03-20  8:54 ` [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy Aaron Lu
@ 2018-03-20  8:54 ` Aaron Lu
  2018-03-20 11:45   ` Vlastimil Babka
  2018-03-20 22:58   ` Figo.zhang
  2018-03-20  8:54 ` [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure Aaron Lu
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20  8:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

Running the will-it-scale/page_fault1 process-mode workload on a
2-socket Intel Skylake server showed severe zone->lock contention: as
much as about 80% of CPU cycles (42% on the allocation path and 35% on
the free path) are burnt spinning. According to perf, the most
time-consuming part inside that lock on the free path is cache misses
on page structures, mostly on the to-be-freed page's buddy due to
merging.

One way to avoid this overhead is to not do any merging at all for
order-0 pages. With this approach, zone->lock contention on the free
path dropped to 1.1%, but the allocation side still has lock contention
as high as 42%. Meanwhile, the reduced lock contention on the free side
doesn't translate into a performance increase; instead, it is consumed
by increased contention on the per-node lru_lock (which rose from 5%
to 37%) and the final performance dropped slightly, by about 1%.

Though performance dropped a little, this almost eliminates zone->lock
contention on the free path and is the foundation for the next patch,
which eliminates zone->lock contention on the allocation path.

A new document file called "struct_page_field" is added to explain
the newly reused field in "struct page".

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 Documentation/vm/struct_page_field |  5 +++
 include/linux/mm_types.h           |  1 +
 mm/compaction.c                    | 13 +++++-
 mm/internal.h                      | 27 ++++++++++++
 mm/page_alloc.c                    | 89 +++++++++++++++++++++++++++++++++-----
 5 files changed, 122 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/vm/struct_page_field

diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
new file mode 100644
index 000000000000..1ab6c19ccc7a
--- /dev/null
+++ b/Documentation/vm/struct_page_field
@@ -0,0 +1,5 @@
+buddy_merge_skipped:
+Used to indicate this page skipped merging when added to buddy. This
+field only makes sense if the page is in Buddy and is order zero.
+It's a bug if any higher order pages in Buddy has this field set.
+Shares space with index.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..7edc4e102a8e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -91,6 +91,7 @@ struct page {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* sl[aou]b first free object */
 		/* page_deferred_list().prev	-- second tail page */
+		bool buddy_merge_skipped; /* skipped merging when added to buddy */
 	};
 
 	union {
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c8999d027ab..fb9031fdca41 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -776,8 +776,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * potential isolation targets.
 		 */
 		if (PageBuddy(page)) {
-			unsigned long freepage_order = page_order_unsafe(page);
+			unsigned long freepage_order;
 
+			/*
+			 * If this is a merge_skipped page, do merge now
+			 * since high-order pages are needed. zone lock
+			 * isn't taken for the merge_skipped check so the
+			 * check could be wrong but the worst case is we
+			 * lose a merge opportunity.
+			 */
+			if (page_merge_was_skipped(page))
+				try_to_merge_page(page);
+
+			freepage_order = page_order_unsafe(page);
 			/*
 			 * Without lock, we cannot be sure that what we got is
 			 * a valid page order. Consider only values in the
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..2bfbaae2d835 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,4 +538,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 }
 
 void setup_zone_pageset(struct zone *zone);
+
+static inline bool page_merge_was_skipped(struct page *page)
+{
+	return page->buddy_merge_skipped;
+}
+
+void try_to_merge_page(struct page *page);
+
+#ifdef CONFIG_COMPACTION
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+	/* Compaction has failed in this zone, we shouldn't skip merging */
+	if (zone->compact_considered)
+		return false;
+
+	/* Only consider no_merge for order 0 pages */
+	if (order)
+		return false;
+
+	return true;
+}
+#else /* CONFIG_COMPACTION */
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+	return false;
+}
+#endif  /* CONFIG_COMPACTION */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3cdf1e10d412..eb78014dfbde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,6 +730,16 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order, int migratetype) {}
 #endif
 
+static inline void set_page_merge_skipped(struct page *page)
+{
+	page->buddy_merge_skipped = true;
+}
+
+static inline void clear_page_merge_skipped(struct page *page)
+{
+	page->buddy_merge_skipped = false;
+}
+
 static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
@@ -739,6 +749,13 @@ static inline void set_page_order(struct page *page, unsigned int order)
 static inline void add_to_buddy_common(struct page *page, struct zone *zone,
 					unsigned int order, int mt)
 {
+	/*
+	 * Always clear buddy_merge_skipped when added to buddy because
+	 * buddy_merge_skipped shares space with index and index could
+	 * be used as migratetype for PCP pages.
+	 */
+	clear_page_merge_skipped(page);
+
 	set_page_order(page, order);
 	zone->free_area[order].nr_free++;
 }
@@ -769,6 +786,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
+	clear_page_merge_skipped(page);
 }
 
 /*
@@ -839,7 +857,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  * -- nyc
  */
 
-static inline void __free_one_page(struct page *page,
+static inline void do_merge(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
@@ -851,16 +869,6 @@ static inline void __free_one_page(struct page *page,
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
-	VM_BUG_ON(!zone_is_initialized(zone));
-	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
-	VM_BUG_ON(migratetype == -1);
-	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
-
-	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
-	VM_BUG_ON_PAGE(bad_range(zone, page), page);
-
 continue_merging:
 	while (order < max_order - 1) {
 		buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -933,6 +941,61 @@ static inline void __free_one_page(struct page *page,
 	add_to_buddy_head(page, zone, order, migratetype);
 }
 
+void try_to_merge_page(struct page *page)
+{
+	unsigned long pfn, buddy_pfn, flags;
+	struct page *buddy;
+	struct zone *zone;
+
+	/*
+	 * No need to do merging if buddy is not free.
+	 * zone lock isn't taken so this could be wrong but worst case
+	 * is we lose a merge opportunity.
+	 */
+	pfn = page_to_pfn(page);
+	buddy_pfn = __find_buddy_pfn(pfn, 0);
+	buddy = page + (buddy_pfn - pfn);
+	if (!PageBuddy(buddy))
+		return;
+
+	zone = page_zone(page);
+	spin_lock_irqsave(&zone->lock, flags);
+	/* Verify again after taking the lock */
+	if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
+		   PageBuddy(buddy))) {
+		int mt = get_pageblock_migratetype(page);
+
+		remove_from_buddy(page, zone, 0);
+		do_merge(page, pfn, zone, 0, mt);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
+		struct zone *zone, unsigned int order,
+		int migratetype)
+{
+	VM_BUG_ON(!zone_is_initialized(zone));
+	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+	VM_BUG_ON(migratetype == -1);
+	if (likely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
+	VM_BUG_ON_PAGE(bad_range(zone, page), page);
+
+	if (can_skip_merge(zone, order)) {
+		add_to_buddy_head(page, zone, 0, migratetype);
+		set_page_merge_skipped(page);
+		return;
+	}
+
+	do_merge(page, pfn, zone, order, migratetype);
+}
+
+
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -1183,8 +1246,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			 * can be offset by reduced memory latency later. To
 			 * avoid excessive prefetching due to large count, only
 			 * prefetch buddy for the last pcp->batch nr of pages.
+			 *
+			 * If merge can be skipped, no need to prefetch buddy.
 			 */
-			if (count > pcp->batch)
+			if (can_skip_merge(zone, 0) || count > pcp->batch)
 				continue;
 			pfn = page_to_pfn(page);
 			buddy_pfn = __find_buddy_pfn(pfn, 0);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
  2018-03-20  8:54 ` [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy Aaron Lu
  2018-03-20  8:54 ` [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed Aaron Lu
@ 2018-03-20  8:54 ` Aaron Lu
  2018-03-20 22:29   ` Figo.zhang
  2018-03-21 12:55   ` Vlastimil Babka
  2018-03-20  8:54 ` [RFC PATCH v2 4/4] mm/free_pcppages_bulk: reduce overhead of cluster operation on free path Aaron Lu
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20  8:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

Profiling on an Intel Skylake server shows the most time-consuming part
under zone->lock on the allocation path is accessing the to-be-returned
pages' "struct page" on the free_list inside zone->lock. One
explanation is that different CPUs release pages to the head of the
free_list, so those pages' 'struct page' may very well be cache cold
for the allocating CPU when it grabs them from the free_list's head.
The purpose here is to avoid touching these pages one by one inside
zone->lock.

One idea is to take the requested number of pages off the free_list
with something like list_cut_position() and adjust free_area's nr_free
accordingly inside zone->lock, while other operations, like clearing
the PageBuddy flag of these pages, are done outside of zone->lock.

list_cut_position() needs to know where to cut; that is what the new
'struct cluster' is meant to provide. All pages on the order-0
free_list belong to a cluster, so when a number of pages is needed, the
cluster to which the head page of the free_list belongs is looked up
and the tail page of that cluster can be found. With the tail page,
list_cut_position() can be used to drop the whole cluster off the
free_list. 'struct cluster' also has 'nr' to record how many pages the
cluster holds, so free_area's nr_free can be adjusted inside the lock
too.

This creates a race window though: from the moment zone->lock is
dropped until these pages' PageBuddy flags get cleared, the pages are
no longer in buddy but still have the PageBuddy flag set.

This doesn't cause problems for users that access buddy pages through
the free_list. But there are other users, like move_freepages(), which
is used in the fallback allocation path to move a pageblock's pages
from one migratetype to another, that test the PageBuddy flag of a page
derived from a PFN. The end result could be that pages in the race
window get moved back to the free_list of another migratetype. For this
reason, a synchronization function zone_wait_cluster_alloc() is
introduced to wait until all such pages are in the correct state. It is
meant to be called with zone->lock held, so after it returns we do not
need to worry about new pages entering the racy state.

Another user is compaction, which scans a pageblock for migratable
candidates. In this process, pages derived from PFNs are checked for
the PageBuddy flag to decide whether they are merge-skipped pages. To
avoid a racy page getting merged back into buddy, the
zone_wait_and_disable_cluster_alloc() function is introduced to:
1 disable clustered allocation by increasing zone->cluster.disable_depth;
2 wait till the race window passes by calling zone_wait_cluster_alloc().
This function is also meant to be called with zone->lock held, so after
it returns all pages are in the correct state and no more cluster
allocation will be attempted until zone_enable_cluster_alloc() is
called to decrease zone->cluster.disable_depth.

Together, the two patches eliminate zone->lock contention entirely, but
at the same time pgdat->lru_lock contention rose to 82%. Final
performance increased by about 8.3%.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 Documentation/vm/struct_page_field |   5 +
 include/linux/mm_types.h           |   2 +
 include/linux/mmzone.h             |  35 +++++
 mm/compaction.c                    |   4 +
 mm/internal.h                      |  34 +++++
 mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++++--
 6 files changed, 360 insertions(+), 8 deletions(-)

diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
index 1ab6c19ccc7a..bab738ea4e0a 100644
--- a/Documentation/vm/struct_page_field
+++ b/Documentation/vm/struct_page_field
@@ -3,3 +3,8 @@ Used to indicate this page skipped merging when added to buddy. This
 field only makes sense if the page is in Buddy and is order zero.
 It's a bug if any higher order pages in Buddy has this field set.
 Shares space with index.
+
+cluster:
+Order 0 Buddy pages are grouped in cluster on free_list to speed up
+allocation. This field stores the cluster pointer for them.
+Shares space with mapping.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7edc4e102a8e..49fe9d755a7c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,8 @@ struct page {
 		void *s_mem;			/* slab first object */
 		atomic_t compound_mapcount;	/* first tail page */
 		/* page_deferred_list().next	 -- second tail page */
+
+		struct cluster *cluster;	/* order 0 cluster this page belongs to */
 	};
 
 	/* Second double word */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..09ba9d3cc385 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,6 +355,40 @@ enum zone_type {
 
 #ifndef __GENERATING_BOUNDS_H
 
+struct cluster {
+	struct page     *tail;  /* tail page of the cluster */
+	int             nr;     /* how many pages are in this cluster */
+};
+
+struct order0_cluster {
+	/* order 0 cluster array, dynamically allocated */
+	struct cluster *array;
+	/*
+	 * order 0 cluster array length, also used to indicate if cluster
+	 * allocation is enabled for this zone(cluster allocation is disabled
+	 * for small zones whose batch size is smaller than 1, like DMA zone)
+	 */
+	int             len;
+	/*
+	 * smallest position from where we search for an
+	 * empty cluster from the cluster array
+	 */
+	int		zero_bit;
+	/* bitmap used to quickly locate an empty cluster from cluster array */
+	unsigned long   *bitmap;
+
+	/* disable cluster allocation to avoid new pages becoming racy state. */
+	unsigned long	disable_depth;
+
+	/*
+	 * used to indicate if there are pages allocated in cluster mode
+	 * still in racy state. Caller with zone->lock held could use helper
+	 * function zone_wait_cluster_alloc() to wait all such pages to exit
+	 * the race window.
+	 */
+	atomic_t        in_progress;
+};
+
 struct zone {
 	/* Read-mostly fields */
 
@@ -459,6 +493,7 @@ struct zone {
 
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
+	struct order0_cluster	cluster;
 
 	/* zone flags, see below */
 	unsigned long		flags;
diff --git a/mm/compaction.c b/mm/compaction.c
index fb9031fdca41..e71fa82786a1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1601,6 +1601,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 
 	migrate_prep_local();
 
+	zone_wait_and_disable_cluster_alloc(zone);
+
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
 		int err;
 
@@ -1699,6 +1701,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 			zone->compact_cached_free_pfn = free_pfn;
 	}
 
+	zone_enable_cluster_alloc(zone);
+
 	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
 	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
 
diff --git a/mm/internal.h b/mm/internal.h
index 2bfbaae2d835..1b0535af1b49 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -557,12 +557,46 @@ static inline bool can_skip_merge(struct zone *zone, int order)
 	if (order)
 		return false;
 
+	/*
+	 * Clustered allocation is only disabled when high-order pages
+	 * are needed, e.g. in compaction and CMA alloc, so we should
+	 * also skip merging in that case.
+	 */
+	if (zone->cluster.disable_depth)
+		return false;
+
 	return true;
 }
+
+static inline void zone_wait_cluster_alloc(struct zone *zone)
+{
+	while (atomic_read(&zone->cluster.in_progress))
+		cpu_relax();
+}
+
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->cluster.disable_depth++;
+	zone_wait_cluster_alloc(zone);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void zone_enable_cluster_alloc(struct zone *zone)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->cluster.disable_depth--;
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 #else /* CONFIG_COMPACTION */
 static inline bool can_skip_merge(struct zone *zone, int order)
 {
 	return false;
 }
+static inline void zone_wait_cluster_alloc(struct zone *zone) {}
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) {}
+static inline void zone_enable_cluster_alloc(struct zone *zone) {}
 #endif  /* CONFIG_COMPACTION */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb78014dfbde..ac93833a2877 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -746,6 +746,82 @@ static inline void set_page_order(struct page *page, unsigned int order)
 	__SetPageBuddy(page);
 }
 
+static inline struct cluster *new_cluster(struct zone *zone, int nr,
+						struct page *tail)
+{
+	struct order0_cluster *cluster = &zone->cluster;
+	int n = find_next_zero_bit(cluster->bitmap, cluster->len, cluster->zero_bit);
+	if (n == cluster->len) {
+		printk_ratelimited("node%d zone %s cluster used up\n",
+				zone->zone_pgdat->node_id, zone->name);
+		return NULL;
+	}
+	cluster->zero_bit = n;
+	set_bit(n, cluster->bitmap);
+	cluster->array[n].nr = nr;
+	cluster->array[n].tail = tail;
+	return &cluster->array[n];
+}
+
+static inline struct cluster *add_to_cluster_common(struct page *page,
+			struct zone *zone, struct page *neighbor)
+{
+	struct cluster *c;
+
+	if (neighbor) {
+		int batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+		c = neighbor->cluster;
+		if (c && c->nr < batch) {
+			page->cluster = c;
+			c->nr++;
+			return c;
+		}
+	}
+
+	c = new_cluster(zone, 1, page);
+	if (unlikely(!c))
+		return NULL;
+
+	page->cluster = c;
+	return c;
+}
+
+/*
+ * Add this page to the cluster where the previous head page belongs.
+ * Called after page is added to free_list(and becoming the new head).
+ */
+static inline void add_to_cluster_head(struct page *page, struct zone *zone,
+				       int order, int mt)
+{
+	struct page *neighbor;
+
+	if (order || !zone->cluster.len)
+		return;
+
+	neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ?
+		   NULL : list_entry(page->lru.next, struct page, lru);
+	add_to_cluster_common(page, zone, neighbor);
+}
+
+/*
+ * Add this page to the cluster where the previous tail page belongs.
+ * Called after page is added to free_list(and becoming the new tail).
+ */
+static inline void add_to_cluster_tail(struct page *page, struct zone *zone,
+				       int order, int mt)
+{
+	struct page *neighbor;
+	struct cluster *c;
+
+	if (order || !zone->cluster.len)
+		return;
+
+	neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ?
+		   NULL : list_entry(page->lru.prev, struct page, lru);
+	c = add_to_cluster_common(page, zone, neighbor);
+	c->tail = page;
+}
+
 static inline void add_to_buddy_common(struct page *page, struct zone *zone,
 					unsigned int order, int mt)
 {
@@ -765,6 +841,7 @@ static inline void add_to_buddy_head(struct page *page, struct zone *zone,
 {
 	add_to_buddy_common(page, zone, order, mt);
 	list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+	add_to_cluster_head(page, zone, order, mt);
 }
 
 static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
@@ -772,6 +849,7 @@ static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
 {
 	add_to_buddy_common(page, zone, order, mt);
 	list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+	add_to_cluster_tail(page, zone, order, mt);
 }
 
 static inline void rmv_page_order(struct page *page)
@@ -780,9 +858,29 @@ static inline void rmv_page_order(struct page *page)
 	set_page_private(page, 0);
 }
 
+/* called before removed from free_list */
+static inline void remove_from_cluster(struct page *page, struct zone *zone)
+{
+	struct cluster *c = page->cluster;
+	if (!c)
+		return;
+
+	page->cluster = NULL;
+	c->nr--;
+	if (!c->nr) {
+		int bit = c - zone->cluster.array;
+		c->tail = NULL;
+		clear_bit(bit, zone->cluster.bitmap);
+		if (bit < zone->cluster.zero_bit)
+			zone->cluster.zero_bit = bit;
+	} else if (page == c->tail)
+		c->tail = list_entry(page->lru.prev, struct page, lru);
+}
+
 static inline void remove_from_buddy(struct page *page, struct zone *zone,
 					unsigned int order)
 {
+	remove_from_cluster(page, zone);
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
@@ -2025,6 +2123,17 @@ static int move_freepages(struct zone *zone,
 	if (num_movable)
 		*num_movable = 0;
 
+	/*
+	 * Cluster alloced pages may have their PageBuddy flag unclear yet
+	 * after dropping zone->lock in rmqueue_bulk() and steal here could
+	 * move them back to free_list. So it's necessary to wait till all
+	 * those pages have their flags properly cleared.
+	 *
+	 * We do not need to disable cluster alloc though since we already
+	 * held zone->lock and no allocation could happen.
+	 */
+	zone_wait_cluster_alloc(zone);
+
 	for (page = start_page; page <= end_page;) {
 		if (!pfn_valid_within(page_to_pfn(page))) {
 			page++;
@@ -2049,8 +2158,10 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
+		remove_from_cluster(page, zone);
 		list_move(&page->lru,
 			  &zone->free_area[order].free_list[migratetype]);
+		add_to_cluster_head(page, zone, order, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2199,7 +2310,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
+	remove_from_cluster(page, zone);
 	list_move(&page->lru, &area->free_list[start_type]);
+	add_to_cluster_head(page, zone, current_order, start_type);
 }
 
 /*
@@ -2460,6 +2573,145 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 	return page;
 }
 
+static int __init zone_order0_cluster_init(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		int len, mt, batch;
+		unsigned long flags;
+		struct order0_cluster *cluster;
+
+		if (!managed_zone(zone))
+			continue;
+
+		/* no need to enable cluster allocation for batch<=1 zone */
+		preempt_disable();
+		batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+		preempt_enable();
+		if (batch <= 1)
+			continue;
+
+		cluster = &zone->cluster;
+		/* FIXME: possible overflow of int type */
+		len = DIV_ROUND_UP(zone->managed_pages, batch);
+		cluster->array = vzalloc(len * sizeof(struct cluster));
+		if (!cluster->array)
+			return -ENOMEM;
+		cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) *
+				sizeof(unsigned long));
+		if (!cluster->bitmap)
+			return -ENOMEM;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		cluster->len = len;
+		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+			struct page *page;
+			list_for_each_entry_reverse(page,
+					&zone->free_area[0].free_list[mt], lru)
+				add_to_cluster_head(page, zone, 0, mt);
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return 0;
+}
+subsys_initcall(zone_order0_cluster_init);
+
+static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long count,
+						struct list_head *list, int mt)
+{
+	struct list_head *head = &zone->free_area[0].free_list[mt];
+	int nr = 0;
+
+	while (nr < count) {
+		struct page *head_page;
+		struct list_head *tail, tmp_list;
+		struct cluster *c;
+		int bit;
+
+		head_page = list_first_entry_or_null(head, struct page, lru);
+		if (!head_page || !head_page->cluster)
+			break;
+
+		c = head_page->cluster;
+		tail = &c->tail->lru;
+
+		/* drop the cluster off free_list and attach to list */
+		list_cut_position(&tmp_list, head, tail);
+		list_splice_tail(&tmp_list, list);
+
+		nr += c->nr;
+		zone->free_area[0].nr_free -= c->nr;
+
+		/* this cluster is empty now */
+		c->tail = NULL;
+		c->nr = 0;
+		bit = c - zone->cluster.array;
+		clear_bit(bit, zone->cluster.bitmap);
+		if (bit < zone->cluster.zero_bit)
+			zone->cluster.zero_bit = bit;
+	}
+
+	return nr;
+}
+
+static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
+				unsigned long count, struct list_head *list,
+				int migratetype)
+{
+	int alloced;
+	struct page *page;
+
+	/*
+	 * Cluster alloc races with merging so don't try cluster alloc when we
+	 * can't skip merging. Note that can_skip_merge() keeps the same return
+	 * value from here till all pages have their flags properly processed,
+	 * i.e. the end of the function where in_progress is incremented, even
+	 * we have dropped the lock in the middle because the only place that
+	 * can change can_skip_merge()'s return value is compaction code and
+	 * compaction needs to wait on in_progress.
+	 */
+	if (!can_skip_merge(zone, 0))
+		return 0;
+
+	/* Cluster alloc is disabled, mostly compaction is already in progress */
+	if (zone->cluster.disable_depth)
+		return 0;
+
+	/* Cluster alloc is disabled for this zone */
+	if (unlikely(!zone->cluster.len))
+		return 0;
+
+	alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
+	if (!alloced)
+		return 0;
+
+	/*
+	 * Cache miss on page structure could slow things down
+	 * dramatically so accessing these alloced pages without
+	 * holding lock for better performance.
+	 *
+	 * Since these pages still have PageBuddy set, there is a race
+	 * window between now and when PageBuddy is cleared for them
+	 * below. Any operation that would scan a pageblock and check
+	 * PageBuddy(page), e.g. compaction, will need to wait till all
+	 * such pages are properly processed. in_progress is used for
+	 * such purpose so increase it now before dropping the lock.
+	 */
+	atomic_inc(&zone->cluster.in_progress);
+	spin_unlock(&zone->lock);
+
+	list_for_each_entry(page, list, lru) {
+		rmv_page_order(page);
+		page->cluster = NULL;
+		set_pcppage_migratetype(page, migratetype);
+	}
+	atomic_dec(&zone->cluster.in_progress);
+
+	return alloced;
+}
+
 /*
  * Obtain a specified number of elements from the buddy allocator, all under
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
@@ -2469,17 +2721,23 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype)
 {
-	int i, alloced = 0;
+	int i, alloced;
+	struct page *page, *tmp;
 
 	spin_lock(&zone->lock);
-	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+	alloced = rmqueue_bulk_cluster(zone, order, count, list, migratetype);
+	if (alloced > 0) {
+		if (alloced >= count)
+			goto out;
+		else
+			spin_lock(&zone->lock);
+	}
+
+	for (; alloced < count; alloced++) {
+		page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
 
-		if (unlikely(check_pcp_refill(page)))
-			continue;
-
 		/*
 		 * Split buddy pages returned by expand() are received here in
 		 * physical page order. The page is added to the tail of
@@ -2491,7 +2749,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->lru, list);
-		alloced++;
+	}
+	spin_unlock(&zone->lock);
+
+out:
+	i = alloced;
+	list_for_each_entry_safe(page, tmp, list, lru) {
+		if (unlikely(check_pcp_refill(page))) {
+			list_del(&page->lru);
+			alloced--;
+			continue;
+		}
+
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
@@ -2504,7 +2773,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
 	return alloced;
 }
 
@@ -7744,6 +8012,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	unsigned long outer_start, outer_end;
 	unsigned int order;
 	int ret = 0;
+	struct zone *zone = page_zone(pfn_to_page(start));
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
@@ -7786,6 +8055,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	if (ret)
 		return ret;
 
+	zone_wait_and_disable_cluster_alloc(zone);
 	/*
 	 * In case of -EBUSY, we'd like to know which page causes problem.
 	 * So, just fall through. test_pages_isolated() has a tracepoint
@@ -7868,6 +8138,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 done:
 	undo_isolate_page_range(pfn_max_align_down(start),
 				pfn_max_align_up(end), migratetype);
+
+	zone_enable_cluster_alloc(zone);
 	return ret;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 4/4] mm/free_pcppages_bulk: reduce overhead of cluster operation on free path
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
                   ` (2 preceding siblings ...)
  2018-03-20  8:54 ` [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure Aaron Lu
@ 2018-03-20  8:54 ` Aaron Lu
  2018-03-21 17:44 ` [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Daniel Jordan
  2018-03-29 19:19 ` Daniel Jordan
  5 siblings, 0 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20  8:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

After "no_merge for order 0", the biggest overhead in free path for
order 0 pages is now add_to_cluster(). As pages are freed one by one,
it caused frequent operation of add_to_cluster().

Ideally, if only one migratetype pcp list has pages to free and
count == pcp->batch in free_pcppages_bulk(), we can avoid calling
add_to_cluster() once per page and instead add them in one go as a
single cluster. Let's call this ideal case single_mt;
single_mt_unmovable represents the case where only the unmovable pcp
list has pages and count in free_pcppages_bulk() equals pcp->batch.
Ditto for single_mt_movable and single_mt_reclaimable.

I added some counters to see how often this ideal case occurs. On my
desktop, after boot:

free_pcppages_bulk:	6268
single_mt:		3885 (62%)

free_pcppages_bulk is the number of times this function gets called.
single_mt is the number of times only one pcp migratetype list had
pages to be freed and count equaled pcp->batch.

single_mt can be further divided into the following 3 cases:
single_mt_unmovable:	 263 (4%)
single_mt_movable:	2566 (41%)
single_mt_reclaimable:	1056 (17%)

After kbuild with a distro kconfig:

free_pcppages_bulk:	9100508
single_mt:		8440310 (93%)

Again, single_mt can be further divided:
single_mt_unmovable:	    290 (0%)
single_mt_movable:	8435483 (92.75%)
single_mt_reclaimable:	   4537 (0.05%)

Considering that capturing the single_mt_movable case requires fewer
lines of code and it is by far the most common ideal case, I think
capturing this case alone is enough.

This optimization brings zone->lock contention down from 25% to
almost zero again using the parallel free workload.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac93833a2877..ad15e4ef99d6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1281,6 +1281,36 @@ static bool bulkfree_pcp_prepare(struct page *page)
 }
 #endif /* CONFIG_DEBUG_VM */
 
+static inline bool free_cluster_pages(struct zone *zone, struct list_head *list,
+				      int mt, int count)
+{
+	struct cluster *c;
+	struct page *page, *n;
+
+	if (!can_skip_merge(zone, 0))
+		return false;
+
+	if (count != this_cpu_ptr(zone->pageset)->pcp.batch)
+		return false;
+
+	c = new_cluster(zone, count, list_first_entry(list, struct page, lru));
+	if (unlikely(!c))
+		return false;
+
+	list_for_each_entry_safe(page, n, list, lru) {
+		set_page_order(page, 0);
+		set_page_merge_skipped(page);
+		page->cluster = c;
+		list_add(&page->lru, &zone->free_area[0].free_list[mt]);
+	}
+
+	INIT_LIST_HEAD(list);
+	zone->free_area[0].nr_free += count;
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
+
+	return true;
+}
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
@@ -1295,9 +1325,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
 static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp)
 {
-	int migratetype = 0;
-	int batch_free = 0;
-	bool isolated_pageblocks;
+	int migratetype = MIGRATE_MOVABLE;
+	int batch_free = 0, saved_count = count;
+	bool isolated_pageblocks, single_mt = false;
 	struct page *page, *tmp;
 	LIST_HEAD(head);
 
@@ -1319,8 +1349,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (list_empty(list));
 
 		/* This is the only non-empty list. Free them all. */
-		if (batch_free == MIGRATE_PCPTYPES)
+		if (batch_free == MIGRATE_PCPTYPES) {
 			batch_free = count;
+			if (batch_free == saved_count)
+				single_mt = true;
+		}
 
 		do {
 			unsigned long pfn, buddy_pfn;
@@ -1359,9 +1392,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
+	if (!isolated_pageblocks && single_mt)
+		free_cluster_pages(zone, &head, migratetype, saved_count);
+
 	/*
 	 * Use safe version since after __free_one_page(),
 	 * page->lru.next will not point to original list.
+	 *
+	 * If free_cluster_pages() succeeds, head will be an empty list here.
 	 */
 	list_for_each_entry_safe(page, tmp, &head, lru) {
 		int mt = get_pcppage_migratetype(page);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy
  2018-03-20  8:54 ` [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy Aaron Lu
@ 2018-03-20 11:35   ` Vlastimil Babka
  2018-03-20 13:50     ` Aaron Lu
  0 siblings, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2018-03-20 11:35 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Mel Gorman, Matthew Wilcox,
	Daniel Jordan

On 03/20/2018 09:54 AM, Aaron Lu wrote:
> There are multiple places that add/remove a page into/from buddy,
> introduce helper functions for them.
> 
> This also makes it easier to add code when a page is added/removed
> to/from buddy.
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db9cfb2265b..3cdf1e10d412 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -736,12 +736,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
>  	__SetPageBuddy(page);
>  }
>  
> +static inline void add_to_buddy_common(struct page *page, struct zone *zone,
> +					unsigned int order, int mt)
> +{
> +	set_page_order(page, order);
> +	zone->free_area[order].nr_free++;

The 'mt' parameter seems unused here. Otherwise

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> +}
> +
> +static inline void add_to_buddy_head(struct page *page, struct zone *zone,
> +					unsigned int order, int mt)
> +{
> +	add_to_buddy_common(page, zone, order, mt);
> +	list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> +}
> +
> +static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
> +					unsigned int order, int mt)
> +{
> +	add_to_buddy_common(page, zone, order, mt);
> +	list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
> +}
> +
>  static inline void rmv_page_order(struct page *page)
>  {
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);
>  }
>  
> +static inline void remove_from_buddy(struct page *page, struct zone *zone,
> +					unsigned int order)
> +{
> +	list_del(&page->lru);
> +	zone->free_area[order].nr_free--;
> +	rmv_page_order(page);
> +}
> +
>  /*
>   * This function checks whether a page is free && is the buddy
>   * we can do coalesce a page and its buddy if
> @@ -845,13 +874,10 @@ static inline void __free_one_page(struct page *page,
>  		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
>  		 * merge with it and move up one order.
>  		 */
> -		if (page_is_guard(buddy)) {
> +		if (page_is_guard(buddy))
>  			clear_page_guard(zone, buddy, order, migratetype);
> -		} else {
> -			list_del(&buddy->lru);
> -			zone->free_area[order].nr_free--;
> -			rmv_page_order(buddy);
> -		}
> +		else
> +			remove_from_buddy(buddy, zone, order);
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -883,8 +909,6 @@ static inline void __free_one_page(struct page *page,
>  	}
>  
>  done_merging:
> -	set_page_order(page, order);
> -
>  	/*
>  	 * If this is not the largest possible page, check if the buddy
>  	 * of the next-highest order is free. If it is, it's possible
> @@ -901,15 +925,12 @@ static inline void __free_one_page(struct page *page,
>  		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
>  		if (pfn_valid_within(buddy_pfn) &&
>  		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -			list_add_tail(&page->lru,
> -				&zone->free_area[order].free_list[migratetype]);
> -			goto out;
> +			add_to_buddy_tail(page, zone, order, migratetype);
> +			return;
>  		}
>  	}
>  
> -	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> -out:
> -	zone->free_area[order].nr_free++;
> +	add_to_buddy_head(page, zone, order, migratetype);
>  }
>  
>  /*
> @@ -1731,9 +1752,7 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>  
> -		list_add(&page[size].lru, &area->free_list[migratetype]);
> -		area->nr_free++;
> -		set_page_order(&page[size], high);
> +		add_to_buddy_head(&page[size], zone, high, migratetype);
>  	}
>  }
>  
> @@ -1877,9 +1896,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  							struct page, lru);
>  		if (!page)
>  			continue;
> -		list_del(&page->lru);
> -		rmv_page_order(page);
> -		area->nr_free--;
> +		remove_from_buddy(page, zone, current_order);
>  		expand(zone, page, order, current_order, area, migratetype);
>  		set_pcppage_migratetype(page, migratetype);
>  		return page;
> @@ -2795,9 +2812,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  	}
>  
>  	/* Remove page from free list */
> -	list_del(&page->lru);
> -	zone->free_area[order].nr_free--;
> -	rmv_page_order(page);
> +	remove_from_buddy(page, zone, order);
>  
>  	/*
>  	 * Set the pageblock if the isolated page is at least half of a
> @@ -7886,9 +7901,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>  		pr_info("remove from free list %lx %d %lx\n",
>  			pfn, 1 << order, end_pfn);
>  #endif
> -		list_del(&page->lru);
> -		rmv_page_order(page);
> -		zone->free_area[order].nr_free--;
> +		remove_from_buddy(page, zone, order);
>  		for (i = 0; i < (1 << order); i++)
>  			SetPageReserved((page+i));
>  		pfn += (1 << order);
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20  8:54 ` [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed Aaron Lu
@ 2018-03-20 11:45   ` Vlastimil Babka
  2018-03-20 14:11     ` Aaron Lu
  2018-03-20 22:58   ` Figo.zhang
  1 sibling, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2018-03-20 11:45 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Mel Gorman, Matthew Wilcox,
	Daniel Jordan

On 03/20/2018 09:54 AM, Aaron Lu wrote:
> Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> Intel Skylake server showed severe lock contention of zone->lock, as
> high as about 80%(42% on allocation path and 35% on free path) CPU
> cycles are burnt spinning. With perf, the most time consuming part inside
> that lock on free path is cache missing on page structures, mostly on
> the to-be-freed page's buddy due to merging.

But why, with all the prefetching in place?

> One way to avoid this overhead is not do any merging at all for order-0
> pages. With this approach, the lock contention for zone->lock on free
> path dropped to 1.1% but allocation side still has as high as 42% lock
> contention. In the meantime, the dropped lock contention on free side
> doesn't translate to performance increase, instead, it's consumed by
> increased lock contention of the per node lru_lock(rose from 5% to 37%)
> and the final performance slightly dropped about 1%.
> 
> Though performance dropped a little, it almost eliminated zone lock
> contention on free path and it is the foundation for the next patch
> that eliminates zone lock contention for allocation path.

Not thrilled about such disruptive change in the name of a
microbenchmark :/ Shouldn't normally the pcplists hide the overhead?
If not, wouldn't it make more sense to turn zone->lock into a range lock?

> A new document file called "struct_page_filed" is added to explain
> the newly reused field in "struct page".

Sounds rather ad-hoc for a single field, I'd rather document it via
comments.

> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  Documentation/vm/struct_page_field |  5 +++
>  include/linux/mm_types.h           |  1 +
>  mm/compaction.c                    | 13 +++++-
>  mm/internal.h                      | 27 ++++++++++++
>  mm/page_alloc.c                    | 89 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 122 insertions(+), 13 deletions(-)
>  create mode 100644 Documentation/vm/struct_page_field
> 
> diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
> new file mode 100644
> index 000000000000..1ab6c19ccc7a
> --- /dev/null
> +++ b/Documentation/vm/struct_page_field
> @@ -0,0 +1,5 @@
> +buddy_merge_skipped:
> +Used to indicate this page skipped merging when added to buddy. This
> +field only makes sense if the page is in Buddy and is order zero.
> +It's a bug if any higher order pages in Buddy has this field set.
> +Shares space with index.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fd1af6b9591d..7edc4e102a8e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -91,6 +91,7 @@ struct page {
>  		pgoff_t index;		/* Our offset within mapping. */
>  		void *freelist;		/* sl[aou]b first free object */
>  		/* page_deferred_list().prev	-- second tail page */
> +		bool buddy_merge_skipped; /* skipped merging when added to buddy */
>  	};
>  
>  	union {
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2c8999d027ab..fb9031fdca41 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -776,8 +776,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		 * potential isolation targets.
>  		 */
>  		if (PageBuddy(page)) {
> -			unsigned long freepage_order = page_order_unsafe(page);
> +			unsigned long freepage_order;
>  
> +			/*
> +			 * If this is a merge_skipped page, do merge now
> +			 * since high-order pages are needed. zone lock
> +			 * isn't taken for the merge_skipped check so the
> +			 * check could be wrong but the worst case is we
> +			 * lose a merge opportunity.
> +			 */
> +			if (page_merge_was_skipped(page))
> +				try_to_merge_page(page);
> +
> +			freepage_order = page_order_unsafe(page);
>  			/*
>  			 * Without lock, we cannot be sure that what we got is
>  			 * a valid page order. Consider only values in the
> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..2bfbaae2d835 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -538,4 +538,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>  }
>  
>  void setup_zone_pageset(struct zone *zone);
> +
> +static inline bool page_merge_was_skipped(struct page *page)
> +{
> +	return page->buddy_merge_skipped;
> +}
> +
> +void try_to_merge_page(struct page *page);
> +
> +#ifdef CONFIG_COMPACTION
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +	/* Compaction has failed in this zone, we shouldn't skip merging */
> +	if (zone->compact_considered)
> +		return false;
> +
> +	/* Only consider no_merge for order 0 pages */
> +	if (order)
> +		return false;
> +
> +	return true;
> +}
> +#else /* CONFIG_COMPACTION */
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +	return false;
> +}
> +#endif  /* CONFIG_COMPACTION */
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3cdf1e10d412..eb78014dfbde 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -730,6 +730,16 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>  				unsigned int order, int migratetype) {}
>  #endif
>  
> +static inline void set_page_merge_skipped(struct page *page)
> +{
> +	page->buddy_merge_skipped = true;
> +}
> +
> +static inline void clear_page_merge_skipped(struct page *page)
> +{
> +	page->buddy_merge_skipped = false;
> +}
> +
>  static inline void set_page_order(struct page *page, unsigned int order)
>  {
>  	set_page_private(page, order);
> @@ -739,6 +749,13 @@ static inline void set_page_order(struct page *page, unsigned int order)
>  static inline void add_to_buddy_common(struct page *page, struct zone *zone,
>  					unsigned int order, int mt)
>  {
> +	/*
> +	 * Always clear buddy_merge_skipped when added to buddy because
> +	 * buddy_merge_skipped shares space with index and index could
> +	 * be used as migratetype for PCP pages.
> +	 */
> +	clear_page_merge_skipped(page);
> +
>  	set_page_order(page, order);
>  	zone->free_area[order].nr_free++;
>  }
> @@ -769,6 +786,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
>  	list_del(&page->lru);
>  	zone->free_area[order].nr_free--;
>  	rmv_page_order(page);
> +	clear_page_merge_skipped(page);
>  }
>  
>  /*
> @@ -839,7 +857,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>   * -- nyc
>   */
>  
> -static inline void __free_one_page(struct page *page,
> +static inline void do_merge(struct page *page,
>  		unsigned long pfn,
>  		struct zone *zone, unsigned int order,
>  		int migratetype)
> @@ -851,16 +869,6 @@ static inline void __free_one_page(struct page *page,
>  
>  	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
>  
> -	VM_BUG_ON(!zone_is_initialized(zone));
> -	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> -
> -	VM_BUG_ON(migratetype == -1);
> -	if (likely(!is_migrate_isolate(migratetype)))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
> -	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> -	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> -
>  continue_merging:
>  	while (order < max_order - 1) {
>  		buddy_pfn = __find_buddy_pfn(pfn, order);
> @@ -933,6 +941,61 @@ static inline void __free_one_page(struct page *page,
>  	add_to_buddy_head(page, zone, order, migratetype);
>  }
>  
> +void try_to_merge_page(struct page *page)
> +{
> +	unsigned long pfn, buddy_pfn, flags;
> +	struct page *buddy;
> +	struct zone *zone;
> +
> +	/*
> +	 * No need to do merging if buddy is not free.
> +	 * zone lock isn't taken so this could be wrong but worst case
> +	 * is we lose a merge opportunity.
> +	 */
> +	pfn = page_to_pfn(page);
> +	buddy_pfn = __find_buddy_pfn(pfn, 0);
> +	buddy = page + (buddy_pfn - pfn);
> +	if (!PageBuddy(buddy))
> +		return;
> +
> +	zone = page_zone(page);
> +	spin_lock_irqsave(&zone->lock, flags);
> +	/* Verify again after taking the lock */
> +	if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
> +		   PageBuddy(buddy))) {
> +		int mt = get_pageblock_migratetype(page);
> +
> +		remove_from_buddy(page, zone, 0);
> +		do_merge(page, pfn, zone, 0, mt);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static inline void __free_one_page(struct page *page,
> +		unsigned long pfn,
> +		struct zone *zone, unsigned int order,
> +		int migratetype)
> +{
> +	VM_BUG_ON(!zone_is_initialized(zone));
> +	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> +
> +	VM_BUG_ON(migratetype == -1);
> +	if (likely(!is_migrate_isolate(migratetype)))
> +		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +
> +	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> +	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> +
> +	if (can_skip_merge(zone, order)) {
> +		add_to_buddy_head(page, zone, 0, migratetype);
> +		set_page_merge_skipped(page);
> +		return;
> +	}
> +
> +	do_merge(page, pfn, zone, order, migratetype);
> +}
> +
> +
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -1183,8 +1246,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			 * can be offset by reduced memory latency later. To
>  			 * avoid excessive prefetching due to large count, only
>  			 * prefetch buddy for the last pcp->batch nr of pages.
> +			 *
> +			 * If merge can be skipped, no need to prefetch buddy.
>  			 */
> -			if (count > pcp->batch)
> +			if (can_skip_merge(zone, 0) || count > pcp->batch)
>  				continue;
>  			pfn = page_to_pfn(page);
>  			buddy_pfn = __find_buddy_pfn(pfn, 0);
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy
  2018-03-20 11:35   ` Vlastimil Babka
@ 2018-03-20 13:50     ` Aaron Lu
  0 siblings, 0 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20 13:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 12:35:46PM +0100, Vlastimil Babka wrote:
> On 03/20/2018 09:54 AM, Aaron Lu wrote:
> > There are multiple places that add/remove a page into/from buddy,
> > introduce helper functions for them.
> > 
> > This also makes it easier to add code when a page is added/removed
> > to/from buddy.
> > 
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 39 insertions(+), 26 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3db9cfb2265b..3cdf1e10d412 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -736,12 +736,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
> >  	__SetPageBuddy(page);
> >  }
> >  
> > +static inline void add_to_buddy_common(struct page *page, struct zone *zone,
> > +					unsigned int order, int mt)
> > +{
> > +	set_page_order(page, order);
> > +	zone->free_area[order].nr_free++;
> 
> The 'mt' parameter seems unused here. Otherwise

Ah right, I forgot to remove it...

> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> > +}
> > +
> > +static inline void add_to_buddy_head(struct page *page, struct zone *zone,
> > +					unsigned int order, int mt)
> > +{
> > +	add_to_buddy_common(page, zone, order, mt);
> > +	list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> > +}
> > +
> > +static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
> > +					unsigned int order, int mt)
> > +{
> > +	add_to_buddy_common(page, zone, order, mt);
> > +	list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
> > +}
> > +
> >  static inline void rmv_page_order(struct page *page)
> >  {
> >  	__ClearPageBuddy(page);
> >  	set_page_private(page, 0);
> >  }
> >  
> > +static inline void remove_from_buddy(struct page *page, struct zone *zone,
> > +					unsigned int order)
> > +{
> > +	list_del(&page->lru);
> > +	zone->free_area[order].nr_free--;
> > +	rmv_page_order(page);
> > +}
> > +
> >  /*
> >   * This function checks whether a page is free && is the buddy
> >   * we can do coalesce a page and its buddy if
> > @@ -845,13 +874,10 @@ static inline void __free_one_page(struct page *page,
> >  		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
> >  		 * merge with it and move up one order.
> >  		 */
> > -		if (page_is_guard(buddy)) {
> > +		if (page_is_guard(buddy))
> >  			clear_page_guard(zone, buddy, order, migratetype);
> > -		} else {
> > -			list_del(&buddy->lru);
> > -			zone->free_area[order].nr_free--;
> > -			rmv_page_order(buddy);
> > -		}
> > +		else
> > +			remove_from_buddy(buddy, zone, order);
> >  		combined_pfn = buddy_pfn & pfn;
> >  		page = page + (combined_pfn - pfn);
> >  		pfn = combined_pfn;
> > @@ -883,8 +909,6 @@ static inline void __free_one_page(struct page *page,
> >  	}
> >  
> >  done_merging:
> > -	set_page_order(page, order);
> > -
> >  	/*
> >  	 * If this is not the largest possible page, check if the buddy
> >  	 * of the next-highest order is free. If it is, it's possible
> > @@ -901,15 +925,12 @@ static inline void __free_one_page(struct page *page,
> >  		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
> >  		if (pfn_valid_within(buddy_pfn) &&
> >  		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
> > -			list_add_tail(&page->lru,
> > -				&zone->free_area[order].free_list[migratetype]);
> > -			goto out;
> > +			add_to_buddy_tail(page, zone, order, migratetype);
> > +			return;
> >  		}
> >  	}
> >  
> > -	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> > -out:
> > -	zone->free_area[order].nr_free++;
> > +	add_to_buddy_head(page, zone, order, migratetype);
> >  }
> >  
> >  /*
> > @@ -1731,9 +1752,7 @@ static inline void expand(struct zone *zone, struct page *page,
> >  		if (set_page_guard(zone, &page[size], high, migratetype))
> >  			continue;
> >  
> > -		list_add(&page[size].lru, &area->free_list[migratetype]);
> > -		area->nr_free++;
> > -		set_page_order(&page[size], high);
> > +		add_to_buddy_head(&page[size], zone, high, migratetype);
> >  	}
> >  }
> >  
> > @@ -1877,9 +1896,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  							struct page, lru);
> >  		if (!page)
> >  			continue;
> > -		list_del(&page->lru);
> > -		rmv_page_order(page);
> > -		area->nr_free--;
> > +		remove_from_buddy(page, zone, current_order);
> >  		expand(zone, page, order, current_order, area, migratetype);
> >  		set_pcppage_migratetype(page, migratetype);
> >  		return page;
> > @@ -2795,9 +2812,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
> >  	}
> >  
> >  	/* Remove page from free list */
> > -	list_del(&page->lru);
> > -	zone->free_area[order].nr_free--;
> > -	rmv_page_order(page);
> > +	remove_from_buddy(page, zone, order);
> >  
> >  	/*
> >  	 * Set the pageblock if the isolated page is at least half of a
> > @@ -7886,9 +7901,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> >  		pr_info("remove from free list %lx %d %lx\n",
> >  			pfn, 1 << order, end_pfn);
> >  #endif
> > -		list_del(&page->lru);
> > -		rmv_page_order(page);
> > -		zone->free_area[order].nr_free--;
> > +		remove_from_buddy(page, zone, order);
> >  		for (i = 0; i < (1 << order); i++)
> >  			SetPageReserved((page+i));
> >  		pfn += (1 << order);
> > 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20 11:45   ` Vlastimil Babka
@ 2018-03-20 14:11     ` Aaron Lu
  2018-03-21  7:53       ` Vlastimil Babka
  2018-03-22 17:15       ` Matthew Wilcox
  0 siblings, 2 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-20 14:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 12:45:50PM +0100, Vlastimil Babka wrote:
> On 03/20/2018 09:54 AM, Aaron Lu wrote:
> > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > Intel Skylake server showed severe lock contention of zone->lock, as
> > high as about 80%(42% on allocation path and 35% on free path) CPU
> > cycles are burnt spinning. With perf, the most time consuming part inside
> > that lock on free path is cache missing on page structures, mostly on
> > the to-be-freed page's buddy due to merging.
> 
> But why, with all the prefetching in place?

The prefetch is just for its order-0 buddy; if a merge happens, then its
order-1 buddy will also be checked, and so on, so the cache misses are
much more numerous in merge mode.
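
Roughly, the merge path looks like this (a simplified sketch of the loop
in __free_one_page()/do_merge(), not the exact code); every level of
merging dereferences one more buddy "struct page" that is likely cache
cold:

    while (order < max_order - 1) {
        buddy_pfn = __find_buddy_pfn(pfn, order);
        buddy = page + (buddy_pfn - pfn);
        /* page_is_buddy() reads the buddy's flags/order: likely a cache miss */
        if (!page_is_buddy(page, buddy, order))
            break;
        /* taking the buddy off its free_list touches buddy->lru: more misses */
        remove_from_buddy(buddy, zone, order);
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }

The existing prefetch on the free path only covers the first of these
buddies, not the higher-order ones.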

> 
> > One way to avoid this overhead is not do any merging at all for order-0
> > pages. With this approach, the lock contention for zone->lock on free
> > path dropped to 1.1% but allocation side still has as high as 42% lock
> > contention. In the meantime, the dropped lock contention on free side
> > doesn't translate to performance increase, instead, it's consumed by
> > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > and the final performance slightly dropped about 1%.
> > 
> > Though performance dropped a little, it almost eliminated zone lock
> > contention on free path and it is the foundation for the next patch
> > that eliminates zone lock contention for allocation path.
> 
> Not thrilled about such a disruptive change in the name of a
> microbenchmark :/ Shouldn't the pcplists normally hide the overhead?

Sadly, with the default pcp count, it didn't avoid the lock contention.
We can of course increase pcp->count to a value large enough to avoid
entering buddy and thus avoid zone->lock contention, but that would
require the admin to manually tune the value on a per-machine,
per-workload basis, I believe.
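
As a rough model (not kernel code; the batch value and the mapping size
below are just assumptions for illustration): with a pcp->batch of 31,
each CPU still ends up refilling/draining its pcplist, and thus taking
zone->lock, about once per 31 pages on both the allocation and the free
side, and with >100 CPUs doing that concurrently the lock stays hot:

    #include <stdio.h>

    int main(void)
    {
        long pages = (128L << 20) / 4096; /* a 128MB range of 4K pages, per iteration */
        long batch = 31;                  /* assumed default pcp->batch, machine dependent */

        /* roughly one zone->lock round trip per batch, on both alloc and free */
        printf("approx zone->lock acquisitions per CPU: %ld alloc + %ld free\n",
               pages / batch, pages / batch);
        return 0;
    }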

> If not, wouldn't it make more sense to turn zone->lock into a range lock?

I'm not familiar with range locks, I will need to take a look at them;
thanks for the pointer.

> 
> > A new document file called "struct_page_filed" is added to explain
> > the newly reused field in "struct page".
> 
> Sounds rather ad-hoc for a single field, I'd rather document it via
> comments.

Dave would like to have a document to explain all those "struct page"
fields that are repurposed under different scenarios and this is the
very start of the document :-)

I probably should have explained the intent of the document more.

Thanks for taking a look at this.

> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  Documentation/vm/struct_page_field |  5 +++
> >  include/linux/mm_types.h           |  1 +
> >  mm/compaction.c                    | 13 +++++-
> >  mm/internal.h                      | 27 ++++++++++++
> >  mm/page_alloc.c                    | 89 +++++++++++++++++++++++++++++++++-----
> >  5 files changed, 122 insertions(+), 13 deletions(-)
> >  create mode 100644 Documentation/vm/struct_page_field
> > 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-20  8:54 ` [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure Aaron Lu
@ 2018-03-20 22:29   ` Figo.zhang
  2018-03-21  1:52     ` Aaron Lu
  2018-03-21 12:55   ` Vlastimil Babka
  1 sibling, 1 reply; 30+ messages in thread
From: Figo.zhang @ 2018-03-20 22:29 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:

> Profile on Intel Skylake server shows the most time consuming part
> under zone->lock on allocation path is accessing those to-be-returned
> page's "struct page" on the free_list inside zone->lock. One explanation
> is, different CPUs are releasing pages to the head of free_list and
> those page's 'struct page' may very well be cache cold for the allocating
> CPU when it grabs these pages from free_list' head. The purpose here
> is to avoid touching these pages one by one inside zone->lock.
>
> One idea is, we just take the requested number of pages off free_list
> with something like list_cut_position() and then adjust nr_free of
> free_area accordingly inside zone->lock and other operations like
> clearing PageBuddy flag for these pages are done outside of zone->lock.
>

Sounds good!
Is your idea to reduce the lock contention in rmqueue_bulk() by
splitting the order-0 free list into two lists, one accessed without
zone->lock and the other needing zone->lock?

Holding zone->lock across the whole loop in rmqueue_bulk() seems like
a rather coarse lock granularity,
so why not change it like this?

static int rmqueue_bulk(struct zone *zone, unsigned int order,
            unsigned long count, struct list_head *list,
            int migratetype, bool cold)
{

    for (i = 0; i < count; ++i) {
        spin_lock(&zone->lock);
        struct page *page = __rmqueue(zone, order, migratetype);
       spin_unlock(&zone->lock);
       ...
    }
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

    return i;
}


>
> list_cut_position() needs to know where to cut, that's what the new
> 'struct cluster' meant to provide. All pages on order 0's free_list
> belongs to a cluster so when a number of pages is needed, the cluster
> to which head page of free_list belongs is checked and then tail page
> of the cluster could be found. With tail page, list_cut_position() can
> be used to drop the cluster off free_list. The 'struct cluster' also has
> 'nr' to tell how many pages this cluster has so nr_free of free_area can
> be adjusted inside the lock too.
>
> This caused a race window though: from the moment zone->lock is dropped
> till these pages' PageBuddy flags get cleared, these pages are not in
> buddy but still have PageBuddy flag set.
>
> This doesn't cause problems for users that access buddy pages through
> free_list. But there are other users, like move_freepages() which is
> used to move a pageblock pages from one migratetype to another in
> fallback allocation path, will test PageBuddy flag of a page derived
> from PFN. The end result could be that for pages in the race window,
> they are moved back to free_list of another migratetype. For this
> reason, a synchronization function zone_wait_cluster_alloc() is
> introduced to wait till all pages are in correct state. This function
> is meant to be called with zone->lock held, so after this function
> returns, we do not need to worry about new pages becoming racy state.
>
> Another user is compaction, where it will scan a pageblock for
> migratable candidates. In this process, pages derived from PFN will
> be checked for PageBuddy flag to decide if it is a merge skipped page.
> To avoid a racy page getting merged back into buddy, the
> zone_wait_and_disable_cluster_alloc() function is introduced to:
> 1 disable clustered allocation by increasing zone->cluster.disable_depth;
> 2 wait till the race window pass by calling zone_wait_cluster_alloc().
> This function is also meant to be called with zone->lock held so after
> it returns, all pages are in correct state and no more cluster alloc
> will be attempted till zone_enable_cluster_alloc() is called to decrease
> zone->cluster.disable_depth.
>
> The two patches could eliminate zone->lock contention entirely but at
> the same time, pgdat->lru_lock contention rose to 82%. Final performance
> increased about 8.3%.
>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  Documentation/vm/struct_page_field |   5 +
>  include/linux/mm_types.h           |   2 +
>  include/linux/mmzone.h             |  35 +++++
>  mm/compaction.c                    |   4 +
>  mm/internal.h                      |  34 +++++
>  mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++--
>  6 files changed, 360 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
> index 1ab6c19ccc7a..bab738ea4e0a 100644
> --- a/Documentation/vm/struct_page_field
> +++ b/Documentation/vm/struct_page_field
> @@ -3,3 +3,8 @@ Used to indicate this page skipped merging when added to buddy. This
>  field only makes sense if the page is in Buddy and is order zero.
>  It's a bug if any higher order pages in Buddy has this field set.
>  Shares space with index.
> +
> +cluster:
> +Order 0 Buddy pages are grouped in cluster on free_list to speed up
> +allocation. This field stores the cluster pointer for them.
> +Shares space with mapping.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7edc4e102a8e..49fe9d755a7c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -84,6 +84,8 @@ struct page {
>                 void *s_mem;                    /* slab first object */
>                 atomic_t compound_mapcount;     /* first tail page */
>                 /* page_deferred_list().next     -- second tail page */
> +
> +               struct cluster *cluster;        /* order 0 cluster this page belongs to */
>         };
>
>         /* Second double word */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7522a6987595..09ba9d3cc385 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -355,6 +355,40 @@ enum zone_type {
>
>  #ifndef __GENERATING_BOUNDS_H
>
> +struct cluster {
> +       struct page     *tail;  /* tail page of the cluster */
> +       int             nr;     /* how many pages are in this cluster */
> +};
> +
> +struct order0_cluster {
> +       /* order 0 cluster array, dynamically allocated */
> +       struct cluster *array;
> +       /*
> +        * order 0 cluster array length, also used to indicate if cluster
> +        * allocation is enabled for this zone(cluster allocation is disabled
> +        * for small zones whose batch size is smaller than 1, like DMA zone)
> +        */
> +       int             len;
> +       /*
> +        * smallest position from where we search for an
> +        * empty cluster from the cluster array
> +        */
> +       int             zero_bit;
> +       /* bitmap used to quickly locate an empty cluster from cluster array */
> +       unsigned long   *bitmap;
> +
> +       /* disable cluster allocation to avoid new pages becoming racy state. */
> +       unsigned long   disable_depth;
> +
> +       /*
> +        * used to indicate if there are pages allocated in cluster mode
> +        * still in racy state. Caller with zone->lock held could use helper
> +        * function zone_wait_cluster_alloc() to wait all such pages to exit
> +        * the race window.
> +        */
> +       atomic_t        in_progress;
> +};
> +
>  struct zone {
>         /* Read-mostly fields */
>
> @@ -459,6 +493,7 @@ struct zone {
>
>         /* free areas of different sizes */
>         struct free_area        free_area[MAX_ORDER];
> +       struct order0_cluster   cluster;
>
>         /* zone flags, see below */
>         unsigned long           flags;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index fb9031fdca41..e71fa82786a1 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1601,6 +1601,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>
>         migrate_prep_local();
>
> +       zone_wait_and_disable_cluster_alloc(zone);
> +
>         while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
>                 int err;
>
> @@ -1699,6 +1701,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>                         zone->compact_cached_free_pfn = free_pfn;
>         }
>
> +       zone_enable_cluster_alloc(zone);
> +
>         count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
>         count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 2bfbaae2d835..1b0535af1b49 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -557,12 +557,46 @@ static inline bool can_skip_merge(struct zone *zone, int order)
>         if (order)
>                 return false;
>
> +       /*
> +        * Clustered allocation is only disabled when high-order pages
> +        * are needed, e.g. in compaction and CMA alloc, so we should
> +        * also skip merging in that case.
> +        */
> +       if (zone->cluster.disable_depth)
> +               return false;
> +
>         return true;
>  }
> +
> +static inline void zone_wait_cluster_alloc(struct zone *zone)
> +{
> +       while (atomic_read(&zone->cluster.in_progress))
> +               cpu_relax();
> +}
> +
> +static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
> +{
> +       unsigned long flags;
> +       spin_lock_irqsave(&zone->lock, flags);
> +       zone->cluster.disable_depth++;
> +       zone_wait_cluster_alloc(zone);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static inline void zone_enable_cluster_alloc(struct zone *zone)
> +{
> +       unsigned long flags;
> +       spin_lock_irqsave(&zone->lock, flags);
> +       zone->cluster.disable_depth--;
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
>  #else /* CONFIG_COMPACTION */
>  static inline bool can_skip_merge(struct zone *zone, int order)
>  {
>         return false;
>  }
> +static inline void zone_wait_cluster_alloc(struct zone *zone) {}
> +static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) {}
> +static inline void zone_enable_cluster_alloc(struct zone *zone) {}
>  #endif  /* CONFIG_COMPACTION */
>  #endif /* __MM_INTERNAL_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eb78014dfbde..ac93833a2877 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -746,6 +746,82 @@ static inline void set_page_order(struct page *page, unsigned int order)
>         __SetPageBuddy(page);
>  }
>
> +static inline struct cluster *new_cluster(struct zone *zone, int nr,
> +                                               struct page *tail)
> +{
> +       struct order0_cluster *cluster = &zone->cluster;
> +       int n = find_next_zero_bit(cluster->bitmap, cluster->len, cluster->zero_bit);
> +       if (n == cluster->len) {
> +               printk_ratelimited("node%d zone %s cluster used up\n",
> +                               zone->zone_pgdat->node_id, zone->name);
> +               return NULL;
> +       }
> +       cluster->zero_bit = n;
> +       set_bit(n, cluster->bitmap);
> +       cluster->array[n].nr = nr;
> +       cluster->array[n].tail = tail;
> +       return &cluster->array[n];
> +}
> +
> +static inline struct cluster *add_to_cluster_common(struct page *page,
> +                       struct zone *zone, struct page *neighbor)
> +{
> +       struct cluster *c;
> +
> +       if (neighbor) {
> +               int batch = this_cpu_ptr(zone->pageset)->pcp.batch;
> +               c = neighbor->cluster;
> +               if (c && c->nr < batch) {
> +                       page->cluster = c;
> +                       c->nr++;
> +                       return c;
> +               }
> +       }
> +
> +       c = new_cluster(zone, 1, page);
> +       if (unlikely(!c))
> +               return NULL;
> +
> +       page->cluster = c;
> +       return c;
> +}
> +
> +/*
> + * Add this page to the cluster where the previous head page belongs.
> + * Called after page is added to free_list(and becoming the new head).
> + */
> +static inline void add_to_cluster_head(struct page *page, struct zone *zone,
> +                                      int order, int mt)
> +{
> +       struct page *neighbor;
> +
> +       if (order || !zone->cluster.len)
> +               return;
> +
> +       neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ?
> +                  NULL : list_entry(page->lru.next, struct page, lru);
> +       add_to_cluster_common(page, zone, neighbor);
> +}
> +
> +/*
> + * Add this page to the cluster where the previous tail page belongs.
> + * Called after page is added to free_list(and becoming the new tail).
> + */
> +static inline void add_to_cluster_tail(struct page *page, struct zone *zone,
> +                                      int order, int mt)
> +{
> +       struct page *neighbor;
> +       struct cluster *c;
> +
> +       if (order || !zone->cluster.len)
> +               return;
> +
> +       neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ?
> +                  NULL : list_entry(page->lru.prev, struct page, lru);
> +       c = add_to_cluster_common(page, zone, neighbor);
> +       c->tail = page;
> +}
> +
>  static inline void add_to_buddy_common(struct page *page, struct zone *zone,
>                                         unsigned int order, int mt)
>  {
> @@ -765,6 +841,7 @@ static inline void add_to_buddy_head(struct page *page, struct zone *zone,
>  {
>         add_to_buddy_common(page, zone, order, mt);
>         list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> +       add_to_cluster_head(page, zone, order, mt);
>  }
>
>  static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
> @@ -772,6 +849,7 @@ static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
>  {
>         add_to_buddy_common(page, zone, order, mt);
>         list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
> +       add_to_cluster_tail(page, zone, order, mt);
>  }
>
>  static inline void rmv_page_order(struct page *page)
> @@ -780,9 +858,29 @@ static inline void rmv_page_order(struct page *page)
>         set_page_private(page, 0);
>  }
>
> +/* called before removed from free_list */
> +static inline void remove_from_cluster(struct page *page, struct zone *zone)
> +{
> +       struct cluster *c = page->cluster;
> +       if (!c)
> +               return;
> +
> +       page->cluster = NULL;
> +       c->nr--;
> +       if (!c->nr) {
> +               int bit = c - zone->cluster.array;
> +               c->tail = NULL;
> +               clear_bit(bit, zone->cluster.bitmap);
> +               if (bit < zone->cluster.zero_bit)
> +                       zone->cluster.zero_bit = bit;
> +       } else if (page == c->tail)
> +               c->tail = list_entry(page->lru.prev, struct page, lru);
> +}
> +
>  static inline void remove_from_buddy(struct page *page, struct zone *zone,
>                                         unsigned int order)
>  {
> +       remove_from_cluster(page, zone);
>         list_del(&page->lru);
>         zone->free_area[order].nr_free--;
>         rmv_page_order(page);
> @@ -2025,6 +2123,17 @@ static int move_freepages(struct zone *zone,
>         if (num_movable)
>                 *num_movable = 0;
>
> +       /*
> +        * Cluster alloced pages may have their PageBuddy flag unclear yet
> +        * after dropping zone->lock in rmqueue_bulk() and steal here could
> +        * move them back to free_list. So it's necessary to wait till all
> +        * those pages have their flags properly cleared.
> +        *
> +        * We do not need to disable cluster alloc though since we already
> +        * held zone->lock and no allocation could happen.
> +        */
> +       zone_wait_cluster_alloc(zone);
> +
>         for (page = start_page; page <= end_page;) {
>                 if (!pfn_valid_within(page_to_pfn(page))) {
>                         page++;
> @@ -2049,8 +2158,10 @@ static int move_freepages(struct zone *zone,
>                 }
>
>                 order = page_order(page);
> +               remove_from_cluster(page, zone);
>                 list_move(&page->lru,
>                           &zone->free_area[order].free_list[migratetype]);
> +               add_to_cluster_head(page, zone, order, migratetype);
>                 page += 1 << order;
>                 pages_moved += 1 << order;
>         }
> @@ -2199,7 +2310,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>
>  single_page:
>         area = &zone->free_area[current_order];
> +       remove_from_cluster(page, zone);
>         list_move(&page->lru, &area->free_list[start_type]);
> +       add_to_cluster_head(page, zone, current_order, start_type);
>  }
>
>  /*
> @@ -2460,6 +2573,145 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
>         return page;
>  }
>
> +static int __init zone_order0_cluster_init(void)
> +{
> +       struct zone *zone;
> +
> +       for_each_zone(zone) {
> +               int len, mt, batch;
> +               unsigned long flags;
> +               struct order0_cluster *cluster;
> +
> +               if (!managed_zone(zone))
> +                       continue;
> +
> +               /* no need to enable cluster allocation for batch<=1 zone */
> +               preempt_disable();
> +               batch = this_cpu_ptr(zone->pageset)->pcp.batch;
> +               preempt_enable();
> +               if (batch <= 1)
> +                       continue;
> +
> +               cluster = &zone->cluster;
> +               /* FIXME: possible overflow of int type */
> +               len = DIV_ROUND_UP(zone->managed_pages, batch);
> +               cluster->array = vzalloc(len * sizeof(struct cluster));
> +               if (!cluster->array)
> +                       return -ENOMEM;
> +               cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) *
> +                               sizeof(unsigned long));
> +               if (!cluster->bitmap)
> +                       return -ENOMEM;
> +
> +               spin_lock_irqsave(&zone->lock, flags);
> +               cluster->len = len;
> +               for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +                       struct page *page;
> +                       list_for_each_entry_reverse(page,
> +                                       &zone->free_area[0].free_list[mt], lru)
> +                               add_to_cluster_head(page, zone, 0, mt);
> +               }
> +               spin_unlock_irqrestore(&zone->lock, flags);
> +       }
> +
> +       return 0;
> +}
> +subsys_initcall(zone_order0_cluster_init);
> +
> +static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long count,
> +                                               struct list_head *list, int mt)
> +{
> +       struct list_head *head = &zone->free_area[0].free_list[mt];
> +       int nr = 0;
> +
> +       while (nr < count) {
> +               struct page *head_page;
> +               struct list_head *tail, tmp_list;
> +               struct cluster *c;
> +               int bit;
> +
> +               head_page = list_first_entry_or_null(head, struct page, lru);
> +               if (!head_page || !head_page->cluster)
> +                       break;
> +
> +               c = head_page->cluster;
> +               tail = &c->tail->lru;
> +
> +               /* drop the cluster off free_list and attach to list */
> +               list_cut_position(&tmp_list, head, tail);
> +               list_splice_tail(&tmp_list, list);
> +
> +               nr += c->nr;
> +               zone->free_area[0].nr_free -= c->nr;
> +
> +               /* this cluster is empty now */
> +               c->tail = NULL;
> +               c->nr = 0;
> +               bit = c - zone->cluster.array;
> +               clear_bit(bit, zone->cluster.bitmap);
> +               if (bit < zone->cluster.zero_bit)
> +                       zone->cluster.zero_bit = bit;
> +       }
> +
> +       return nr;
> +}
> +
> +static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
> +                               unsigned long count, struct list_head *list,
> +                               int migratetype)
> +{
> +       int alloced;
> +       struct page *page;
> +
> +       /*
> +        * Cluster alloc races with merging so don't try cluster alloc when we
> +        * can't skip merging. Note that can_skip_merge() keeps the same return
> +        * value from here till all pages have their flags properly processed,
> +        * i.e. the end of the function where in_progress is incremented, even
> +        * we have dropped the lock in the middle because the only place that
> +        * can change can_skip_merge()'s return value is compaction code and
> +        * compaction needs to wait on in_progress.
> +        */
> +       if (!can_skip_merge(zone, 0))
> +               return 0;
> +
> +       /* Cluster alloc is disabled, mostly compaction is already in progress */
> +       if (zone->cluster.disable_depth)
> +               return 0;
> +
> +       /* Cluster alloc is disabled for this zone */
> +       if (unlikely(!zone->cluster.len))
> +               return 0;
> +
> +       alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
> +       if (!alloced)
> +               return 0;
> +
> +       /*
> +        * Cache miss on page structure could slow things down
> +        * dramatically so accessing these alloced pages without holding lock for better performance.
> +        *
> +        * Since these pages still have PageBuddy set, there is a race
> +        * window between now and when PageBuddy is cleared for them
> +        * below. Any operation that would scan a pageblock and check
> +        * PageBuddy(page), e.g. compaction, will need to wait till all
> +        * such pages are properly processed. in_progress is used for
> +        * such purpose so increase it now before dropping the lock.
> +        */
> +       atomic_inc(&zone->cluster.in_progress);
> +       spin_unlock(&zone->lock);
> +
> +       list_for_each_entry(page, list, lru) {
> +               rmv_page_order(page);
> +               page->cluster = NULL;
> +               set_pcppage_migratetype(page, migratetype);
> +       }
> +       atomic_dec(&zone->cluster.in_progress);
> +
> +       return alloced;
> +}
> +
>  /*
>   * Obtain a specified number of elements from the buddy allocator, all under
>   * a single hold of the lock, for efficiency.  Add them to the supplied list.
> @@ -2469,17 +2721,23 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>                         unsigned long count, struct list_head *list,
>                         int migratetype)
>  {
> -       int i, alloced = 0;
> +       int i, alloced;
> +       struct page *page, *tmp;
>
>         spin_lock(&zone->lock);
> -       for (i = 0; i < count; ++i) {
> -               struct page *page = __rmqueue(zone, order, migratetype);
> +       alloced = rmqueue_bulk_cluster(zone, order, count, list, migratetype);
> +       if (alloced > 0) {
> +               if (alloced >= count)
> +                       goto out;
> +               else
> +                       spin_lock(&zone->lock);
> +       }
> +
> +       for (; alloced < count; alloced++) {
> +               page = __rmqueue(zone, order, migratetype);
>                 if (unlikely(page == NULL))
>                         break;
>
> -               if (unlikely(check_pcp_refill(page)))
> -                       continue;
> -
>                 /*
>                  * Split buddy pages returned by expand() are received here in
>                  * physical page order. The page is added to the tail of
> @@ -2491,7 +2749,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>                  * pages are ordered properly.
>                  */
>                 list_add_tail(&page->lru, list);
> -               alloced++;
> +       }
> +       spin_unlock(&zone->lock);
> +
> +out:
> +       i = alloced;
> +       list_for_each_entry_safe(page, tmp, list, lru) {
> +               if (unlikely(check_pcp_refill(page))) {
> +                       list_del(&page->lru);
> +                       alloced--;
> +                       continue;
> +               }
> +
>                 if (is_migrate_cma(get_pcppage_migratetype(page)))
>                         __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>                                               -(1 << order));
> @@ -2504,7 +2773,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>          * pages added to the pcp list.
>          */
>         __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> -       spin_unlock(&zone->lock);
>         return alloced;
>  }
>
> @@ -7744,6 +8012,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>         unsigned long outer_start, outer_end;
>         unsigned int order;
>         int ret = 0;
> +       struct zone *zone = page_zone(pfn_to_page(start));
>
>         struct compact_control cc = {
>                 .nr_migratepages = 0,
> @@ -7786,6 +8055,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>         if (ret)
>                 return ret;
>
> +       zone_wait_and_disable_cluster_alloc(zone);
>         /*
>          * In case of -EBUSY, we'd like to know which page causes problem.
>          * So, just fall through. test_pages_isolated() has a tracepoint
> @@ -7868,6 +8138,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  done:
>         undo_isolate_page_range(pfn_max_align_down(start),
>                                 pfn_max_align_up(end), migratetype);
> +
> +       zone_enable_cluster_alloc(zone);
>         return ret;
>  }
>
> --
> 2.14.3
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20  8:54 ` [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed Aaron Lu
  2018-03-20 11:45   ` Vlastimil Babka
@ 2018-03-20 22:58   ` Figo.zhang
  2018-03-21  1:59     ` Aaron Lu
  1 sibling, 1 reply; 30+ messages in thread
From: Figo.zhang @ 2018-03-20 22:58 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:

> Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> Intel Skylake server showed severe lock contention of zone->lock, as
> high as about 80%(42% on allocation path and 35% on free path) CPU
> cycles are burnt spinning. With perf, the most time consuming part inside
> that lock on free path is cache missing on page structures, mostly on
> the to-be-freed page's buddy due to merging.
>
> One way to avoid this overhead is not do any merging at all for order-0
> pages. With this approach, the lock contention for zone->lock on free
> path dropped to 1.1% but allocation side still has as high as 42% lock
> contention. In the meantime, the dropped lock contention on free side
> doesn't translate to performance increase, instead, it's consumed by
> increased lock contention of the per node lru_lock(rose from 5% to 37%)
> and the final performance slightly dropped about 1%.
>
> Though performance dropped a little, it almost eliminated zone lock
> contention on free path and it is the foundation for the next patch
> that eliminates zone lock contention for allocation path.
>
> A new document file called "struct_page_filed" is added to explain
> the newly reused field in "struct page".
>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  Documentation/vm/struct_page_field |  5 +++
>  include/linux/mm_types.h           |  1 +
>  mm/compaction.c                    | 13 +++++-
>  mm/internal.h                      | 27 ++++++++++++
>  mm/page_alloc.c                    | 89 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 122 insertions(+), 13 deletions(-)
>  create mode 100644 Documentation/vm/struct_page_field
>
> diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
> new file mode 100644
> index 000000000000..1ab6c19ccc7a
> --- /dev/null
> +++ b/Documentation/vm/struct_page_field
> @@ -0,0 +1,5 @@
> +buddy_merge_skipped:
> +Used to indicate this page skipped merging when added to buddy. This
> +field only makes sense if the page is in Buddy and is order zero.
> +It's a bug if any higher order pages in Buddy has this field set.
> +Shares space with index.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fd1af6b9591d..7edc4e102a8e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -91,6 +91,7 @@ struct page {
>                 pgoff_t index;          /* Our offset within mapping. */
>                 void *freelist;         /* sl[aou]b first free object */
>                 /* page_deferred_list().prev    -- second tail page */
> +               bool buddy_merge_skipped; /* skipped merging when added to buddy */
>         };
>
>         union {
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2c8999d027ab..fb9031fdca41 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -776,8 +776,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                  * potential isolation targets.
>                  */
>                 if (PageBuddy(page)) {
> -                       unsigned long freepage_order = page_order_unsafe(page);
> +                       unsigned long freepage_order;
>
> +                       /*
> +                        * If this is a merge_skipped page, do merge now
> +                        * since high-order pages are needed. zone lock
> +                        * isn't taken for the merge_skipped check so the
> +                        * check could be wrong but the worst case is we
> +                        * lose a merge opportunity.
> +                        */
> +                       if (page_merge_was_skipped(page))
> +                               try_to_merge_page(page);
> +
> +                       freepage_order = page_order_unsafe(page);
>                         /*
>                          * Without lock, we cannot be sure that what we got is
>                          * a valid page order. Consider only values in the
>

When system memory is very low, an allocation will fail many times before
it falls into __alloc_pages_direct_compact() and finally gets the
opportunity to do your try_to_merge_page(). Is that really the best timing
to do the order-0 merging?

> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..2bfbaae2d835 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -538,4 +538,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>  }
>
>  void setup_zone_pageset(struct zone *zone);
> +
> +static inline bool page_merge_was_skipped(struct page *page)
> +{
> +       return page->buddy_merge_skipped;
> +}
> +
> +void try_to_merge_page(struct page *page);
> +
> +#ifdef CONFIG_COMPACTION
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +       /* Compaction has failed in this zone, we shouldn't skip merging */
> +       if (zone->compact_considered)
> +               return false;
> +
> +       /* Only consider no_merge for order 0 pages */
> +       if (order)
> +               return false;
> +
> +       return true;
> +}
> +#else /* CONFIG_COMPACTION */
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +       return false;
> +}
> +#endif  /* CONFIG_COMPACTION */
>  #endif /* __MM_INTERNAL_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3cdf1e10d412..eb78014dfbde 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -730,6 +730,16 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>                                 unsigned int order, int migratetype) {}
>  #endif
>
> +static inline void set_page_merge_skipped(struct page *page)
> +{
> +       page->buddy_merge_skipped = true;
> +}
> +
> +static inline void clear_page_merge_skipped(struct page *page)
> +{
> +       page->buddy_merge_skipped = false;
> +}
> +
>  static inline void set_page_order(struct page *page, unsigned int order)
>  {
>         set_page_private(page, order);
> @@ -739,6 +749,13 @@ static inline void set_page_order(struct page *page, unsigned int order)
>  static inline void add_to_buddy_common(struct page *page, struct zone *zone,
>                                         unsigned int order, int mt)
>  {
> +       /*
> +        * Always clear buddy_merge_skipped when added to buddy because
> +        * buddy_merge_skipped shares space with index and index could
> +        * be used as migratetype for PCP pages.
> +        */
> +       clear_page_merge_skipped(page);
> +
>         set_page_order(page, order);
>         zone->free_area[order].nr_free++;
>  }
> @@ -769,6 +786,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
>         list_del(&page->lru);
>         zone->free_area[order].nr_free--;
>         rmv_page_order(page);
> +       clear_page_merge_skipped(page);
>  }
>
>  /*
> @@ -839,7 +857,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>   * -- nyc
>   */
>
> -static inline void __free_one_page(struct page *page,
> +static inline void do_merge(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
>                 int migratetype)
> @@ -851,16 +869,6 @@ static inline void __free_one_page(struct page *page,
>
>         max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
>
> -       VM_BUG_ON(!zone_is_initialized(zone));
> -       VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> -
> -       VM_BUG_ON(migratetype == -1);
> -       if (likely(!is_migrate_isolate(migratetype)))
> -               __mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
> -       VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> -       VM_BUG_ON_PAGE(bad_range(zone, page), page);
> -
>  continue_merging:
>         while (order < max_order - 1) {
>                 buddy_pfn = __find_buddy_pfn(pfn, order);
> @@ -933,6 +941,61 @@ static inline void __free_one_page(struct page *page,
>         add_to_buddy_head(page, zone, order, migratetype);
>  }
>
> +void try_to_merge_page(struct page *page)
> +{
> +       unsigned long pfn, buddy_pfn, flags;
> +       struct page *buddy;
> +       struct zone *zone;
> +
> +       /*
> +        * No need to do merging if buddy is not free.
> +        * zone lock isn't taken so this could be wrong but worst case
> +        * is we lose a merge opportunity.
> +        */
> +       pfn = page_to_pfn(page);
> +       buddy_pfn = __find_buddy_pfn(pfn, 0);
> +       buddy = page + (buddy_pfn - pfn);
> +       if (!PageBuddy(buddy))
> +               return;
> +
> +       zone = page_zone(page);
> +       spin_lock_irqsave(&zone->lock, flags);
> +       /* Verify again after taking the lock */
> +       if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
> +                  PageBuddy(buddy))) {
> +               int mt = get_pageblock_migratetype(page);
> +
> +               remove_from_buddy(page, zone, 0);
> +               do_merge(page, pfn, zone, 0, mt);
> +       }
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static inline void __free_one_page(struct page *page,
> +               unsigned long pfn,
> +               struct zone *zone, unsigned int order,
> +               int migratetype)
> +{
> +       VM_BUG_ON(!zone_is_initialized(zone));
> +       VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> +
> +       VM_BUG_ON(migratetype == -1);
> +       if (likely(!is_migrate_isolate(migratetype)))
> +               __mod_zone_freepage_state(zone, 1 << order, migratetype);
> +
> +       VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> +       VM_BUG_ON_PAGE(bad_range(zone, page), page);
> +
> +       if (can_skip_merge(zone, order)) {
> +               add_to_buddy_head(page, zone, 0, migratetype);
> +               set_page_merge_skipped(page);
> +               return;
> +       }
> +
> +       do_merge(page, pfn, zone, order, migratetype);
> +}
> +
> +
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -1183,8 +1246,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                          * can be offset by reduced memory latency later. To
>                          * avoid excessive prefetching due to large count, only
>                          * prefetch buddy for the last pcp->batch nr of pages.
> +                        *
> +                        * If merge can be skipped, no need to prefetch buddy.
>                          */
> -                       if (count > pcp->batch)
> +                       if (can_skip_merge(zone, 0) || count > pcp->batch)
>                                 continue;
>                         pfn = page_to_pfn(page);
>                         buddy_pfn = __find_buddy_pfn(pfn, 0);
> --
> 2.14.3
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-20 22:29   ` Figo.zhang
@ 2018-03-21  1:52     ` Aaron Lu
  0 siblings, 0 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-21  1:52 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 03:29:33PM -0700, Figo.zhang wrote:
> 2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:
> 
> > Profile on Intel Skylake server shows the most time consuming part
> > under zone->lock on allocation path is accessing those to-be-returned
> > page's "struct page" on the free_list inside zone->lock. One explanation
> > is, different CPUs are releasing pages to the head of free_list and
> > those page's 'struct page' may very well be cache cold for the allocating
> > CPU when it grabs these pages from free_list' head. The purpose here
> > is to avoid touching these pages one by one inside zone->lock.
> >
> > One idea is, we just take the requested number of pages off free_list
> > with something like list_cut_position() and then adjust nr_free of
> > free_area accordingly inside zone->lock and other operations like
> > clearing PageBuddy flag for these pages are done outside of zone->lock.
> >
> 
> Sounds good!
> Is your idea to reduce the lock contention in rmqueue_bulk() by

Right, the idea is to reduce the lock hold time.

> splitting the order-0 free list into two lists, one accessed without
> zone->lock and the other needing zone->lock?

But not by splitting the freelist into two lists; I didn't do that.
I moved part of the work previously done inside the lock to outside of
it, i.e. clearing the PageBuddy flag etc. is now done outside, so that
we do not take the penalty of cache misses on those "struct page"s
inside the lock while all the other CPUs are waiting.
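
The shape of the change, condensed from rmqueue_bulk_cluster() in patch
3/4 (cluster bookkeeping and error handling left out), is roughly:

	spin_lock(&zone->lock);
	alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
	atomic_inc(&zone->cluster.in_progress);
	spin_unlock(&zone->lock);

	list_for_each_entry(page, list, lru) {
		rmv_page_order(page);	/* clear PageBuddy outside the lock */
		page->cluster = NULL;
		set_pcppage_migratetype(page, migratetype);
	}
	atomic_dec(&zone->cluster.in_progress);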

> 
> it seems that it is a big lock granularity holding the zone->lock in
> rmqueue_bulk() ,
> why not we change like it?

It is believed that frequently taking and dropping the lock is worse than
taking it once, doing all the needed work and then dropping it.

> 
> static int rmqueue_bulk(struct zone *zone, unsigned int order,
>             unsigned long count, struct list_head *list,
>             int migratetype, bool cold)
> {
> 
>     for (i = 0; i < count; ++i) {
>         spin_lock(&zone->lock);
>         struct page *page = __rmqueue(zone, order, migratetype);
>        spin_unlock(&zone->lock);
>        ...
>     }

In this case, spin_lock() and spin_unlock() should be outside the loop.

>     __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> 
>     return i;
> }
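
With the lock moved outside the loop as suggested, the snippet above would
look roughly like this (illustrative sketch only; check_pcp_refill() and
the CMA accounting done by the real rmqueue_bulk() are left out):

static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, bool cold)
{
	int i;

	spin_lock(&zone->lock);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype);

		if (unlikely(page == NULL))
			break;

		/* keep pages ordered: add each one to the tail of the list */
		list_add_tail(&page->lru, list);
	}
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
	spin_unlock(&zone->lock);

	return i;
}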

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20 22:58   ` Figo.zhang
@ 2018-03-21  1:59     ` Aaron Lu
  2018-03-21  4:21       ` Figo.zhang
  0 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-21  1:59 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 03:58:51PM -0700, Figo.zhang wrote:
> 2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:
> 
> > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > Intel Skylake server showed severe lock contention of zone->lock, as
> > high as about 80%(42% on allocation path and 35% on free path) CPU
> > cycles are burnt spinning. With perf, the most time consuming part inside
> > that lock on free path is cache missing on page structures, mostly on
> > the to-be-freed page's buddy due to merging.
> >
> > One way to avoid this overhead is to not do any merging at all for order-0
> > pages. With this approach, the lock contention for zone->lock on free
> > path dropped to 1.1% but allocation side still has as high as 42% lock
> > contention. In the meantime, the dropped lock contention on free side
> > doesn't translate to performance increase, instead, it's consumed by
> > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > and the final performance slightly dropped about 1%.
> >
> > Though performance dropped a little, it almost eliminated zone lock
> > contention on free path and it is the foundation for the next patch
> > that eliminates zone lock contention for allocation path.
> >
> > A new document file called "struct_page_field" is added to explain
> > the newly reused field in "struct page".
> >
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  Documentation/vm/struct_page_field |  5 +++
> >  include/linux/mm_types.h           |  1 +
> >  mm/compaction.c                    | 13 +++++-
> >  mm/internal.h                      | 27 ++++++++++++
> >  mm/page_alloc.c                    | 89 ++++++++++++++++++++++++++++++
> > +++-----
> >  5 files changed, 122 insertions(+), 13 deletions(-)
> >  create mode 100644 Documentation/vm/struct_page_field
> >
> > diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_
> > page_field
> > new file mode 100644
> > index 000000000000..1ab6c19ccc7a
> > --- /dev/null
> > +++ b/Documentation/vm/struct_page_field
> > @@ -0,0 +1,5 @@
> > +buddy_merge_skipped:
> > +Used to indicate this page skipped merging when added to buddy. This
> > +field only makes sense if the page is in Buddy and is order zero.
> > +It's a bug if any higher order pages in Buddy has this field set.
> > +Shares space with index.
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index fd1af6b9591d..7edc4e102a8e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -91,6 +91,7 @@ struct page {
> >                 pgoff_t index;          /* Our offset within mapping. */
> >                 void *freelist;         /* sl[aou]b first free object */
> >                 /* page_deferred_list().prev    -- second tail page */
> > +               bool buddy_merge_skipped; /* skipped merging when added to
> > buddy */
> >         };
> >
> >         union {
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 2c8999d027ab..fb9031fdca41 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -776,8 +776,19 @@ isolate_migratepages_block(struct compact_control
> > *cc, unsigned long low_pfn,
> >                  * potential isolation targets.
> >                  */
> >                 if (PageBuddy(page)) {
> > -                       unsigned long freepage_order =
> > page_order_unsafe(page);
> > +                       unsigned long freepage_order;
> >
> > +                       /*
> > +                        * If this is a merge_skipped page, do merge now
> > +                        * since high-order pages are needed. zone lock
> > +                        * isn't taken for the merge_skipped check so the
> > +                        * check could be wrong but the worst case is we
> > +                        * lose a merge opportunity.
> > +                        */
> > +                       if (page_merge_was_skipped(page))
> > +                               try_to_merge_page(page);
> > +
> > +                       freepage_order = page_order_unsafe(page);
> >                         /*
> >                          * Without lock, we cannot be sure that what we
> > got is
> >                          * a valid page order. Consider only values in the
> >
> 
> when the system memory is very very low and allocations have failed a lot and we then

If the system memory is very very low, it doesn't appear there is a need
to do compaction since compaction needs to have enough order 0 pages to
make a high order one.

> go into
> __alloc_pages_direct_compact() and get an opportunity to do your
> try_to_merge_page(), is it the best timing here to
> do order-0 migration?

try_to_merge_page(), as I added in this patch, doesn't do any page
migration, only merging. It will not cause page migration.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-21  1:59     ` Aaron Lu
@ 2018-03-21  4:21       ` Figo.zhang
  2018-03-21  4:53         ` Aaron Lu
  0 siblings, 1 reply; 30+ messages in thread
From: Figo.zhang @ 2018-03-21  4:21 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

2018-03-20 18:59 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:

> On Tue, Mar 20, 2018 at 03:58:51PM -0700, Figo.zhang wrote:
> > 2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:
> >
> > > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > > Intel Skylake server showed severe lock contention of zone->lock, as
> > > high as about 80%(42% on allocation path and 35% on free path) CPU
> > > cycles are burnt spinning. With perf, the most time consuming part
> inside
> > > that lock on free path is cache missing on page structures, mostly on
> > > the to-be-freed page's buddy due to merging.
> > >
> > > One way to avoid this overhead is to not do any merging at all for order-0
> > > pages. With this approach, the lock contention for zone->lock on free
> > > path dropped to 1.1% but allocation side still has as high as 42% lock
> > > contention. In the meantime, the dropped lock contention on free side
> > > doesn't translate to performance increase, instead, it's consumed by
> > > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > > and the final performance slightly dropped about 1%.
> > >
> > > Though performance dropped a little, it almost eliminated zone lock
> > > contention on free path and it is the foundation for the next patch
> > > that eliminates zone lock contention for allocation path.
> > >
> > > A new document file called "struct_page_field" is added to explain
> > > the newly reused field in "struct page".
> > >
> > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > > ---
> > >  Documentation/vm/struct_page_field |  5 +++
> > >  include/linux/mm_types.h           |  1 +
> > >  mm/compaction.c                    | 13 +++++-
> > >  mm/internal.h                      | 27 ++++++++++++
> > >  mm/page_alloc.c                    | 89 ++++++++++++++++++++++++++++++
> > > +++-----
> > >  5 files changed, 122 insertions(+), 13 deletions(-)
> > >  create mode 100644 Documentation/vm/struct_page_field
> > >
> > > diff --git a/Documentation/vm/struct_page_field
> b/Documentation/vm/struct_
> > > page_field
> > > new file mode 100644
> > > index 000000000000..1ab6c19ccc7a
> > > --- /dev/null
> > > +++ b/Documentation/vm/struct_page_field
> > > @@ -0,0 +1,5 @@
> > > +buddy_merge_skipped:
> > > +Used to indicate this page skipped merging when added to buddy. This
> > > +field only makes sense if the page is in Buddy and is order zero.
> > > +It's a bug if any higher order pages in Buddy has this field set.
> > > +Shares space with index.
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index fd1af6b9591d..7edc4e102a8e 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -91,6 +91,7 @@ struct page {
> > >                 pgoff_t index;          /* Our offset within mapping.
> */
> > >                 void *freelist;         /* sl[aou]b first free object
> */
> > >                 /* page_deferred_list().prev    -- second tail page */
> > > +               bool buddy_merge_skipped; /* skipped merging when
> added to
> > > buddy */
> > >         };
> > >
> > >         union {
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 2c8999d027ab..fb9031fdca41 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -776,8 +776,19 @@ isolate_migratepages_block(struct compact_control
> > > *cc, unsigned long low_pfn,
> > >                  * potential isolation targets.
> > >                  */
> > >                 if (PageBuddy(page)) {
> > > -                       unsigned long freepage_order =
> > > page_order_unsafe(page);
> > > +                       unsigned long freepage_order;
> > >
> > > +                       /*
> > > +                        * If this is a merge_skipped page, do merge
> now
> > > +                        * since high-order pages are needed. zone lock
> > > +                        * isn't taken for the merge_skipped check so
> the
> > > +                        * check could be wrong but the worst case is
> we
> > > +                        * lose a merge opportunity.
> > > +                        */
> > > +                       if (page_merge_was_skipped(page))
> > > +                               try_to_merge_page(page);
> > > +
> > > +                       freepage_order = page_order_unsafe(page);
> > >                         /*
> > >                          * Without lock, we cannot be sure that what we
> > > got is
> > >                          * a valid page order. Consider only values in
> the
> > >
> >
> > when the system memory is very very low and try a lot of failures and
> then
>
> >If the system memory is very very low, it doesn't appear there is a need
> >to do compaction since compaction needs to have enough order 0 pages to
> >make a high order one.
>
>
suppose that free_one_page() would normally try to merge to high order at
any time, but now in your patch those merges are postponed; when the
system is in a low memory status, it is very easy to let the system
trigger the low memory state and get poor performance.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-21  4:21       ` Figo.zhang
@ 2018-03-21  4:53         ` Aaron Lu
  2018-03-21  5:59           ` Figo.zhang
  0 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-21  4:53 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 09:21:33PM -0700, Figo.zhang wrote:
> suppose that free_one_page() would normally try to merge to high order at
> any time, but now in your patch those merges are postponed; when the
> system is in a low memory status, it is very easy to let the system
> trigger the low memory state and get poor performance.

Merge or not merge, the size of free memory is not affected.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-21  4:53         ` Aaron Lu
@ 2018-03-21  5:59           ` Figo.zhang
  2018-03-21  7:42             ` Aaron Lu
  0 siblings, 1 reply; 30+ messages in thread
From: Figo.zhang @ 2018-03-21  5:59 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

2018-03-20 21:53 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:

> On Tue, Mar 20, 2018 at 09:21:33PM -0700, Figo.zhang wrote:
> > suppose that free_one_page() would normally try to merge to high order
> > at any time, but now in your patch those merges are postponed; when the
> > system is in a low memory status, it is very easy to let the system
> > trigger the low memory state and get poor performance.
>
> Merge or not merge, the size of free memory is not affected.
>

yes, the total free memory is not impacted, but it will influence higher
order allocation.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-21  5:59           ` Figo.zhang
@ 2018-03-21  7:42             ` Aaron Lu
  0 siblings, 0 replies; 30+ messages in thread
From: Aaron Lu @ 2018-03-21  7:42 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox, Daniel Jordan

On Tue, Mar 20, 2018 at 10:59:16PM -0700, Figo.zhang wrote:
> 2018-03-20 21:53 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:
> 
> > On Tue, Mar 20, 2018 at 09:21:33PM -0700, Figo.zhang wrote:
> > > suppose that free_one_page() would normally try to merge to high order
> > > at any time, but now in your patch those merges are postponed; when the
> > > system is in a low memory status, it is very easy to let the system
> > > trigger the low memory state and get poor performance.
> >
> > Merge or not merge, the size of free memory is not affected.
> >
> 
> yes, the total free memory is not impacted, but it will influence higher
> order allocation.

Yes, that's correct.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20 14:11     ` Aaron Lu
@ 2018-03-21  7:53       ` Vlastimil Babka
  2018-03-22 17:15       ` Matthew Wilcox
  1 sibling, 0 replies; 30+ messages in thread
From: Vlastimil Babka @ 2018-03-21  7:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

On 03/20/2018 03:11 PM, Aaron Lu wrote:
> On Tue, Mar 20, 2018 at 12:45:50PM +0100, Vlastimil Babka wrote:
>> But why, with all the prefetching in place?
> 
> The prefetch is just for its order 0 buddy; if merge happens, then its
> order 1 buddy will also be checked, and so on, so there are many more
> cache misses in merge mode.

I see.
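
(For reference, the cascade being described is the standard merge loop in
__free_one_page(); trimmed of the guard-page and pfn_valid_within()
handling it looks roughly like the below, and each page_is_buddy() check
can be a cache miss on yet another struct page.)

	while (order < MAX_ORDER - 1) {
		buddy_pfn = __find_buddy_pfn(pfn, order);
		buddy = page + (buddy_pfn - pfn);
		if (!page_is_buddy(page, buddy, order))
			break;
		/* merge: pull the buddy out and move up one order */
		list_del(&buddy->lru);
		zone->free_area[order].nr_free--;
		rmv_page_order(buddy);
		combined_pfn = buddy_pfn & pfn;
		page = page + (combined_pfn - pfn);
		pfn = combined_pfn;
		order++;
	}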

>> Not thrilled about such disruptive change in the name of a
>> microbenchmark :/ Shouldn't normally the pcplists hide the overhead?
> 
> Sadly, with the default pcp count, it didn't avoid the lock contention.
> We can of course increase pcp->count to a large enough value to avoid
> entering buddy and thus avoid zone->lock contention, but that would
require the admin to manually change the value on a per-machine,
per-workload basis, I believe.

Well, anyone who really cares about performance has to invest some time
in tuning anyway, I believe?

>> If not, wouldn't it make more sense to turn zone->lock into a range lock?
> 
> Not familiar with range lock, will need to take a look at it, thanks for
> the pointer.

The suggestion was rather quick and not well thought-out. Range lock
itself is insufficient - for merging/splitting buddies it's ok for
working with struct pages because the candidate buddies are within a
MAX_ORDER range. But the freelists contain pages from the whole zone.

>>
>>> A new document file called "struct_page_field" is added to explain
>>> the newly reused field in "struct page".
>>
>> Sounds rather ad-hoc for a single field, I'd rather document it via
>> comments.
> 
> Dave would like to have a document to explain all those "struct page"
> fields that are repurposed under different scenarios and this is the
> very start of the document :-)

Oh, I see.

> I probably should have explained the intent of the document more.
> 
> Thanks for taking a look at this.
> 
>>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>> ---
>>>  Documentation/vm/struct_page_field |  5 +++
>>>  include/linux/mm_types.h           |  1 +
>>>  mm/compaction.c                    | 13 +++++-
>>>  mm/internal.h                      | 27 ++++++++++++
>>>  mm/page_alloc.c                    | 89 +++++++++++++++++++++++++++++++++-----
>>>  5 files changed, 122 insertions(+), 13 deletions(-)
>>>  create mode 100644 Documentation/vm/struct_page_field
>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-20  8:54 ` [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure Aaron Lu
  2018-03-20 22:29   ` Figo.zhang
@ 2018-03-21 12:55   ` Vlastimil Babka
  2018-03-21 15:01     ` Aaron Lu
  1 sibling, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2018-03-21 12:55 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Mel Gorman, Matthew Wilcox,
	Daniel Jordan

On 03/20/2018 09:54 AM, Aaron Lu wrote:
> Profile on Intel Skylake server shows the most time consuming part
> under zone->lock on allocation path is accessing those to-be-returned
> pages' "struct page" on the free_list inside zone->lock. One explanation
> is, different CPUs are releasing pages to the head of free_list and
> those pages' 'struct page' may very well be cache cold for the allocating
> CPU when it grabs these pages from free_list's head. The purpose here
> is to avoid touching these pages one by one inside zone->lock.
> 
> One idea is, we just take the requested number of pages off free_list
> with something like list_cut_position() and then adjust nr_free of
> free_area accordingly inside zone->lock and other operations like
> clearing PageBuddy flag for these pages are done outside of zone->lock.
> 
> list_cut_position() needs to know where to cut, that's what the new
> 'struct cluster' meant to provide. All pages on order 0's free_list
> belongs to a cluster so when a number of pages is needed, the cluster
> to which head page of free_list belongs is checked and then tail page
> of the cluster could be found. With tail page, list_cut_position() can
> be used to drop the cluster off free_list. The 'struct cluster' also has
> 'nr' to tell how many pages this cluster has so nr_free of free_area can
> be adjusted inside the lock too.
> 
> This caused a race window though: from the moment zone->lock is dropped
> till these pages' PageBuddy flags get cleared, these pages are not in
> buddy but still have PageBuddy flag set.
> 
> This doesn't cause problems for users that access buddy pages through
> free_list. But there are other users, like move_freepages() which is
> used to move a pageblock pages from one migratetype to another in
> fallback allocation path, will test PageBuddy flag of a page derived
> from PFN. The end result could be that for pages in the race window,
> they are moved back to free_list of another migratetype. For this
> reason, a synchronization function zone_wait_cluster_alloc() is
> introduced to wait till all pages are in correct state. This function
> is meant to be called with zone->lock held, so after this function
> returns, we do not need to worry about new pages becoming racy state.
> 
> Another user is compaction, where it will scan a pageblock for
> migratable candidates. In this process, pages derived from PFN will
> be checked for PageBuddy flag to decide if it is a merge skipped page.
> To avoid a racy page getting merged back into buddy, the
> zone_wait_and_disable_cluster_alloc() function is introduced to:
> 1 disable clustered allocation by increasing zone->cluster.disable_depth;
> 2 wait till the race window pass by calling zone_wait_cluster_alloc().
> This function is also meant to be called with zone->lock held so after
> it returns, all pages are in correct state and no more cluster alloc
> will be attempted till zone_enable_cluster_alloc() is called to decrease
> zone->cluster.disable_depth.

I'm sorry, but I feel the added complexity here is simply too large to
justify the change. Especially if the motivation seems to be just the
microbenchmark. It would be better if this was motivated by a real
workload where zone lock contention was identified as the main issue,
and we would see the improvements on the workload. We could also e.g.
find out that the problem can be avoided at a different level.

Besides complexity, it may also add overhead to the non-contended case,
i.e. the atomic operations on in_progress. This goes against recent page
allocation optimizations by Mel Gorman etc.

Would perhaps prefetching the next page in freelist (in
remove_from_buddy()) help here?

Vlastimil

> The two patches could eliminate zone->lock contention entirely but at
> the same time, pgdat->lru_lock contention rose to 82%. Final performance
> increased about 8.3%.
> 
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  Documentation/vm/struct_page_field |   5 +
>  include/linux/mm_types.h           |   2 +
>  include/linux/mmzone.h             |  35 +++++
>  mm/compaction.c                    |   4 +
>  mm/internal.h                      |  34 +++++
>  mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++++--
>  6 files changed, 360 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
> index 1ab6c19ccc7a..bab738ea4e0a 100644
> --- a/Documentation/vm/struct_page_field
> +++ b/Documentation/vm/struct_page_field
> @@ -3,3 +3,8 @@ Used to indicate this page skipped merging when added to buddy. This
>  field only makes sense if the page is in Buddy and is order zero.
>  It's a bug if any higher order pages in Buddy has this field set.
>  Shares space with index.
> +
> +cluster:
> +Order 0 Buddy pages are grouped in cluster on free_list to speed up
> +allocation. This field stores the cluster pointer for them.
> +Shares space with mapping.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7edc4e102a8e..49fe9d755a7c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -84,6 +84,8 @@ struct page {
>  		void *s_mem;			/* slab first object */
>  		atomic_t compound_mapcount;	/* first tail page */
>  		/* page_deferred_list().next	 -- second tail page */
> +
> +		struct cluster *cluster;	/* order 0 cluster this page belongs to */
>  	};
>  
>  	/* Second double word */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7522a6987595..09ba9d3cc385 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -355,6 +355,40 @@ enum zone_type {
>  
>  #ifndef __GENERATING_BOUNDS_H
>  
> +struct cluster {
> +	struct page     *tail;  /* tail page of the cluster */
> +	int             nr;     /* how many pages are in this cluster */
> +};
> +
> +struct order0_cluster {
> +	/* order 0 cluster array, dynamically allocated */
> +	struct cluster *array;
> +	/*
> +	 * order 0 cluster array length, also used to indicate if cluster
> +	 * allocation is enabled for this zone(cluster allocation is disabled
> +	 * for small zones whose batch size is smaller than 1, like DMA zone)
> +	 */
> +	int             len;
> +	/*
> +	 * smallest position from where we search for an
> +	 * empty cluster from the cluster array
> +	 */
> +	int		zero_bit;
> +	/* bitmap used to quickly locate an empty cluster from cluster array */
> +	unsigned long   *bitmap;
> +
> +	/* disable cluster allocation to avoid new pages becoming racy state. */
> +	unsigned long	disable_depth;
> +
> +	/*
> +	 * used to indicate if there are pages allocated in cluster mode
> +	 * still in racy state. Caller with zone->lock held could use helper
> +	 * function zone_wait_cluster_alloc() to wait all such pages to exit
> +	 * the race window.
> +	 */
> +	atomic_t        in_progress;
> +};
> +
>  struct zone {
>  	/* Read-mostly fields */
>  
> @@ -459,6 +493,7 @@ struct zone {
>  
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
> +	struct order0_cluster	cluster;
>  
>  	/* zone flags, see below */
>  	unsigned long		flags;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index fb9031fdca41..e71fa82786a1 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1601,6 +1601,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  
>  	migrate_prep_local();
>  
> +	zone_wait_and_disable_cluster_alloc(zone);
> +
>  	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
>  		int err;
>  
> @@ -1699,6 +1701,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  			zone->compact_cached_free_pfn = free_pfn;
>  	}
>  
> +	zone_enable_cluster_alloc(zone);
> +
>  	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
>  	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 2bfbaae2d835..1b0535af1b49 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -557,12 +557,46 @@ static inline bool can_skip_merge(struct zone *zone, int order)
>  	if (order)
>  		return false;
>  
> +	/*
> +	 * Clustered allocation is only disabled when high-order pages
> +	 * are needed, e.g. in compaction and CMA alloc, so we should
> +	 * also skip merging in that case.
> +	 */
> +	if (zone->cluster.disable_depth)
> +		return false;
> +
>  	return true;
>  }
> +
> +static inline void zone_wait_cluster_alloc(struct zone *zone)
> +{
> +	while (atomic_read(&zone->cluster.in_progress))
> +		cpu_relax();
> +}
> +
> +static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
> +{
> +	unsigned long flags;
> +	spin_lock_irqsave(&zone->lock, flags);
> +	zone->cluster.disable_depth++;
> +	zone_wait_cluster_alloc(zone);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static inline void zone_enable_cluster_alloc(struct zone *zone)
> +{
> +	unsigned long flags;
> +	spin_lock_irqsave(&zone->lock, flags);
> +	zone->cluster.disable_depth--;
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
>  #else /* CONFIG_COMPACTION */
>  static inline bool can_skip_merge(struct zone *zone, int order)
>  {
>  	return false;
>  }
> +static inline void zone_wait_cluster_alloc(struct zone *zone) {}
> +static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) {}
> +static inline void zone_enable_cluster_alloc(struct zone *zone) {}
>  #endif  /* CONFIG_COMPACTION */
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eb78014dfbde..ac93833a2877 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -746,6 +746,82 @@ static inline void set_page_order(struct page *page, unsigned int order)
>  	__SetPageBuddy(page);
>  }
>  
> +static inline struct cluster *new_cluster(struct zone *zone, int nr,
> +						struct page *tail)
> +{
> +	struct order0_cluster *cluster = &zone->cluster;
> +	int n = find_next_zero_bit(cluster->bitmap, cluster->len, cluster->zero_bit);
> +	if (n == cluster->len) {
> +		printk_ratelimited("node%d zone %s cluster used up\n",
> +				zone->zone_pgdat->node_id, zone->name);
> +		return NULL;
> +	}
> +	cluster->zero_bit = n;
> +	set_bit(n, cluster->bitmap);
> +	cluster->array[n].nr = nr;
> +	cluster->array[n].tail = tail;
> +	return &cluster->array[n];
> +}
> +
> +static inline struct cluster *add_to_cluster_common(struct page *page,
> +			struct zone *zone, struct page *neighbor)
> +{
> +	struct cluster *c;
> +
> +	if (neighbor) {
> +		int batch = this_cpu_ptr(zone->pageset)->pcp.batch;
> +		c = neighbor->cluster;
> +		if (c && c->nr < batch) {
> +			page->cluster = c;
> +			c->nr++;
> +			return c;
> +		}
> +	}
> +
> +	c = new_cluster(zone, 1, page);
> +	if (unlikely(!c))
> +		return NULL;
> +
> +	page->cluster = c;
> +	return c;
> +}
> +
> +/*
> + * Add this page to the cluster where the previous head page belongs.
> + * Called after page is added to free_list(and becoming the new head).
> + */
> +static inline void add_to_cluster_head(struct page *page, struct zone *zone,
> +				       int order, int mt)
> +{
> +	struct page *neighbor;
> +
> +	if (order || !zone->cluster.len)
> +		return;
> +
> +	neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ?
> +		   NULL : list_entry(page->lru.next, struct page, lru);
> +	add_to_cluster_common(page, zone, neighbor);
> +}
> +
> +/*
> + * Add this page to the cluster where the previous tail page belongs.
> + * Called after page is added to free_list(and becoming the new tail).
> + */
> +static inline void add_to_cluster_tail(struct page *page, struct zone *zone,
> +				       int order, int mt)
> +{
> +	struct page *neighbor;
> +	struct cluster *c;
> +
> +	if (order || !zone->cluster.len)
> +		return;
> +
> +	neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ?
> +		   NULL : list_entry(page->lru.prev, struct page, lru);
> +	c = add_to_cluster_common(page, zone, neighbor);
> +	c->tail = page;
> +}
> +
>  static inline void add_to_buddy_common(struct page *page, struct zone *zone,
>  					unsigned int order, int mt)
>  {
> @@ -765,6 +841,7 @@ static inline void add_to_buddy_head(struct page *page, struct zone *zone,
>  {
>  	add_to_buddy_common(page, zone, order, mt);
>  	list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> +	add_to_cluster_head(page, zone, order, mt);
>  }
>  
>  static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
> @@ -772,6 +849,7 @@ static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
>  {
>  	add_to_buddy_common(page, zone, order, mt);
>  	list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
> +	add_to_cluster_tail(page, zone, order, mt);
>  }
>  
>  static inline void rmv_page_order(struct page *page)
> @@ -780,9 +858,29 @@ static inline void rmv_page_order(struct page *page)
>  	set_page_private(page, 0);
>  }
>  
> +/* called before removed from free_list */
> +static inline void remove_from_cluster(struct page *page, struct zone *zone)
> +{
> +	struct cluster *c = page->cluster;
> +	if (!c)
> +		return;
> +
> +	page->cluster = NULL;
> +	c->nr--;
> +	if (!c->nr) {
> +		int bit = c - zone->cluster.array;
> +		c->tail = NULL;
> +		clear_bit(bit, zone->cluster.bitmap);
> +		if (bit < zone->cluster.zero_bit)
> +			zone->cluster.zero_bit = bit;
> +	} else if (page == c->tail)
> +		c->tail = list_entry(page->lru.prev, struct page, lru);
> +}
> +
>  static inline void remove_from_buddy(struct page *page, struct zone *zone,
>  					unsigned int order)
>  {
> +	remove_from_cluster(page, zone);
>  	list_del(&page->lru);
>  	zone->free_area[order].nr_free--;
>  	rmv_page_order(page);
> @@ -2025,6 +2123,17 @@ static int move_freepages(struct zone *zone,
>  	if (num_movable)
>  		*num_movable = 0;
>  
> +	/*
> +	 * Cluster alloced pages may have their PageBuddy flag unclear yet
> +	 * after dropping zone->lock in rmqueue_bulk() and steal here could
> +	 * move them back to free_list. So it's necessary to wait till all
> +	 * those pages have their flags properly cleared.
> +	 *
> +	 * We do not need to disable cluster alloc though since we already
> +	 * held zone->lock and no allocation could happen.
> +	 */
> +	zone_wait_cluster_alloc(zone);
> +
>  	for (page = start_page; page <= end_page;) {
>  		if (!pfn_valid_within(page_to_pfn(page))) {
>  			page++;
> @@ -2049,8 +2158,10 @@ static int move_freepages(struct zone *zone,
>  		}
>  
>  		order = page_order(page);
> +		remove_from_cluster(page, zone);
>  		list_move(&page->lru,
>  			  &zone->free_area[order].free_list[migratetype]);
> +		add_to_cluster_head(page, zone, order, migratetype);
>  		page += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> @@ -2199,7 +2310,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  
>  single_page:
>  	area = &zone->free_area[current_order];
> +	remove_from_cluster(page, zone);
>  	list_move(&page->lru, &area->free_list[start_type]);
> +	add_to_cluster_head(page, zone, current_order, start_type);
>  }
>  
>  /*
> @@ -2460,6 +2573,145 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
>  	return page;
>  }
>  
> +static int __init zone_order0_cluster_init(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_zone(zone) {
> +		int len, mt, batch;
> +		unsigned long flags;
> +		struct order0_cluster *cluster;
> +
> +		if (!managed_zone(zone))
> +			continue;
> +
> +		/* no need to enable cluster allocation for batch<=1 zone */
> +		preempt_disable();
> +		batch = this_cpu_ptr(zone->pageset)->pcp.batch;
> +		preempt_enable();
> +		if (batch <= 1)
> +			continue;
> +
> +		cluster = &zone->cluster;
> +		/* FIXME: possible overflow of int type */
> +		len = DIV_ROUND_UP(zone->managed_pages, batch);
> +		cluster->array = vzalloc(len * sizeof(struct cluster));
> +		if (!cluster->array)
> +			return -ENOMEM;
> +		cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) *
> +				sizeof(unsigned long));
> +		if (!cluster->bitmap)
> +			return -ENOMEM;
> +
> +		spin_lock_irqsave(&zone->lock, flags);
> +		cluster->len = len;
> +		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +			struct page *page;
> +			list_for_each_entry_reverse(page,
> +					&zone->free_area[0].free_list[mt], lru)
> +				add_to_cluster_head(page, zone, 0, mt);
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(zone_order0_cluster_init);
> +
> +static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long count,
> +						struct list_head *list, int mt)
> +{
> +	struct list_head *head = &zone->free_area[0].free_list[mt];
> +	int nr = 0;
> +
> +	while (nr < count) {
> +		struct page *head_page;
> +		struct list_head *tail, tmp_list;
> +		struct cluster *c;
> +		int bit;
> +
> +		head_page = list_first_entry_or_null(head, struct page, lru);
> +		if (!head_page || !head_page->cluster)
> +			break;
> +
> +		c = head_page->cluster;
> +		tail = &c->tail->lru;
> +
> +		/* drop the cluster off free_list and attach to list */
> +		list_cut_position(&tmp_list, head, tail);
> +		list_splice_tail(&tmp_list, list);
> +
> +		nr += c->nr;
> +		zone->free_area[0].nr_free -= c->nr;
> +
> +		/* this cluster is empty now */
> +		c->tail = NULL;
> +		c->nr = 0;
> +		bit = c - zone->cluster.array;
> +		clear_bit(bit, zone->cluster.bitmap);
> +		if (bit < zone->cluster.zero_bit)
> +			zone->cluster.zero_bit = bit;
> +	}
> +
> +	return nr;
> +}
> +
> +static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
> +				unsigned long count, struct list_head *list,
> +				int migratetype)
> +{
> +	int alloced;
> +	struct page *page;
> +
> +	/*
> +	 * Cluster alloc races with merging so don't try cluster alloc when we
> +	 * can't skip merging. Note that can_skip_merge() keeps the same return
> +	 * value from here till all pages have their flags properly processed,
> +	 * i.e. the end of the function where in_progress is incremented, even
> +	 * we have dropped the lock in the middle because the only place that
> +	 * can change can_skip_merge()'s return value is compaction code and
> +	 * compaction needs to wait on in_progress.
> +	 */
> +	if (!can_skip_merge(zone, 0))
> +		return 0;
> +
> +	/* Cluster alloc is disabled, mostly compaction is already in progress */
> +	if (zone->cluster.disable_depth)
> +		return 0;
> +
> +	/* Cluster alloc is disabled for this zone */
> +	if (unlikely(!zone->cluster.len))
> +		return 0;
> +
> +	alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
> +	if (!alloced)
> +		return 0;
> +
> +	/*
> +	 * Cache miss on page structure could slow things down
> +	 * dramatically so accessing these alloced pages without
> +	 * holding lock for better performance.
> +	 *
> +	 * Since these pages still have PageBuddy set, there is a race
> +	 * window between now and when PageBuddy is cleared for them
> +	 * below. Any operation that would scan a pageblock and check
> +	 * PageBuddy(page), e.g. compaction, will need to wait till all
> +	 * such pages are properly processed. in_progress is used for
> +	 * such purpose so increase it now before dropping the lock.
> +	 */
> +	atomic_inc(&zone->cluster.in_progress);
> +	spin_unlock(&zone->lock);
> +
> +	list_for_each_entry(page, list, lru) {
> +		rmv_page_order(page);
> +		page->cluster = NULL;
> +		set_pcppage_migratetype(page, migratetype);
> +	}
> +	atomic_dec(&zone->cluster.in_progress);
> +
> +	return alloced;
> +}
> +
>  /*
>   * Obtain a specified number of elements from the buddy allocator, all under
>   * a single hold of the lock, for efficiency.  Add them to the supplied list.
> @@ -2469,17 +2721,23 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype)
>  {
> -	int i, alloced = 0;
> +	int i, alloced;
> +	struct page *page, *tmp;
>  
>  	spin_lock(&zone->lock);
> -	for (i = 0; i < count; ++i) {
> -		struct page *page = __rmqueue(zone, order, migratetype);
> +	alloced = rmqueue_bulk_cluster(zone, order, count, list, migratetype);
> +	if (alloced > 0) {
> +		if (alloced >= count)
> +			goto out;
> +		else
> +			spin_lock(&zone->lock);
> +	}
> +
> +	for (; alloced < count; alloced++) {
> +		page = __rmqueue(zone, order, migratetype);
>  		if (unlikely(page == NULL))
>  			break;
>  
> -		if (unlikely(check_pcp_refill(page)))
> -			continue;
> -
>  		/*
>  		 * Split buddy pages returned by expand() are received here in
>  		 * physical page order. The page is added to the tail of
> @@ -2491,7 +2749,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->lru, list);
> -		alloced++;
> +	}
> +	spin_unlock(&zone->lock);
> +
> +out:
> +	i = alloced;
> +	list_for_each_entry_safe(page, tmp, list, lru) {
> +		if (unlikely(check_pcp_refill(page))) {
> +			list_del(&page->lru);
> +			alloced--;
> +			continue;
> +		}
> +
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
> @@ -2504,7 +2773,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> -	spin_unlock(&zone->lock);
>  	return alloced;
>  }
>  
> @@ -7744,6 +8012,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	unsigned long outer_start, outer_end;
>  	unsigned int order;
>  	int ret = 0;
> +	struct zone *zone = page_zone(pfn_to_page(start));
>  
>  	struct compact_control cc = {
>  		.nr_migratepages = 0,
> @@ -7786,6 +8055,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	if (ret)
>  		return ret;
>  
> +	zone_wait_and_disable_cluster_alloc(zone);
>  	/*
>  	 * In case of -EBUSY, we'd like to know which page causes problem.
>  	 * So, just fall through. test_pages_isolated() has a tracepoint
> @@ -7868,6 +8138,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  done:
>  	undo_isolate_page_range(pfn_max_align_down(start),
>  				pfn_max_align_up(end), migratetype);
> +
> +	zone_enable_cluster_alloc(zone);
>  	return ret;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-21 12:55   ` Vlastimil Babka
@ 2018-03-21 15:01     ` Aaron Lu
  2018-03-29 19:16       ` Daniel Jordan
  0 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-21 15:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Mel Gorman,
	Matthew Wilcox, Daniel Jordan

Hi Vlastimil,

Thanks for taking the time to review the patch, I appreciate that.

On Wed, Mar 21, 2018 at 01:55:01PM +0100, Vlastimil Babka wrote:
> On 03/20/2018 09:54 AM, Aaron Lu wrote:
> > Profile on Intel Skylake server shows the most time consuming part
> > under zone->lock on allocation path is accessing those to-be-returned
> > pages' "struct page" on the free_list inside zone->lock. One explanation
> > is, different CPUs are releasing pages to the head of free_list and
> > those pages' 'struct page' may very well be cache cold for the allocating
> > CPU when it grabs these pages from free_list's head. The purpose here
> > is to avoid touching these pages one by one inside zone->lock.
> > 
> > One idea is, we just take the requested number of pages off free_list
> > with something like list_cut_position() and then adjust nr_free of
> > free_area accordingly inside zone->lock and other operations like
> > clearing PageBuddy flag for these pages are done outside of zone->lock.
> > 
> > list_cut_position() needs to know where to cut, that's what the new
> > 'struct cluster' meant to provide. All pages on order 0's free_list
> > belongs to a cluster so when a number of pages is needed, the cluster
> > to which head page of free_list belongs is checked and then tail page
> > of the cluster could be found. With tail page, list_cut_position() can
> > be used to drop the cluster off free_list. The 'struct cluster' also has
> > 'nr' to tell how many pages this cluster has so nr_free of free_area can
> > be adjusted inside the lock too.
> > 
> > This caused a race window though: from the moment zone->lock is dropped
> > till these pages' PageBuddy flags get cleared, these pages are not in
> > buddy but still have PageBuddy flag set.
> > 
> > This doesn't cause problems for users that access buddy pages through
> > free_list. But there are other users, like move_freepages() which is
> > used to move a pageblock pages from one migratetype to another in
> > fallback allocation path, will test PageBuddy flag of a page derived
> > from PFN. The end result could be that for pages in the race window,
> > they are moved back to free_list of another migratetype. For this
> > reason, a synchronization function zone_wait_cluster_alloc() is
> > introduced to wait till all pages are in correct state. This function
> > is meant to be called with zone->lock held, so after this function
> > returns, we do not need to worry about new pages becoming racy state.
> > 
> > Another user is compaction, where it will scan a pageblock for
> > migratable candidates. In this process, pages derived from PFN will
> > be checked for PageBuddy flag to decide if it is a merge skipped page.
> > To avoid a racy page getting merged back into buddy, the
> > zone_wait_and_disable_cluster_alloc() function is introduced to:
> > 1 disable clustered allocation by increasing zone->cluster.disable_depth;
> > 2 wait till the race window pass by calling zone_wait_cluster_alloc().
> > This function is also meant to be called with zone->lock held so after
> > it returns, all pages are in correct state and no more cluster alloc
> > will be attempted till zone_enable_cluster_alloc() is called to decrease
> > zone->cluster.disable_depth.
> 
> I'm sorry, but I feel the added complexity here is simply too large to
> justify the change. Especially if the motivation seems to be just the
> microbenchmark. It would be better if this was motivated by a real
> workload where zone lock contention was identified as the main issue,
> and we would see the improvements on the workload. We could also e.g.
> find out that the problem can be avoided at a different level.

One thing I'm aware of is that there is some app that consumes a ton of
memory, and when it misbehaves or crashes, it takes some 10-20 minutes
for it to exit (munmap() takes a long time to free all that consumed
memory).

THP could help a lot, but it's beyond my understanding why they didn't
use it.

> 
> Besides complexity, it may also add overhead to the non-contended case,
> i.e. the atomic operations on in_progress. This goes against recent page
> allocation optimizations by Mel Gorman etc.
> 
> Would perhaps prefetching the next page in freelist (in
> remove_from_buddy()) help here?

I'm afraid there isn't a big enough window for prefetch() to actually make
a difference, but I could give it a try.

We also considered prefetching the free_list before taking the lock, but
that prefetch could be useless (i.e. the prefetched page can be taken by
another CPU and be gone after we actually acquire the lock), and iterating
the list outside the lock can be dangerous.
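
For reference, what I understand the suggestion to be is something like
the below (hypothetical sketch only, not part of this series;
remove_from_buddy() as introduced by patch 1/4 plus patch 3/4):

static inline void remove_from_buddy(struct page *page, struct zone *zone,
					unsigned int order)
{
	/*
	 * Hypothetical: warm up the struct page of the next free_list
	 * entry while zone->lock is still held. If @page is the tail,
	 * this prefetches the list head, which is harmless.
	 */
	prefetch(list_next_entry(page, lru));

	remove_from_cluster(page, zone);
	list_del(&page->lru);
	zone->free_area[order].nr_free--;
	rmv_page_order(page);
}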

> > The two patches could eliminate zone->lock contention entirely but at
> > the same time, pgdat->lru_lock contention rose to 82%. Final performance
> > increased about 8.3%.
> > 
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  Documentation/vm/struct_page_field |   5 +
> >  include/linux/mm_types.h           |   2 +
> >  include/linux/mmzone.h             |  35 +++++
> >  mm/compaction.c                    |   4 +
> >  mm/internal.h                      |  34 +++++
> >  mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++++--
> >  6 files changed, 360 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
                   ` (3 preceding siblings ...)
  2018-03-20  8:54 ` [RFC PATCH v2 4/4] mm/free_pcppages_bulk: reduce overhead of cluster operation on free path Aaron Lu
@ 2018-03-21 17:44 ` Daniel Jordan
  2018-03-22  1:30   ` Aaron Lu
  2018-03-29 19:19 ` Daniel Jordan
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2018-03-21 17:44 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, Vlastimil Babka
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Mel Gorman, Matthew Wilcox

On 03/20/2018 04:54 AM, Aaron Lu wrote:
...snip...
> reduced zone->lock contention on free path from 35% to 1.1%. Also, it
> shows good result on parallel free(*) workload by reducing zone->lock
> contention from 90% to almost zero(lru lock increased from almost 0 to
> 90% though).

Hi Aaron, I'm looking through your series now.  Just wanted to mention that I'm seeing the same interaction between zone->lock and lru_lock in my own testing.  IOW, it's not enough to fix just one or the other: both need attention to get good performance on a big system, at least in this microbenchmark we've both been using.

There's anti-scaling at high core counts where overall system page faults per second actually decrease with more CPUs added to the test.  This happens when either zone->lock or lru_lock contention are completely removed, but the anti-scaling goes away when both locks are fixed.

Anyway, I'll post some actual data on this stuff soon.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-21 17:44 ` [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Daniel Jordan
@ 2018-03-22  1:30   ` Aaron Lu
  2018-03-22 11:20     ` Daniel Jordan
  0 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-22  1:30 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman, Matthew Wilcox

On Wed, Mar 21, 2018 at 01:44:25PM -0400, Daniel Jordan wrote:
> On 03/20/2018 04:54 AM, Aaron Lu wrote:
> ...snip...
> > reduced zone->lock contention on free path from 35% to 1.1%. Also, it
> > shows good result on parallel free(*) workload by reducing zone->lock
> > contention from 90% to almost zero(lru lock increased from almost 0 to
> > 90% though).
> 
> Hi Aaron, I'm looking through your series now.  Just wanted to mention that I'm seeing the same interaction between zone->lock and lru_lock in my own testing.  IOW, it's not enough to fix just one or the other: both need attention to get good performance on a big system, at least in this microbenchmark we've both been using.

Agree.

> 
> There's anti-scaling at high core counts where overall system page faults per second actually decrease with more CPUs added to the test.  This happens when either zone->lock or lru_lock contention are completely removed, but the anti-scaling goes away when both locks are fixed.
> 
> Anyway, I'll post some actual data on this stuff soon.

Looking forward to that, thanks.

In the meantime, I'll also try your lru_lock optimization work on top of
this patchset to see if the lock contention shifts back to zone->lock.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-22  1:30   ` Aaron Lu
@ 2018-03-22 11:20     ` Daniel Jordan
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2018-03-22 11:20 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman, Matthew Wilcox

On 03/21/2018 09:30 PM, Aaron Lu wrote:
> On Wed, Mar 21, 2018 at 01:44:25PM -0400, Daniel Jordan wrote:
>> On 03/20/2018 04:54 AM, Aaron Lu wrote:
>> ...snip...
>>> reduced zone->lock contention on free path from 35% to 1.1%. Also, it
>>> shows good result on parallel free(*) workload by reducing zone->lock
>>> contention from 90% to almost zero(lru lock increased from almost 0 to
>>> 90% though).
>>
>> Hi Aaron, I'm looking through your series now.  Just wanted to mention that I'm seeing the same interaction between zone->lock and lru_lock in my own testing.  IOW, it's not enough to fix just one or the other: both need attention to get good performance on a big system, at least in this microbenchmark we've both been using.
> 
> Agree.
> 
>>
>> There's anti-scaling at high core counts where overall system page faults per second actually decrease with more CPUs added to the test.  This happens when either zone->lock or lru_lock contention are completely removed, but the anti-scaling goes away when both locks are fixed.
>>
>> Anyway, I'll post some actual data on this stuff soon.
> 
> Looking forward to that, thanks.
> 
> In the meantime, I'll also try your lru_lock optimization work on top of
> this patchset to see if the lock contention shifts back to zone->lock.

The lru_lock series I posted is pretty outdated by now, and I've got a 
totally new approach I plan to post soon, so it might make sense to wait 
for that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-20 14:11     ` Aaron Lu
  2018-03-21  7:53       ` Vlastimil Babka
@ 2018-03-22 17:15       ` Matthew Wilcox
  2018-03-22 18:39         ` Daniel Jordan
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-03-22 17:15 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman, Daniel Jordan

On Tue, Mar 20, 2018 at 10:11:01PM +0800, Aaron Lu wrote:
> > > A new document file called "struct_page_field" is added to explain
> > > the newly reused field in "struct page".
> > 
> > Sounds rather ad-hoc for a single field, I'd rather document it via
> > comments.
> 
> Dave would like to have a document to explain all those "struct page"
> fields that are repurposed under different scenarios and this is the
> very start of the document :-)
> 
> I probably should have explained the intent of the document more.

Dave and I are in agreement on "Shouldn't struct page be better documented".
I came up with this a few weeks ago; never quite got round to turning it
into a patch:

+---+-----------+-----------+--------------+----------+--------+--------------+
| B | slab      | pagecache | tail 1       | anon     | tail 1 | hugetlb      |
+===+===========+===========+==============+==========+========+==============+
| 0 | flags                                                                   |
+---+                                                                         |
| 4 |                                                                         |
+---+-----------+-----------+--------------+----------+--------+--------------+
| 8 | s_mem     | mapping   | cmp_mapcount | anon_vma | defer  | mapping      |
+---+           |           +--------------+          | list   |              |
|12 |           |           |              |          |        |              |
+---+-----------+-----------+--------------+----------+        +--------------+
|16 | freelist  | index                               |        | index        |
+---+           |                                     |        | (shifted)    |
|20 |           |                                     |        |              |
+---+-----------+-------------------------------------+--------+--------------+
|24 | counters  | mapcount                                                    |
+---+           +-----------+--------------+----------+--------+--------------+
|28 |           | refcount  |              |          |        | refcount     |
+---+-----------+-----------+--------------+----------+--------+--------------+
|32 | next      | lru       | cmpd_head    |                   | lru          |
+---+           |           |              +-------------------+              +
|36 |           |           |              |                   |              |
+---+-----------+           +--------------+-------------------+              +
|40 | pages     |           | dtor / order |                   |              |
+---+-----------+           +--------------+-------------------+              +
|44 | pobjects  |           |              |                   |              |
+---+-----------+-----------+--------------+----------------------------------+
|48 | slb_cache | private   |              |                                  |
+---+           |           +--------------+----------------------------------+
|52 |           |           |              |                                  |
+---+-----------+-----------+--------------+----------------------------------+

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-22 17:15       ` Matthew Wilcox
@ 2018-03-22 18:39         ` Daniel Jordan
  2018-03-22 18:50           ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2018-03-22 18:39 UTC (permalink / raw)
  To: Matthew Wilcox, Aaron Lu
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman



On 03/22/2018 01:15 PM, Matthew Wilcox wrote:
> On Tue, Mar 20, 2018 at 10:11:01PM +0800, Aaron Lu wrote:
>>>> A new document file called "struct_page_filed" is added to explain
>>>> the newly reused field in "struct page".
>>>
>>> Sounds rather ad-hoc for a single field, I'd rather document it via
>>> comments.
>>
>> Dave would like to have a document to explain all those "struct page"
>> fields that are repurposed under different scenarios and this is the
>> very start of the document :-)
>>
>> I probably should have explained the intent of the document more.
> 
> Dave and I are in agreement on "Shouldn't struct page be better documented".
> I came up with this a few weeks ago; never quite got round to turning it
> into a patch:
> 
> +---+-----------+-----------+--------------+----------+--------+--------------+
> | B | slab      | pagecache | tail 1       | anon     | tail 1 | hugetlb      |
> +===+===========+===========+==============+==========+========+==============+
> | 0 | flags                                                                   |
> +---+                                                                         |
> | 4 |                                                                         |
> +---+-----------+-----------+--------------+----------+--------+--------------+
> | 8 | s_mem     | mapping   | cmp_mapcount | anon_vma | defer  | mapping      |
> +---+           |           +--------------+          | list   |              |
> |12 |           |           |              |          |        |              |
> +---+-----------+-----------+--------------+----------+        +--------------+
> |16 | freelist  | index                               |        | index        |
> +---+           |                                     |        | (shifted)    |
> |20 |           |                                     |        |              |
> +---+-----------+-------------------------------------+--------+--------------+
> |24 | counters  | mapcount                                                    |
> +---+           +-----------+--------------+----------+--------+--------------+
> |28 |           | refcount  |              |          |        | refcount     |
> +---+-----------+-----------+--------------+----------+--------+--------------+
> |32 | next      | lru       | cmpd_head    |                   | lru          |
> +---+           |           |              +-------------------+              +
> |36 |           |           |              |                   |              |
> +---+-----------+           +--------------+-------------------+              +
> |40 | pages     |           | dtor / order |                   |              |
> +---+-----------+           +--------------+-------------------+              +
> |44 | pobjects  |           |              |                   |              |
> +---+-----------+-----------+--------------+----------------------------------+
> |48 | slb_cache | private   |              |                                  |
> +---+           |           +--------------+----------------------------------+
> |52 |           |           |              |                                  |
> +---+-----------+-----------+--------------+----------------------------------+

Shouldn't the anon column also contain lru?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed
  2018-03-22 18:39         ` Daniel Jordan
@ 2018-03-22 18:50           ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2018-03-22 18:50 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Aaron Lu, Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman

On Thu, Mar 22, 2018 at 02:39:17PM -0400, Daniel Jordan wrote:
> Shouldn't the anon column also contain lru?

Probably; I didn't finish investigating everything.  There should probably
also be another column for swap pages.  Maybe some other users use a
significant amount of the struct page ... ?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
  2018-03-21 15:01     ` Aaron Lu
@ 2018-03-29 19:16       ` Daniel Jordan
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2018-03-29 19:16 UTC (permalink / raw)
  To: Aaron Lu, Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Mel Gorman,
	Matthew Wilcox

On 03/21/2018 11:01 AM, Aaron Lu wrote:
>> I'm sorry, but I feel the added complexity here is simply too large to
>> justify the change. Especially if the motivation seems to be just the
>> microbenchmark. It would be better if this was motivated by a real
>> workload where zone lock contention was identified as the main issue,
>> and we would see the improvements on the workload. We could also e.g.
>> find out that the problem can be avoided at a different level.
> 
> One thing I'm aware of is that there is an app that consumes a ton of
> memory, and when it misbehaves or crashes, it takes some 10-20 minutes
> to exit (munmap() takes a long time to free all that consumed memory).
> 
> THP could help a lot, but it's beyond my understanding why they didn't
> use it.

One of our apps has the same issue with taking a long time to exit.  The 
time is in the kernel's munmap/exit path.

Also, Vlastimil, to your point about real workloads, I've seen 
zone->lock and lru_lock heavily contended in a decision support 
benchmark.  Setting the pcp list sizes artificially high with 
percpu_pagelist_fraction didn't make it go any faster, but given that 
Aaron and I have seen the contention shift to lru_lock in this case, I'm 
curious what will happen to the benchmark when both locks are no longer 
contended.  Will report back once this experiment is done.
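
For anyone wanting to repeat the pcp experiment: my recollection of how
the fraction maps to the per-cpu list limits is sketched below.  Treat
the divisor of 4 and the cap as assumptions on my part (I'm going from
memory of pageset_set_high_and_batch()), and the function name is just
for illustration:

/*
 * Rough sketch of how percpu_pagelist_fraction gets applied per zone.
 */
static void sketch_pcp_limits(unsigned long zone_managed_pages,
			      unsigned int fraction,
			      unsigned long *high, unsigned long *batch)
{
	*high = zone_managed_pages / fraction;	/* pages a cpu may hold on its pcp lists */
	*batch = *high / 4;			/* refill/drain chunk size */
	if (*batch < 1)
		*batch = 1;
	/*
	 * The real code also caps batch, IIRC.  The point is that a small
	 * fraction blows up 'high', so order-0 frees sit in the pcp lists
	 * and rarely have to take zone->lock to go back to buddy.
	 */
}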

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
                   ` (4 preceding siblings ...)
  2018-03-21 17:44 ` [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Daniel Jordan
@ 2018-03-29 19:19 ` Daniel Jordan
  2018-03-30  1:42   ` Aaron Lu
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2018-03-29 19:19 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 1077 bytes --]

On 03/20/2018 04:54 AM, Aaron Lu wrote:
> This series is meant to improve zone->lock scalability for order 0 pages.
> With will-it-scale/page_fault1 workload, on a 2 sockets Intel Skylake
> server with 112 CPUs, CPU spend 80% of its time spinning on zone->lock.
> Perf profile shows the most time consuming part under zone->lock is the
> cache miss on "struct page", so here I'm trying to avoid those cache
> misses.

I ran page_fault1 comparing 4.16-rc5 to your recent work, these four 
patches plus the three others from your github branch zone_lock_rfc_v2. 
Out of curiosity I also threw in another 4.16-rc5 with the pcp batch 
size adjusted so high (10922 pages) that we always stay in the pcp lists 
and out of buddy completely.  I used your patch[*] in this last kernel.

This was on a 2-socket, 20-core broadwell server.

There were some small regressions a bit outside the noise at low process 
counts (2-5) but I'm not sure they're repeatable.  Anyway, it does 
improve the microbenchmark across the board.

[*] lkml.kernel.org/r/20170919072342.GB7263@intel.com
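
(For anyone not familiar with the workload, each page_fault1 task does
essentially the loop below.  This is a paraphrase from memory rather
than the will-it-scale source, and the 128MB region size is an
assumption.)

#define _GNU_SOURCE
#include <sys/mman.h>

#define MAP_SIZE	(128UL * 1024 * 1024)	/* assumed per-iteration region size */

/*
 * One task's inner loop: fault in anonymous memory one page at a time
 * (allocation path), then munmap it all at once (free path).
 * page_size would be sysconf(_SC_PAGESIZE).
 */
static unsigned long page_fault1_iteration(long page_size)
{
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long faults = 0;

	if (p == MAP_FAILED)
		return 0;

	for (char *c = p; c < p + MAP_SIZE; c += page_size) {
		*c = 1;			/* write fault -> order-0 page allocation */
		faults++;
	}

	munmap(p, MAP_SIZE);		/* bulk free -> the zone->lock free path */
	return faults;
}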


[-- Attachment #2: gnuplot-pgfs-vs-ntask-iter1.png --]
[-- Type: image/png, Size: 87315 bytes --]

[-- Attachment #3: pgfs-vs-ntask-iter1.csv --]
[-- Type: text/csv, Size: 4563 bytes --]

,,586305.0,747,587731.0,1766
4.0,3.4,609505.0,1563,608007.0,1170
8.0,5.9,633145.0,1752,622690.0,1287
,,1131428.0,7397,1022890.0,7334
-1.0,-0.3,1119974.0,2558,1020102.0,5707
3.0,3.3,1165004.0,6232,1056689.0,6411
,,1590413.0,6412,1346900.0,8925
-0.7,1.0,1579816.0,7217,1360376.0,4418
2.3,3.4,1626925.0,5321,1392515.0,8180
,,2064476.0,5035,1656475.0,14714
-1.0,0.9,2043240.0,9070,1672036.0,5342
1.2,3.9,2089959.0,8287,1721614.0,7797
,,2486090.0,11178,1878085.0,15286
-0.1,1.1,2483021.0,15100,1898459.0,9295
1.3,4.0,2517602.0,13717,1952995.0,7481
,,2869756.0,9194,2058398.0,20580
0.4,3.5,2882220.0,20584,2129444.0,9689
2.4,6.0,2937618.0,14126,2182859.0,9650
,,3242589.0,15354,2231977.0,20188
0.9,3.2,3270796.0,15125,2303780.0,6607
2.0,6.7,3306528.0,17683,2381279.0,16507
,,3598209.0,10765,2361819.0,13509
1.2,4.5,3642407.0,14894,2469191.0,16250
2.0,8.1,3671834.0,17501,2552786.0,12112
,,3974345.0,12605,2511565.0,31986
2.3,4.8,4067553.0,11070,2632608.0,8158
2.5,9.8,4073111.0,12464,2758075.0,31433
,,4333026.0,12187,2636914.0,15065
2.4,5.8,4435852.0,21692,2789949.0,16400
3.2,10.3,4470666.0,23663,2907263.0,15052
,,4932423.0,12184,2675769.0,23925
2.7,3.6,5064666.0,18600,2771476.0,22438
3.6,7.9,5110434.0,21460,2888419.0,18181
,,5461255.0,14704,2631232.0,24390
1.7,2.5,5554957.0,20979,2697554.0,19370
3.1,5.8,5629143.0,22781,2782902.0,20347
,,5924367.0,11835,2445607.0,25821
1.4,4.9,6004723.0,19071,2566547.0,26031
2.9,8.4,6094087.0,17676,2651793.0,20051
,,6381611.0,16792,2277611.0,39558
1.2,4.4,6459693.0,18869,2377795.0,25094
2.4,12.4,6534837.0,24991,2560085.0,12638
,,6804737.0,13654,2232121.0,19409
1.1,4.8,6881730.0,18995,2338868.0,18923
2.4,12.5,6970318.0,28677,2510594.0,25891
,,7197145.0,17500,2313168.0,16694
1.2,4.5,7287072.0,28727,2418232.0,23120
2.6,7.6,7383613.0,17992,2489382.0,28385
,,7550498.0,15101,2226427.0,24769
1.6,4.4,7667641.0,24306,2324675.0,16265
2.8,5.3,7761917.0,29855,2345195.0,20660
,,7902794.0,12579,2188399.0,37454
1.7,6.0,8033876.0,21158,2320641.0,13053
2.8,8.9,8126732.0,27620,2383083.0,25173
,,8277506.0,15448,2198021.0,33075
2.1,6.4,8453255.0,17411,2339395.0,20221
3.0,7.7,8529130.0,29853,2366800.0,20139
,,8651034.0,19706,2239626.0,25694
2.2,3.2,8840988.0,24387,2311966.0,25694
3.1,9.3,8918721.0,34762,2448589.0,26228
,,8777023.0,11833,2259622.0,32155
2.5,5.7,8993348.0,25481,2389000.0,29464
3.5,6.3,9085319.0,27791,2401713.0,28706
,,8855202.0,27455,2268030.0,35914
3.0,4.1,9123705.0,34289,2361876.0,31917
4.2,11.9,9228843.0,33484,2536867.0,26001
,,8952897.0,21601,2280539.0,30530
3.5,10.1,9268365.0,30804,2510883.0,26312
4.6,10.3,9367048.0,30681,2514740.0,32897
,,9036582.0,19483,2374728.0,42892
3.7,2.2,9369541.0,33253,2425993.0,45953
5.0,4.5,9489640.0,28066,2482179.0,33821
,,9136041.0,18233,2336090.0,30037
4.0,6.1,9497501.0,34409,2478602.0,31960
4.7,10.5,9563832.0,34507,2581056.0,38516
,,9226630.0,17998,2326070.0,33782
3.7,7.8,9570547.0,27052,2508396.0,41848
4.4,10.4,9634192.0,40218,2566842.0,31116
,,9305784.0,24574,2391261.0,29548
3.8,6.1,9656252.0,39164,2536624.0,45738
4.5,5.2,9720210.0,31172,2516447.0,32348
,,9381004.0,19378,2442774.0,35745
2.6,1.3,9626125.0,65187,2474560.0,29978
4.1,4.4,9766045.0,54298,2549227.0,31200
,,9401844.0,27746,2456372.0,40550
2.7,3.8,9652161.0,51629,2549004.0,39991
3.5,6.9,9731681.0,51852,2625589.0,27822
,,9428320.0,17562,2509119.0,39752
2.1,6.1,9630472.0,50106,2662447.0,44347
3.1,5.7,9722152.0,50349,2651891.0,28519
,,9561062.0,21910,2392883.0,24181
2.0,7.1,9755774.0,60382,2563573.0,32132
3.5,15.0,9894735.0,45967,2752506.0,28517
,,9624859.0,30462,2480667.0,27055
2.7,5.5,9883943.0,61851,2618326.0,36656
4.5,15.4,10057320.0,46352,2863788.0,34022
,,9739896.0,35436,2476666.0,30301
3.1,8.2,10043706.0,60944,2680570.0,42346
4.6,16.8,10191082.0,51348,2893385.0,32717
,,9833955.0,39366,2628480.0,36567
3.5,2.7,10180871.0,50050,2699941.0,42805
5.0,7.3,10323136.0,50768,2820628.0,30552
,,9908832.0,20826,2666415.0,51379
3.5,0.5,10251385.0,58551,2679925.0,49144
5.1,5.7,10418155.0,51726,2817192.0,34043
,,9969311.0,20378,2563399.0,36720
3.5,4.8,10314449.0,60867,2686176.0,42926
5.4,9.1,10504881.0,53101,2796816.0,37461
,,10077169.0,36182,2584728.0,32672
3.1,7.4,10393453.0,63048,2775523.0,39745
4.7,11.9,10549870.0,45281,2893000.0,39102
,,10115997.0,25835,2653036.0,33259
2.7,5.7,10388901.0,63402,2803290.0,36021
4.6,11.2,10580796.0,63517,2949834.0,31422
,,10162757.0,33119,2681195.0,30592
2.5,3.0,10413010.0,76720,2761752.0,32472
4.0,9.5,10568061.0,65614,2935127.0,38463
,,10223472.0,41882,2670421.0,26049
2.4,5.0,10470977.0,58009,2803478.0,37111
4.1,7.4,10646450.0,54810,2868986.0,52724


[-- Attachment #4: pgfs-vs-ntask-iter1.txt --]
[-- Type: text/plain, Size: 10165 bytes --]

     kernel (#)  ntask     proc      thr        proc    stdev         thr    stdev
                        speedup  speedup       pgf/s                pgf/s         
   4.16-rc5 (1)      1                       586,305      747     587,731    1,766
    lu-zone (2)      1     4.0%     3.4%     609,505    1,562     608,007    1,169
4.16-rc5-nz (3)      1     8.0%     5.9%     633,145    1,752     622,690    1,286

   4.16-rc5 (1)      2                     1,131,428    7,396   1,022,890    7,333
    lu-zone (2)      2    -1.0%    -0.3%   1,119,974    2,557   1,020,102    5,706
4.16-rc5-nz (3)      2     3.0%     3.3%   1,165,004    6,232   1,056,689    6,411

   4.16-rc5 (1)      3                     1,590,413    6,411   1,346,900    8,924
    lu-zone (2)      3    -0.7%     1.0%   1,579,816    7,216   1,360,376    4,418
4.16-rc5-nz (3)      3     2.3%     3.4%   1,626,925    5,321   1,392,515    8,180

   4.16-rc5 (1)      4                     2,064,476    5,034   1,656,475   14,713
    lu-zone (2)      4    -1.0%     0.9%   2,043,240    9,069   1,672,036    5,342
4.16-rc5-nz (3)      4     1.2%     3.9%   2,089,959    8,287   1,721,614    7,796

   4.16-rc5 (1)      5                     2,486,090   11,178   1,878,085   15,286
    lu-zone (2)      5    -0.1%     1.1%   2,483,021   15,100   1,898,459    9,295
4.16-rc5-nz (3)      5     1.3%     4.0%   2,517,602   13,717   1,952,995    7,481

   4.16-rc5 (1)      6                     2,869,756    9,194   2,058,398   20,580
    lu-zone (2)      6     0.4%     3.5%   2,882,220   20,583   2,129,444    9,689
4.16-rc5-nz (3)      6     2.4%     6.0%   2,937,618   14,126   2,182,859    9,650

   4.16-rc5 (1)      7                     3,242,589   15,354   2,231,977   20,188
    lu-zone (2)      7     0.9%     3.2%   3,270,796   15,124   2,303,780    6,607
4.16-rc5-nz (3)      7     2.0%     6.7%   3,306,528   17,683   2,381,279   16,507

   4.16-rc5 (1)      8                     3,598,209   10,764   2,361,819   13,508
    lu-zone (2)      8     1.2%     4.5%   3,642,407   14,893   2,469,191   16,250
4.16-rc5-nz (3)      8     2.0%     8.1%   3,671,834   17,501   2,552,786   12,112

   4.16-rc5 (1)      9                     3,974,345   12,605   2,511,565   31,986
    lu-zone (2)      9     2.3%     4.8%   4,067,553   11,069   2,632,608    8,158
4.16-rc5-nz (3)      9     2.5%     9.8%   4,073,111   12,463   2,758,075   31,432

   4.16-rc5 (1)     10                     4,333,026   12,187   2,636,914   15,064
    lu-zone (2)     10     2.4%     5.8%   4,435,852   21,691   2,789,949   16,399
4.16-rc5-nz (3)     10     3.2%    10.3%   4,470,666   23,663   2,907,263   15,052

   4.16-rc5 (1)     11                     4,932,423   12,183   2,675,769   23,924
    lu-zone (2)     11     2.7%     3.6%   5,064,666   18,600   2,771,476   22,438
4.16-rc5-nz (3)     11     3.6%     7.9%   5,110,434   21,459   2,888,419   18,180

   4.16-rc5 (1)     12                     5,461,255   14,704   2,631,232   24,390
    lu-zone (2)     12     1.7%     2.5%   5,554,957   20,978   2,697,554   19,369
4.16-rc5-nz (3)     12     3.1%     5.8%   5,629,143   22,781   2,782,902   20,346

   4.16-rc5 (1)     13                     5,924,367   11,835   2,445,607   25,821
    lu-zone (2)     13     1.4%     4.9%   6,004,723   19,070   2,566,547   26,031
4.16-rc5-nz (3)     13     2.9%     8.4%   6,094,087   17,676   2,651,793   20,050

   4.16-rc5 (1)     14                     6,381,611   16,791   2,277,611   39,557
    lu-zone (2)     14     1.2%     4.4%   6,459,693   18,869   2,377,795   25,093
4.16-rc5-nz (3)     14     2.4%    12.4%   6,534,837   24,990   2,560,085   12,638

   4.16-rc5 (1)     15                     6,804,737   13,653   2,232,121   19,408
    lu-zone (2)     15     1.1%     4.8%   6,881,730   18,995   2,338,868   18,922
4.16-rc5-nz (3)     15     2.4%    12.5%   6,970,318   28,677   2,510,594   25,890

   4.16-rc5 (1)     16                     7,197,145   17,499   2,313,168   16,694
    lu-zone (2)     16     1.2%     4.5%   7,287,072   28,727   2,418,232   23,120
4.16-rc5-nz (3)     16     2.6%     7.6%   7,383,613   17,991   2,489,382   28,385

   4.16-rc5 (1)     17                     7,550,498   15,101   2,226,427   24,768
    lu-zone (2)     17     1.6%     4.4%   7,667,641   24,305   2,324,675   16,265
4.16-rc5-nz (3)     17     2.8%     5.3%   7,761,917   29,854   2,345,195   20,659

   4.16-rc5 (1)     18                     7,902,794   12,578   2,188,399   37,453
    lu-zone (2)     18     1.7%     6.0%   8,033,876   21,158   2,320,641   13,053
4.16-rc5-nz (3)     18     2.8%     8.9%   8,126,732   27,619   2,383,083   25,172

   4.16-rc5 (1)     19                     8,277,506   15,448   2,198,021   33,074
    lu-zone (2)     19     2.1%     6.4%   8,453,255   17,411   2,339,395   20,220
4.16-rc5-nz (3)     19     3.0%     7.7%   8,529,130   29,852   2,366,800   20,139

   4.16-rc5 (1)     20                     8,651,034   19,705   2,239,626   25,694
    lu-zone (2)     20     2.2%     3.2%   8,840,988   24,387   2,311,966   25,693
4.16-rc5-nz (3)     20     3.1%     9.3%   8,918,721   34,761   2,448,589   26,227

   4.16-rc5 (1)     21                     8,777,023   11,833   2,259,622   32,154
    lu-zone (2)     21     2.5%     5.7%   8,993,348   25,480   2,389,000   29,464
4.16-rc5-nz (3)     21     3.5%     6.3%   9,085,319   27,790   2,401,713   28,706

   4.16-rc5 (1)     22                     8,855,202   27,455   2,268,030   35,914
    lu-zone (2)     22     3.0%     4.1%   9,123,705   34,288   2,361,876   31,917
4.16-rc5-nz (3)     22     4.2%    11.9%   9,228,843   33,483   2,536,867   26,000

   4.16-rc5 (1)     23                     8,952,897   21,601   2,280,539   30,530
    lu-zone (2)     23     3.5%    10.1%   9,268,365   30,803   2,510,883   26,312
4.16-rc5-nz (3)     23     4.6%    10.3%   9,367,048   30,681   2,514,740   32,896

   4.16-rc5 (1)     24                     9,036,582   19,482   2,374,728   42,891
    lu-zone (2)     24     3.7%     2.2%   9,369,541   33,253   2,425,993   45,952
4.16-rc5-nz (3)     24     5.0%     4.5%   9,489,640   28,066   2,482,179   33,820

   4.16-rc5 (1)     25                     9,136,041   18,232   2,336,090   30,036
    lu-zone (2)     25     4.0%     6.1%   9,497,501   34,408   2,478,602   31,959
4.16-rc5-nz (3)     25     4.7%    10.5%   9,563,832   34,506   2,581,056   38,516

   4.16-rc5 (1)     26                     9,226,630   17,998   2,326,070   33,782
    lu-zone (2)     26     3.7%     7.8%   9,570,547   27,052   2,508,396   41,848
4.16-rc5-nz (3)     26     4.4%    10.4%   9,634,192   40,217   2,566,842   31,115

   4.16-rc5 (1)     27                     9,305,784   24,573   2,391,261   29,547
    lu-zone (2)     27     3.8%     6.1%   9,656,252   39,164   2,536,624   45,738
4.16-rc5-nz (3)     27     4.5%     5.2%   9,720,210   31,171   2,516,447   32,347

   4.16-rc5 (1)     28                     9,381,004   19,377   2,442,774   35,745
    lu-zone (2)     28     2.6%     1.3%   9,626,125   65,187   2,474,560   29,977
4.16-rc5-nz (3)     28     4.1%     4.4%   9,766,045   54,298   2,549,227   31,199

   4.16-rc5 (1)     29                     9,401,844   27,746   2,456,372   40,549
    lu-zone (2)     29     2.7%     3.8%   9,652,161   51,629   2,549,004   39,990
4.16-rc5-nz (3)     29     3.5%     6.9%   9,731,681   51,852   2,625,589   27,821

   4.16-rc5 (1)     30                     9,428,320   17,561   2,509,119   39,752
    lu-zone (2)     30     2.1%     6.1%   9,630,472   50,106   2,662,447   44,347
4.16-rc5-nz (3)     30     3.1%     5.7%   9,722,152   50,348   2,651,891   28,518

   4.16-rc5 (1)     31                     9,561,062   21,909   2,392,883   24,180
    lu-zone (2)     31     2.0%     7.1%   9,755,774   60,381   2,563,573   32,132
4.16-rc5-nz (3)     31     3.5%    15.0%   9,894,735   45,966   2,752,506   28,516

   4.16-rc5 (1)     32                     9,624,859   30,462   2,480,667   27,055
    lu-zone (2)     32     2.7%     5.5%   9,883,943   61,850   2,618,326   36,655
4.16-rc5-nz (3)     32     4.5%    15.4%  10,057,320   46,352   2,863,788   34,021

   4.16-rc5 (1)     33                     9,739,896   35,435   2,476,666   30,301
    lu-zone (2)     33     3.1%     8.2%  10,043,706   60,943   2,680,570   42,346
4.16-rc5-nz (3)     33     4.6%    16.8%  10,191,082   51,348   2,893,385   32,717

   4.16-rc5 (1)     34                     9,833,955   39,366   2,628,480   36,567
    lu-zone (2)     34     3.5%     2.7%  10,180,871   50,050   2,699,941   42,804
4.16-rc5-nz (3)     34     5.0%     7.3%  10,323,136   50,767   2,820,628   30,551

   4.16-rc5 (1)     35                     9,908,832   20,826   2,666,415   51,379
    lu-zone (2)     35     3.5%     0.5%  10,251,385   58,551   2,679,925   49,143
4.16-rc5-nz (3)     35     5.1%     5.7%  10,418,155   51,726   2,817,192   34,042

   4.16-rc5 (1)     36                     9,969,311   20,377   2,563,399   36,720
    lu-zone (2)     36     3.5%     4.8%  10,314,449   60,867   2,686,176   42,925
4.16-rc5-nz (3)     36     5.4%     9.1%  10,504,881   53,100   2,796,816   37,461

   4.16-rc5 (1)     37                    10,077,169   36,182   2,584,728   32,672
    lu-zone (2)     37     3.1%     7.4%  10,393,453   63,048   2,775,523   39,745
4.16-rc5-nz (3)     37     4.7%    11.9%  10,549,870   45,280   2,893,000   39,102

   4.16-rc5 (1)     38                    10,115,997   25,835   2,653,036   33,259
    lu-zone (2)     38     2.7%     5.7%  10,388,901   63,402   2,803,290   36,020
4.16-rc5-nz (3)     38     4.6%    11.2%  10,580,796   63,516   2,949,834   31,421

   4.16-rc5 (1)     39                    10,162,757   33,118   2,681,195   30,591
    lu-zone (2)     39     2.5%     3.0%  10,413,010   76,719   2,761,752   32,471
4.16-rc5-nz (3)     39     4.0%     9.5%  10,568,061   65,614   2,935,127   38,463

   4.16-rc5 (1)     40                    10,223,472   41,882   2,670,421   26,049
    lu-zone (2)     40     2.4%     5.0%  10,470,977   58,008   2,803,478   37,111
4.16-rc5-nz (3)     40     4.1%     7.4%  10,646,450   54,810   2,868,986   52,724
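
(The speedup columns are each kernel's pgf/s relative to the 4.16-rc5
row at the same ntask; e.g. for the 1-task proc case,
609,505 / 586,305 - 1 is about 4.0%.)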

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-29 19:19 ` Daniel Jordan
@ 2018-03-30  1:42   ` Aaron Lu
  2018-03-30 14:27     ` Daniel Jordan
  0 siblings, 1 reply; 30+ messages in thread
From: Aaron Lu @ 2018-03-30  1:42 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox

On Thu, Mar 29, 2018 at 03:19:46PM -0400, Daniel Jordan wrote:
> On 03/20/2018 04:54 AM, Aaron Lu wrote:
> > This series is meant to improve zone->lock scalability for order 0 pages.
> > With will-it-scale/page_fault1 workload, on a 2 sockets Intel Skylake
> > server with 112 CPUs, CPU spend 80% of its time spinning on zone->lock.
> > Perf profile shows the most time consuming part under zone->lock is the
> > cache miss on "struct page", so here I'm trying to avoid those cache
> > misses.
> 
> I ran page_fault1 comparing 4.16-rc5 to your recent work, these four patches
> plus the three others from your github branch zone_lock_rfc_v2. Out of
> curiosity I also threw in another 4.16-rc5 with the pcp batch size adjusted
> so high (10922 pages) that we always stay in the pcp lists and out of buddy
> completely.  I used your patch[*] in this last kernel.
> 
> This was on a 2-socket, 20-core broadwell server.
> 
> There were some small regressions a bit outside the noise at low process
> counts (2-5) but I'm not sure they're repeatable.  Anyway, it does improve
> the microbenchmark across the board.

Thanks for the result.

The limited improvement is expected since the lock contention only
shifts to lru_lock instead of going away entirely. So what would be
interesting to see is how it performs with
v4.16-rc5 + my_zone_lock_patchset + your_lru_lock_patchset

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
  2018-03-30  1:42   ` Aaron Lu
@ 2018-03-30 14:27     ` Daniel Jordan
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2018-03-30 14:27 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Matthew Wilcox



On 03/29/2018 09:42 PM, Aaron Lu wrote:
> On Thu, Mar 29, 2018 at 03:19:46PM -0400, Daniel Jordan wrote:
>> On 03/20/2018 04:54 AM, Aaron Lu wrote:
>>> This series is meant to improve zone->lock scalability for order 0 pages.
>>> With will-it-scale/page_fault1 workload, on a 2 sockets Intel Skylake
>>> server with 112 CPUs, CPU spend 80% of its time spinning on zone->lock.
>>> Perf profile shows the most time consuming part under zone->lock is the
>>> cache miss on "struct page", so here I'm trying to avoid those cache
>>> misses.
>>
>> I ran page_fault1 comparing 4.16-rc5 to your recent work, these four patches
>> plus the three others from your github branch zone_lock_rfc_v2. Out of
>> curiosity I also threw in another 4.16-rc5 with the pcp batch size adjusted
>> so high (10922 pages) that we always stay in the pcp lists and out of buddy
>> completely.  I used your patch[*] in this last kernel.
>>
>> This was on a 2-socket, 20-core broadwell server.
>>
>> There were some small regressions a bit outside the noise at low process
>> counts (2-5) but I'm not sure they're repeatable.  Anyway, it does improve
>> the microbenchmark across the board.
> 
> Thanks for the result.
> 
> The limited improvement is expected since the lock contention only
> shifts to lru_lock instead of going away entirely. So what would be
> interesting to see is how it performs with
> v4.16-rc5 + my_zone_lock_patchset + your_lru_lock_patchset

Yep, that's 'coming soon.'

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2018-03-30 14:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20  8:54 [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Aaron Lu
2018-03-20  8:54 ` [RFC PATCH v2 1/4] mm/page_alloc: use helper functions to add/remove a page to/from buddy Aaron Lu
2018-03-20 11:35   ` Vlastimil Babka
2018-03-20 13:50     ` Aaron Lu
2018-03-20  8:54 ` [RFC PATCH v2 2/4] mm/__free_one_page: skip merge for order-0 page unless compaction failed Aaron Lu
2018-03-20 11:45   ` Vlastimil Babka
2018-03-20 14:11     ` Aaron Lu
2018-03-21  7:53       ` Vlastimil Babka
2018-03-22 17:15       ` Matthew Wilcox
2018-03-22 18:39         ` Daniel Jordan
2018-03-22 18:50           ` Matthew Wilcox
2018-03-20 22:58   ` Figo.zhang
2018-03-21  1:59     ` Aaron Lu
2018-03-21  4:21       ` Figo.zhang
2018-03-21  4:53         ` Aaron Lu
2018-03-21  5:59           ` Figo.zhang
2018-03-21  7:42             ` Aaron Lu
2018-03-20  8:54 ` [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure Aaron Lu
2018-03-20 22:29   ` Figo.zhang
2018-03-21  1:52     ` Aaron Lu
2018-03-21 12:55   ` Vlastimil Babka
2018-03-21 15:01     ` Aaron Lu
2018-03-29 19:16       ` Daniel Jordan
2018-03-20  8:54 ` [RFC PATCH v2 4/4] mm/free_pcppages_bulk: reduce overhead of cluster operation on free path Aaron Lu
2018-03-21 17:44 ` [RFC PATCH v2 0/4] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free Daniel Jordan
2018-03-22  1:30   ` Aaron Lu
2018-03-22 11:20     ` Daniel Jordan
2018-03-29 19:19 ` Daniel Jordan
2018-03-30  1:42   ` Aaron Lu
2018-03-30 14:27     ` Daniel Jordan
