* [PATCH 0/2] High-order per-cpu cache v5
@ 2016-12-02  0:22 ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel,
	Mel Gorman

The following two patches implement a per-cpu cache for high-order
allocations, primarily aimed at SLUB. The first patch is a bug fix that
is technically unrelated but was discovered during review and so is
batched with the series. The second patch implements the cache.

 include/linux/mmzone.h |  20 +++++++-
 mm/page_alloc.c        | 122 +++++++++++++++++++++++++++++++------------------
 2 files changed, 96 insertions(+), 46 deletions(-)

-- 
2.10.2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  0:22 ` Mel Gorman
@ 2016-12-02  0:22   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel,
	Mel Gorman

Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
defer debugging checks of pages allocated from the PCP") will allow the
per-cpu list counter to be out of sync with the per-cpu list contents
if a struct page is corrupted. This patch keeps the accounting in sync.
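
To illustrate the class of problem (a minimal standalone userspace sketch,
not kernel code; page_is_corrupted() and the failure pattern are invented
for the example), crediting the caller with the loop index rather than the
number of pages actually placed on the list lets pcp->count drift ahead of
the real list length whenever a page fails its checks and is skipped:

#include <stdio.h>

/* Hypothetical stand-in for a page failing its debugging checks. */
static int page_is_corrupted(int i)
{
	return i == 3;		/* pretend one page in the batch is bad */
}

/* Mimics the shape of rmqueue_bulk(): request 'count' pages, skip bad ones. */
static int fill_list(int count, int *list_len)
{
	int i, alloced = 0;

	for (i = 0; i < count; i++) {
		if (page_is_corrupted(i))
			continue;	/* skipped page never reaches the list */
		(*list_len)++;		/* page added to the pcp list */
		alloced++;
	}

	/*
	 * The pre-fix code effectively returned 'i' (== count) here, so the
	 * caller credited pcp->count with pages that are not on the list.
	 */
	return alloced;
}

int main(void)
{
	int list_len = 0;
	int count = 8;
	int alloced = fill_list(count, &list_len);

	printf("requested %d, on list %d, buggy count %d, fixed count %d\n",
	       count, list_len, count, alloced);
	return 0;
}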

Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Signed-off-by: Mel Gorman <mgorman@suse.de>
cc: stable@vger.kernel.org [4.7+]
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440e3ae2..777ed59570df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, bool cold)
 {
-	int i;
+	int i, alloced = 0;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
@@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		else
 			list_add_tail(&page->lru, list);
 		list = &page->lru;
+		alloced++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);
-	return i;
+	return alloced;
 }
 
 #ifdef CONFIG_NUMA
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  0:22 ` Mel Gorman
@ 2016-12-02  0:22   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel,
	Mel Gorman

Changelog since v4
o Avoid pcp->count getting out of sync if struct page gets corrupted

Changelog since v3
o Allow high-order atomic allocations to use reserves

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concern has two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some of the
concerns about zone->lock contention by extending the per-cpu page allocator
to cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage but, for
  some workloads, this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype for
  order-0 pages. The remaining entries are high-order caches, one per order,
  up to and including PAGE_ALLOC_COSTLY_ORDER (see the sketch after this
  list for the index mapping).

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed.
  Due to the high-order caches, the number of pages drained for a request
  is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
  that a single refill causes a drain on the next free.
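
As a rough illustration of the list layout and the retuned high watermark, the
following standalone sketch (not kernel code; MIGRATE_PCPTYPES == 3 and
PAGE_ALLOC_COSTLY_ORDER == 3 are assumptions matching a typical 4.9-era
configuration, and batch is an arbitrary example value) mirrors the
order_to_pindex()/pindex_to_order() helpers and the pageset_set_batch()
calculation in the patch below:

#include <stdio.h>

/* Assumed values for illustration; see include/linux/mmzone.h for the real ones. */
#define MIGRATE_PCPTYPES	3	/* unmovable, movable, reclaimable */
#define PAGE_ALLOC_COSTLY_ORDER	3
#define NR_PCP_LISTS	(MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* Mirror of the helpers the patch adds to include/linux/mmzone.h. */
static unsigned int pindex_to_order(unsigned int pindex)
{
	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
}

static unsigned int order_to_pindex(int migratetype, unsigned int order)
{
	return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
}

int main(void)
{
	unsigned int pindex;
	unsigned long batch = 31;	/* arbitrary example batch size */

	/* The first MIGRATE_PCPTYPES lists are order-0, one per migratetype. */
	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
		printf("pcp list %u holds order-%u pages\n",
		       pindex, pindex_to_order(pindex));

	/* Higher orders share one list each, regardless of migratetype. */
	printf("order-2 pages use list %u\n", order_to_pindex(0, 2));

	/* Old high watermark versus the pageset_set_batch() formula below. */
	printf("high watermark: old %lu, new %lu\n", 6 * batch,
	       batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER));
	return 0;
}

With batch = 31, for example, the high watermark moves from 186 to 341 pages
under these assumptions, which is what makes a single refill of one list
unlikely to force a drain on the next free.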

The benefit depends on both the workload and the machine as, ultimately, the
determining factor is whether cache line bouncing on, or contention for,
zone->lock is a problem. The patch was tested on a variety of workloads and
machines, some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v5
Hmean    send-64         178.38 (  0.00%)      260.54 ( 46.06%)
Hmean    send-128        351.49 (  0.00%)      518.56 ( 47.53%)
Hmean    send-256        671.23 (  0.00%)     1005.72 ( 49.83%)
Hmean    send-1024      2663.60 (  0.00%)     3880.54 ( 45.69%)
Hmean    send-2048      5126.53 (  0.00%)     7545.38 ( 47.18%)
Hmean    send-3312      7949.99 (  0.00%)    11324.34 ( 42.44%)
Hmean    send-4096      9433.56 (  0.00%)    12818.85 ( 35.89%)
Hmean    send-8192     15940.64 (  0.00%)    21404.98 ( 34.28%)
Hmean    send-16384    26699.54 (  0.00%)    32810.08 ( 22.89%)
Hmean    recv-64         178.38 (  0.00%)      260.52 ( 46.05%)
Hmean    recv-128        351.49 (  0.00%)      518.53 ( 47.53%)
Hmean    recv-256        671.20 (  0.00%)     1005.42 ( 49.79%)
Hmean    recv-1024      2663.45 (  0.00%)     3879.75 ( 45.67%)
Hmean    recv-2048      5126.26 (  0.00%)     7544.23 ( 47.17%)
Hmean    recv-3312      7949.50 (  0.00%)    11322.52 ( 42.43%)
Hmean    recv-4096      9433.04 (  0.00%)    12816.68 ( 35.87%)
Hmean    recv-8192     15939.64 (  0.00%)    21402.75 ( 34.27%)
Hmean    recv-16384    26698.44 (  0.00%)    32806.81 ( 22.88%)

1-socket 6 year old machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v4
Hmean    send-64          87.47 (  0.00%)      127.01 ( 45.21%)
Hmean    send-128        174.36 (  0.00%)      254.86 ( 46.17%)
Hmean    send-256        347.52 (  0.00%)      505.91 ( 45.58%)
Hmean    send-1024      1363.03 (  0.00%)     1962.49 ( 43.98%)
Hmean    send-2048      2632.68 (  0.00%)     3731.74 ( 41.75%)
Hmean    send-3312      4123.19 (  0.00%)     5859.08 ( 42.10%)
Hmean    send-4096      5056.48 (  0.00%)     7058.00 ( 39.58%)
Hmean    send-8192      8784.22 (  0.00%)    12134.53 ( 38.14%)
Hmean    send-16384    15081.60 (  0.00%)    19638.90 ( 30.22%)
Hmean    recv-64          86.19 (  0.00%)      126.34 ( 46.58%)
Hmean    recv-128        173.93 (  0.00%)      253.51 ( 45.75%)
Hmean    recv-256        346.19 (  0.00%)      503.34 ( 45.40%)
Hmean    recv-1024      1358.28 (  0.00%)     1951.63 ( 43.68%)
Hmean    recv-2048      2623.45 (  0.00%)     3701.67 ( 41.10%)
Hmean    recv-3312      4108.63 (  0.00%)     5817.75 ( 41.60%)
Hmean    recv-4096      5037.25 (  0.00%)     7004.79 ( 39.06%)
Hmean    recv-8192      8762.32 (  0.00%)    12059.83 ( 37.63%)
Hmean    recv-16384    15042.36 (  0.00%)    19514.33 ( 29.73%)

This is somewhat dramatic but it's also not universal. For example, it was
observed on an older HP machine using pcc-cpufreq that there was almost
no difference but pcc-cpufreq is also a known performance hazard.

These are quite different results but illustrate that the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
                         4.9.0-rc5             4.9.0-rc5
                           vanilla             hopcpu-v5
Hmean    14        93.90 (  0.00%)       92.74 ( -1.23%)
Hmean    100     1211.02 (  0.00%)     1284.36 (  6.05%)
Hmean    300     6016.95 (  0.00%)     6149.26 (  2.20%)
Hmean    500     8846.20 (  0.00%)     8988.84 (  1.61%)
Hmean    850    12280.71 (  0.00%)    12434.78 (  1.25%)
Stddev   14         5.32 (  0.00%)        4.79 (  9.88%)
Stddev   100       35.32 (  0.00%)       74.20 (-110.06%)
Stddev   300      132.63 (  0.00%)       65.50 ( 50.61%)
Stddev   500      152.90 (  0.00%)      188.67 (-23.40%)
Stddev   850      221.46 (  0.00%)      257.61 (-16.32%)

sockperf-udp-throughput
                         4.9.0-rc5             4.9.0-rc5
                           vanilla             hopcpu-v5
Hmean    14        36.32 (  0.00%)       51.25 ( 41.09%)
Hmean    100      258.41 (  0.00%)      355.76 ( 37.67%)
Hmean    300      773.96 (  0.00%)     1054.13 ( 36.20%)
Hmean    500     1291.07 (  0.00%)     1758.21 ( 36.18%)
Hmean    850     2137.88 (  0.00%)     2939.52 ( 37.50%)
Stddev   14         0.75 (  0.00%)        1.21 (-61.36%)
Stddev   100        9.02 (  0.00%)       11.53 (-27.89%)
Stddev   300       13.66 (  0.00%)       31.24 (-128.62%)
Stddev   500       25.01 (  0.00%)       53.44 (-113.67%)
Stddev   850       37.72 (  0.00%)       70.05 (-85.71%)

Note that the improvements for TCP are nowhere near as dramatic as for
netperf; there is a slight loss for small packets and the results are much
more variable. While the data is not presented here, it is known that when
running sockperf "under load", packet latency is generally lower but not
universally so. On the other hand, UDP improves performance but, again, is
much more variable.

This highlights that the patch is not necessarily a universal win and is
going to depend heavily on both the workload and the CPU used.

hackbench was also tested with both sockets and pipes and with both
processes and threads, and the results are interesting in terms of how
variability is impacted.

1-socket machine
hackbench-process-pipes
                        4.9.0-rc5             4.9.0-rc5
                          vanilla           highmark-v5
Amean    1      12.9637 (  0.00%)     13.1807 ( -1.67%)
Amean    3      13.4770 (  0.00%)     13.6803 ( -1.51%)
Amean    5      18.5333 (  0.00%)     18.7383 ( -1.11%)
Amean    7      24.5690 (  0.00%)     23.0550 (  6.16%)
Amean    12     39.7990 (  0.00%)     36.7207 (  7.73%)
Amean    16     56.0520 (  0.00%)     48.2890 ( 13.85%)
Stddev   1       0.3847 (  0.00%)      0.5853 (-52.15%)
Stddev   3       0.2652 (  0.00%)      0.0295 ( 88.89%)
Stddev   5       0.5589 (  0.00%)      0.2466 ( 55.87%)
Stddev   7       0.5310 (  0.00%)      0.6680 (-25.79%)
Stddev   12      1.0780 (  0.00%)      0.3230 ( 70.04%)
Stddev   16      2.1138 (  0.00%)      0.6835 ( 67.66%)

hackbench-process-sockets
Amean    1       4.8873 (  0.00%)      4.7180 (  3.46%)
Amean    3      14.1157 (  0.00%)     14.3643 ( -1.76%)
Amean    5      22.5537 (  0.00%)     23.1380 ( -2.59%)
Amean    7      30.3743 (  0.00%)     31.1520 ( -2.56%)
Amean    12     49.1773 (  0.00%)     50.3060 ( -2.30%)
Amean    16     64.0873 (  0.00%)     66.2633 ( -3.40%)
Stddev   1       0.2360 (  0.00%)      0.2201 (  6.74%)
Stddev   3       0.0539 (  0.00%)      0.0780 (-44.72%)
Stddev   5       0.1463 (  0.00%)      0.1579 ( -7.90%)
Stddev   7       0.1260 (  0.00%)      0.3091 (-145.31%)
Stddev   12      0.2169 (  0.00%)      0.4822 (-122.36%)
Stddev   16      0.0529 (  0.00%)      0.4513 (-753.20%)

It's not a universal win for pipes but the differences are within the
noise. What is interesting is that variability shows both gains and losses,
in stark contrast to the sockperf results. On the other hand, sockets
generally show small losses, albeit within the noise, with more variability.
Once again, the workload and CPU get different results.

fsmark was tested with zero-sized files to continually allocate slab objects
but didn't show any differences. This can be explained by the fact that the
workload only allocates and does not have a mix of allocs/frees that would
benefit from the caching. It was tested to ensure no major harm was done.

While it is recognised that this is a mixed bag of results, the patch
helps a lot more workloads than it hurts and intuitively, avoiding the
zone->lock in some cases is a good thing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/mmzone.h |  20 ++++++++-
 mm/page_alloc.c        | 117 +++++++++++++++++++++++++++++++------------------
 2 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..54032ab2f4f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,6 +255,24 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
+/*
+ * One per migratetype for order-0 pages and one per high-order up to
+ * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
+ * allocations to contaminate reclaimable pageblocks if high-order
+ * pages are heavily used.
+ */
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
+
+static inline unsigned int pindex_to_order(unsigned int pindex)
+{
+	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
+}
+
+static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
+{
+	return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
+}
+
 #define min_wmark_pages(z) (z->watermark[WMARK_MIN])
 #define low_wmark_pages(z) (z->watermark[WMARK_LOW])
 #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
@@ -265,7 +283,7 @@ struct per_cpu_pages {
 	int batch;		/* chunk size for buddy add/remove */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
-	struct list_head lists[MIGRATE_PCPTYPES];
+	struct list_head lists[NR_PCP_LISTS];
 };
 
 struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 777ed59570df..5f1faaf0a5e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
 }
 
 #ifdef CONFIG_DEBUG_VM
-static inline bool free_pcp_prepare(struct page *page)
+static inline bool free_pcp_prepare(struct page *page, unsigned int order)
 {
-	return free_pages_prepare(page, 0, true);
+	return free_pages_prepare(page, order, true);
 }
 
 static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
 	return false;
 }
 #else
-static bool free_pcp_prepare(struct page *page)
+static bool free_pcp_prepare(struct page *page, unsigned int order)
 {
-	return free_pages_prepare(page, 0, false);
+	return free_pages_prepare(page, order, false);
 }
 
 static bool bulkfree_pcp_prepare(struct page *page)
@@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
 static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp)
 {
-	int migratetype = 0;
-	int batch_free = 0;
+	unsigned int pindex = UINT_MAX;	/* Reclaim will start at 0 */
+	unsigned int batch_free = 0;
+	unsigned int nr_freed = 0;
 	unsigned long nr_scanned;
 	bool isolated_pageblocks;
 
@@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
-	while (count) {
+	while (count > 0) {
 		struct page *page;
 		struct list_head *list;
+		unsigned int order;
 
 		/*
 		 * Remove pages from lists in a round-robin fashion. A
 		 * batch_free count is maintained that is incremented when an
-		 * empty list is encountered.  This is so more pages are freed
-		 * off fuller lists instead of spinning excessively around empty
-		 * lists
+		 * empty list is encountered. This is not exact due to
+		 * high-order but precision is not required.
 		 */
 		do {
 			batch_free++;
-			if (++migratetype == MIGRATE_PCPTYPES)
-				migratetype = 0;
-			list = &pcp->lists[migratetype];
+			if (++pindex == NR_PCP_LISTS)
+				pindex = 0;
+			list = &pcp->lists[pindex];
 		} while (list_empty(list));
 
 		/* This is the only non-empty list. Free them all. */
-		if (batch_free == MIGRATE_PCPTYPES)
+		if (batch_free == NR_PCP_LISTS)
 			batch_free = count;
 
+		order = pindex_to_order(pindex);
 		do {
 			int mt;	/* migratetype of the to-be-freed page */
 
@@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (unlikely(isolated_pageblocks))
 				mt = get_pageblock_migratetype(page);
 
+			nr_freed += (1 << order);
+			count -= (1 << order);
 			if (bulkfree_pcp_prepare(page))
 				continue;
 
-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
-		} while (--count && --batch_free && !list_empty(list));
+			__free_one_page(page, page_to_pfn(page), zone, order, mt);
+			trace_mm_page_pcpu_drain(page, order, mt);
+		} while (count > 0 && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+	pcp->count -= nr_freed;
 }
 
 static void free_one_page(struct zone *zone,
@@ -2244,10 +2249,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	local_irq_save(flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
-	if (to_drain > 0) {
+	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-		pcp->count -= to_drain;
-	}
 	local_irq_restore(flags);
 }
 #endif
@@ -2269,10 +2272,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	pset = per_cpu_ptr(zone->pageset, cpu);
 
 	pcp = &pset->pcp;
-	if (pcp->count) {
+	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
-		pcp->count = 0;
-	}
 	local_irq_restore(flags);
 }
 
@@ -2404,18 +2405,18 @@ void mark_free_pages(struct zone *zone)
 #endif /* CONFIG_PM */
 
 /*
- * Free a 0-order page
+ * Free a pcp page
  * cold == true ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, bool cold)
+static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
-	int migratetype;
+	int migratetype, pindex;
 
-	if (!free_pcp_prepare(page))
+	if (!free_pcp_prepare(page, order))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
@@ -2432,28 +2433,33 @@ void free_hot_cold_page(struct page *page, bool cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, order, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pindex = order_to_pindex(migratetype, order);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (!cold)
-		list_add(&page->lru, &pcp->lists[migratetype]);
+		list_add(&page->lru, &pcp->lists[pindex]);
 	else
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	pcp->count++;
+		list_add_tail(&page->lru, &pcp->lists[pindex]);
+	pcp->count += 1 << order;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = READ_ONCE(pcp->batch);
 		free_pcppages_bulk(zone, batch, pcp);
-		pcp->count -= batch;
 	}
 
 out:
 	local_irq_restore(flags);
 }
 
+void free_hot_cold_page(struct page *page, bool cold)
+{
+	__free_hot_cold_page(page, cold, 0);
+}
+
 /*
  * Free a list of 0-order pages
  */
@@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 	struct page *page;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 
-	if (likely(order == 0)) {
+	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
 		local_irq_save(flags);
 		do {
+			unsigned int pindex;
+
+			pindex = order_to_pindex(migratetype, order);
 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
-			list = &pcp->lists[migratetype];
+			list = &pcp->lists[pindex];
 			if (list_empty(list)) {
-				pcp->count += rmqueue_bulk(zone, 0,
+				int nr_pages = rmqueue_bulk(zone, order,
 						pcp->batch, list,
 						migratetype, cold);
-				if (unlikely(list_empty(list)))
+				if (unlikely(list_empty(list))) {
+					/*
+					 * Retry high-order atomic allocs
+					 * from the buddy list which may
+					 * use MIGRATE_HIGHATOMIC.
+					 */
+					if (order && (alloc_flags & ALLOC_HARDER))
+						goto try_buddylist;
+
 					goto failed;
+				}
+				pcp->count += (nr_pages << order);
 			}
 
 			if (cold)
@@ -2611,10 +2630,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 				page = list_first_entry(list, struct page, lru);
 
 			list_del(&page->lru);
-			pcp->count--;
+			pcp->count -= (1 << order);
 
 		} while (check_new_pcp(page));
 	} else {
+try_buddylist:
 		/*
 		 * We most definitely don't want callers attempting to
 		 * allocate greater than order-1 page units with __GFP_NOFAIL.
@@ -3838,8 +3858,8 @@ EXPORT_SYMBOL(get_zeroed_page);
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
-		if (order == 0)
-			free_hot_cold_page(page, false);
+		if (order <= PAGE_ALLOC_COSTLY_ORDER)
+			__free_hot_cold_page(page, false, order);
 		else
 			__free_pages_ok(page, order);
 	}
@@ -5161,20 +5181,31 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 /* a companion to pageset_set_high() */
 static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
 {
-	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
+	unsigned long high;
+
+	/*
+	 * per-cpu refills occur when a per-cpu list for a migratetype
+	 * or a high-order is depleted even if pages are free overall.
+	 * Tune the high watermark such that it's unlikely, but not
+	 * impossible, that a single refill event will trigger a
+	 * shrink on the next free to the per-cpu list.
+	 */
+	high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
+
+	pageset_update(&p->pcp, high, max(1UL, 1 * batch));
 }
 
 static void pageset_init(struct per_cpu_pageset *p)
 {
 	struct per_cpu_pages *pcp;
-	int migratetype;
+	unsigned int pindex;
 
 	memset(p, 0, sizeof(*p));
 
 	pcp = &p->pcp;
 	pcp->count = 0;
-	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
-		INIT_LIST_HEAD(&pcp->lists[migratetype]);
+	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+		INIT_LIST_HEAD(&pcp->lists[pindex]);
 }
 
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  0:22   ` Mel Gorman
@ 2016-12-02  3:47     ` Hillf Danton
  -1 siblings, 0 replies; 40+ messages in thread
From: Hillf Danton @ 2016-12-02  3:47 UTC (permalink / raw)
  To: 'Mel Gorman', 'Andrew Morton'
  Cc: 'Christoph Lameter', 'Michal Hocko',
	'Vlastimil Babka', 'Johannes Weiner',
	'Jesper Dangaard Brouer', 'Linux-MM',
	'Linux-Kernel'

On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> defer debugging checks of pages allocated from the PCP") will allow the
> per-cpu list counter to be out of sync with the per-cpu list contents
> if a struct page is corrupted. This patch keeps the accounting in sync.
> 
> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> cc: stable@vger.kernel.org [4.7+]
> ---
>  mm/page_alloc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440e3ae2..777ed59570df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, bool cold)
>  {
> -	int i;
> +	int i, alloced = 0;
> 
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		else
>  			list_add_tail(&page->lru, list);
>  		list = &page->lru;
> +		alloced++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
>  	}
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

Now i is a pure index, yes?

>  	spin_unlock(&zone->lock);
> -	return i;
> +	return alloced;
>  }
> 
>  #ifdef CONFIG_NUMA
> --
> 2.10.2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  0:22   ` Mel Gorman
@ 2016-12-02  6:03     ` Joonsoo Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Joonsoo Kim @ 2016-12-02  6:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

Hello, Mel.

I didn't follow the previous discussion, so what I raise here may be a
duplicate. Please point me to the link if it has been answered before.

On Fri, Dec 02, 2016 at 12:22:44AM +0000, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
> 
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v5
> Hmean    send-64         178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128        351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256        671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024      2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048      5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312      7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096      9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192     15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384    26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64         178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128        351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256        671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024      2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048      5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312      7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096      9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192     15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384    26698.44 (  0.00%)    32806.81 ( 22.88%)
> 
> 1-socket 6 year old machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v4
> Hmean    send-64          87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128        174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256        347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024      1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048      2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312      4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096      5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192      8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384    15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64          86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128        173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256        346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024      1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048      2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312      4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096      5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192      8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384    15042.36 (  0.00%)    19514.33 ( 29.73%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
> 
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
> 
> The observations on sockperf are different.
> 
> 2-socket modern machine
> sockperf-tcp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        93.90 (  0.00%)       92.74 ( -1.23%)
> Hmean    100     1211.02 (  0.00%)     1284.36 (  6.05%)
> Hmean    300     6016.95 (  0.00%)     6149.26 (  2.20%)
> Hmean    500     8846.20 (  0.00%)     8988.84 (  1.61%)
> Hmean    850    12280.71 (  0.00%)    12434.78 (  1.25%)
> Stddev   14         5.32 (  0.00%)        4.79 (  9.88%)
> Stddev   100       35.32 (  0.00%)       74.20 (-110.06%)
> Stddev   300      132.63 (  0.00%)       65.50 ( 50.61%)
> Stddev   500      152.90 (  0.00%)      188.67 (-23.40%)
> Stddev   850      221.46 (  0.00%)      257.61 (-16.32%)
> 
> sockperf-udp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        36.32 (  0.00%)       51.25 ( 41.09%)
> Hmean    100      258.41 (  0.00%)      355.76 ( 37.67%)
> Hmean    300      773.96 (  0.00%)     1054.13 ( 36.20%)
> Hmean    500     1291.07 (  0.00%)     1758.21 ( 36.18%)
> Hmean    850     2137.88 (  0.00%)     2939.52 ( 37.50%)
> Stddev   14         0.75 (  0.00%)        1.21 (-61.36%)
> Stddev   100        9.02 (  0.00%)       11.53 (-27.89%)
> Stddev   300       13.66 (  0.00%)       31.24 (-128.62%)
> Stddev   500       25.01 (  0.00%)       53.44 (-113.67%)
> Stddev   850       37.72 (  0.00%)       70.05 (-85.71%)
> 
> Note that the improvements for TCP are nowhere near as dramatic as for
> netperf; there is a slight loss for small packets and the results are much
> more variable. While it's not presented here, it's known that when running
> sockperf "under load", packet latency is generally lower but not universally
> so. On the other hand, UDP improves performance but, again, is much more
> variable.
> 
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
> 
> hackbench was also tested with both sockets and pipes, and with both
> processes and threads, and the results are interesting in terms of how
> variability is impacted
> 
> 1-socket machine
> hackbench-process-pipes
>                         4.9.0-rc5             4.9.0-rc5
>                           vanilla           highmark-v5
> Amean    1      12.9637 (  0.00%)     13.1807 ( -1.67%)
> Amean    3      13.4770 (  0.00%)     13.6803 ( -1.51%)
> Amean    5      18.5333 (  0.00%)     18.7383 ( -1.11%)
> Amean    7      24.5690 (  0.00%)     23.0550 (  6.16%)
> Amean    12     39.7990 (  0.00%)     36.7207 (  7.73%)
> Amean    16     56.0520 (  0.00%)     48.2890 ( 13.85%)
> Stddev   1       0.3847 (  0.00%)      0.5853 (-52.15%)
> Stddev   3       0.2652 (  0.00%)      0.0295 ( 88.89%)
> Stddev   5       0.5589 (  0.00%)      0.2466 ( 55.87%)
> Stddev   7       0.5310 (  0.00%)      0.6680 (-25.79%)
> Stddev   12      1.0780 (  0.00%)      0.3230 ( 70.04%)
> Stddev   16      2.1138 (  0.00%)      0.6835 ( 67.66%)
> 
> hackbench-process-sockets
> Amean    1       4.8873 (  0.00%)      4.7180 (  3.46%)
> Amean    3      14.1157 (  0.00%)     14.3643 ( -1.76%)
> Amean    5      22.5537 (  0.00%)     23.1380 ( -2.59%)
> Amean    7      30.3743 (  0.00%)     31.1520 ( -2.56%)
> Amean    12     49.1773 (  0.00%)     50.3060 ( -2.30%)
> Amean    16     64.0873 (  0.00%)     66.2633 ( -3.40%)
> Stddev   1       0.2360 (  0.00%)      0.2201 (  6.74%)
> Stddev   3       0.0539 (  0.00%)      0.0780 (-44.72%)
> Stddev   5       0.1463 (  0.00%)      0.1579 ( -7.90%)
> Stddev   7       0.1260 (  0.00%)      0.3091 (-145.31%)
> Stddev   12      0.2169 (  0.00%)      0.4822 (-122.36%)
> Stddev   16      0.0529 (  0.00%)      0.4513 (-753.20%)
> 
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses albeit within the noise with more variability.
> Once again, the workload and CPU get different results.
> 
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
> 
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/mmzone.h |  20 ++++++++-
>  mm/page_alloc.c        | 117 +++++++++++++++++++++++++++++++------------------
>  2 files changed, 93 insertions(+), 44 deletions(-)
> 
  
>  static bool bulkfree_pcp_prepare(struct page *page)
> @@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
>  static void free_pcppages_bulk(struct zone *zone, int count,
>  					struct per_cpu_pages *pcp)
>  {
> -	int migratetype = 0;
> -	int batch_free = 0;
> +	unsigned int pindex = UINT_MAX;	/* Reclaim will start at 0 */
> +	unsigned int batch_free = 0;
> +	unsigned int nr_freed = 0;
>  	unsigned long nr_scanned;
>  	bool isolated_pageblocks;
>  
> @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	if (nr_scanned)
>  		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>  
> -	while (count) {
> +	while (count > 0) {
>  		struct page *page;
>  		struct list_head *list;
> +		unsigned int order;
>  
>  		/*
>  		 * Remove pages from lists in a round-robin fashion. A
>  		 * batch_free count is maintained that is incremented when an
> -		 * empty list is encountered.  This is so more pages are freed
> -		 * off fuller lists instead of spinning excessively around empty
> -		 * lists
> +		 * empty list is encountered. This is not exact due to
> +		 * high-order but precision is not required.
>  		 */
>  		do {
>  			batch_free++;
> -			if (++migratetype == MIGRATE_PCPTYPES)
> -				migratetype = 0;
> -			list = &pcp->lists[migratetype];
> +			if (++pindex == NR_PCP_LISTS)
> +				pindex = 0;
> +			list = &pcp->lists[pindex];
>  		} while (list_empty(list));
>  
>  		/* This is the only non-empty list. Free them all. */
> -		if (batch_free == MIGRATE_PCPTYPES)
> +		if (batch_free == NR_PCP_LISTS)
>  			batch_free = count;
>  
> +		order = pindex_to_order(pindex);
>  		do {
>  			int mt;	/* migratetype of the to-be-freed page */
>  
> @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			if (unlikely(isolated_pageblocks))
>  				mt = get_pageblock_migratetype(page);
>  
> +			nr_freed += (1 << order);
> +			count -= (1 << order);
>  			if (bulkfree_pcp_prepare(page))
>  				continue;
>  
> -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> -			trace_mm_page_pcpu_drain(page, 0, mt);
> -		} while (--count && --batch_free && !list_empty(list));
> +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> +			trace_mm_page_pcpu_drain(page, order, mt);
> +		} while (count > 0 && --batch_free && !list_empty(list));
>  	}
>  	spin_unlock(&zone->lock);
> +	pcp->count -= nr_freed;
>  }
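
For reference, since order_to_pindex()/pindex_to_order() are not quoted
above, the list layout described in the changelog roughly corresponds to
helpers like the following. This is a sketch reconstructed from the
changelog text, not the actual hunk from the patch:

	/*
	 * Sketch only: MIGRATE_PCPTYPES order-0 lists come first, then one
	 * list per order from 1 up to PAGE_ALLOC_COSTLY_ORDER.
	 */
	#define NR_PCP_LISTS	(MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

	static inline unsigned int order_to_pindex(int migratetype,
						   unsigned int order)
	{
		if (order == 0)
			return migratetype;
		return MIGRATE_PCPTYPES + order - 1;
	}

	static inline unsigned int pindex_to_order(unsigned int pindex)
	{
		if (pindex < MIGRATE_PCPTYPES)
			return 0;
		return pindex - MIGRATE_PCPTYPES + 1;
	}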

I guess that this patch would cause the following problems.

1. If pcp->batch is too small, high-order pages will not be freed
easily and will survive longer. Think about the following situation
(a minimal sketch of this walk follows the list below).

Batch count: 7
MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
-> order 2...

free count: 1 + 1 + 1 + 2 + 4 = 9
so order 3 would not be freed.

2. It also seems that this logic penalizes high-order pages. One free
of a high-order page means 1 << order pages are freed rather than just
one page. The logic does round-robin to choose the target list, so the
number of pages freed will differ by order. I think that makes some
sense, because high-order pages are less important to cache in the pcp
than lower orders, but I'd like to know whether this is intended or not.
If intended, it deserves a comment.

3. I guess that order-0 file/anon page alloc/free is dominant in many
workloads. If that is the case, it invalidates the effect of the high-order
cache in the pcp, since the cached high-order pages would also be freed
to the buddy when a burst of order-0 frees happens.
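
A minimal user-space sketch of the point-1 walk (assuming one page on each
of the first lists and with the batch handling simplified; this is not the
kernel loop itself):

	#include <stdio.h>

	int main(void)
	{
		/* assumed layout: three order-0 lists, then order 1..3 lists */
		const char *name[] = { "UNMOVABLE-0", "MOVABLE-0",
				       "RECLAIMABLE-0", "order-1",
				       "order-2", "order-3" };
		unsigned int order[] = { 0, 0, 0, 1, 2, 3 };
		int count = 7;	/* pcp->batch in the scenario above */

		for (int i = 0; i < 6 && count > 0; i++) {
			/* free one page/block from this list */
			count -= 1 << order[i];
			printf("freed from %s, count now %d\n",
			       name[i], count);
		}
		/* order-3 is never reached: 1+1+1+2+4 = 9 already exceeds 7 */
		return 0;
	}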

> @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  	struct page *page;
>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>  
> -	if (likely(order == 0)) {
> +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
>  
>  		local_irq_save(flags);
>  		do {
> +			unsigned int pindex;
> +
> +			pindex = order_to_pindex(migratetype, order);
>  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -			list = &pcp->lists[migratetype];
> +			list = &pcp->lists[pindex];
>  			if (list_empty(list)) {
> -				pcp->count += rmqueue_bulk(zone, 0,
> +				int nr_pages = rmqueue_bulk(zone, order,
>  						pcp->batch, list,
>  						migratetype, cold);

Maybe you need to fix rmqueue_bulk(). rmqueue_bulk() allocates
batch * (1 << order) pages, so pcp->count can easily overflow pcp->high
because an empty list here doesn't mean that pcp->count is zero.
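
For example (the numbers here are assumptions for illustration, not the
kernel defaults):

	#include <stdio.h>

	int main(void)
	{
		unsigned int batch = 31, order = 3;	/* assumed values */
		unsigned int high = 6 * batch;		/* assumed watermark */
		unsigned int count = high - 1;		/* other lists nearly full */

		/* the order-3 list was empty, so a bulk refill is triggered */
		count += batch * (1u << order);
		printf("pcp->count=%u pcp->high=%u overshoot=%u\n",
		       count, high, count - high);
		return 0;
	}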

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  3:47     ` Hillf Danton
@ 2016-12-02  6:19       ` Vlastimil Babka
  -1 siblings, 0 replies; 40+ messages in thread
From: Vlastimil Babka @ 2016-12-02  6:19 UTC (permalink / raw)
  To: Hillf Danton, 'Mel Gorman', 'Andrew Morton'
  Cc: 'Christoph Lameter', 'Michal Hocko',
	'Johannes Weiner', 'Jesper Dangaard Brouer',
	'Linux-MM', 'Linux-Kernel'

On 12/02/2016 04:47 AM, Hillf Danton wrote:
> On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
>> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
>> defer debugging checks of pages allocated from the PCP") will allow the
>> per-cpu list counter to be out of sync with the per-cpu list contents
>> if a struct page is corrupted. This patch keeps the accounting in sync.
>>
>> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> cc: stable@vger.kernel.org [4.7+]

Acked-by: Vlastimil Babka <vbabka@suse.cz>

>> ---
>>  mm/page_alloc.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6de9440e3ae2..777ed59570df 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>>  			unsigned long count, struct list_head *list,
>>  			int migratetype, bool cold)
>>  {
>> -	int i;
>> +	int i, alloced = 0;
>>
>>  	spin_lock(&zone->lock);
>>  	for (i = 0; i < count; ++i) {
>> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>>  		else
>>  			list_add_tail(&page->lru, list);
>>  		list = &page->lru;
>> +		alloced++;
>>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>>  					      -(1 << order));
>>  	}
>>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>
> Now i is a pure index, yes?

No, even if a page fails the check_pcp_refill() check and is not 
"allocated", it is also no longer a free page, so it's correct to 
subtract it from NR_FREE_PAGES.

>>  	spin_unlock(&zone->lock);
>> -	return i;
>> +	return alloced;
>>  }
>>
>>  #ifdef CONFIG_NUMA
>> --
>> 2.10.2
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  0:22   ` Mel Gorman
@ 2016-12-02  8:12     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02  8:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> defer debugging checks of pages allocated from the PCP") will allow the
> per-cpu list counter to be out of sync with the per-cpu list contents
> if a struct page is corrupted. This patch keeps the accounting in sync.
>
> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> cc: stable@vger.kernel.org [4.7+]

I am trying to think about what would happen if we did go out of sync,
and I cannot spot a problem. Vlastimil has mentioned something about
free_pcppages_bulk looping forever, but I cannot see it happening right
now. So why is this worth a stable backport?

Anyway the patch looks correct
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440e3ae2..777ed59570df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, bool cold)
>  {
> -	int i;
> +	int i, alloced = 0;
>  
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		else
>  			list_add_tail(&page->lru, list);
>  		list = &page->lru;
> +		alloced++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
>  	}
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

I guess this deserves a comment (i vs. alloced is confusing and I bet
somebody will come up with a cleanup...). We leak corrupted pages
intentionally, so we should uncharge them from NR_FREE_PAGES.
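
Something along these lines, for example (the wording is only a suggestion,
not part of the patch):

	/*
	 * i pages were removed from the buddy lists even if some of them
	 * failed check_pcp_refill() and were leaked rather than placed on
	 * the pcp list, so NR_FREE_PAGES is adjusted by i while the pcp
	 * count only grows by 'alloced'.
	 */
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));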

>  	spin_unlock(&zone->lock);
> -	return i;
> +	return alloced;
>  }
>  
>  #ifdef CONFIG_NUMA
> -- 
> 2.10.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  6:03     ` Joonsoo Kim
@ 2016-12-02  8:21       ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02  8:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
[...]
> > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> >   impossible for the caller to know exactly how many pages were freed.
> >   Due to the high-order caches, the number of pages drained for a request
> >   is no longer precise.
> > 
> > o The high watermark for per-cpu pages is increased to reduce the probability
> >   that a single refill causes a drain on the next free.
[...]
> I guess that this patch would cause following problems.
> 
> 1. If pcp->batch is too small, high order page will not be freed
> easily and survive longer. Think about following situation.
> 
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
> 
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.

I guess the second paragraph above in the changelog tries to clarify
that...
 
> 2. And, It seems that this logic penalties high order pages. One free
> to high order page means 1 << order pages free rather than just
> one page free. This logic do round-robin to choose the target page so
> amount of freed page will be different by the order.

Yes this is indeed possible. The first paragraph above mentions this
problem.

> I think that it
> makes some sense because high order page are less important to cache
> in pcp than lower order but I'd like to know if it is intended or not.
> If intended, it deserves the comment.
> 
> 3. I guess that order-0 file/anon page alloc/free is dominent in many
> workloads. If this case happen, it invalidates effect of high order
> cache in pcp since cached high order pages would be also freed to the
> buddy when burst order-0 free happens.

Yes, this is true and I was wondering the same, but I believe this can be
enhanced later on. E.g. we can check the order when crossing the pcp->high
mark and drain only the given order's portion of the batch. I just wouldn't
over-optimize at this stage.
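
Roughly something like the following in the free path (an untested sketch
with assumed context, just to illustrate the idea):

	if (pcp->count >= pcp->high) {
		unsigned int batch = READ_ONCE(pcp->batch);

		/*
		 * Drain only this order's share of the batch so that a
		 * burst of order-0 frees does not flush the cached
		 * high-order pages.
		 */
		free_pcppages_bulk(zone, max(1U, batch >> order), pcp);
	}
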
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  0:22   ` Mel Gorman
@ 2016-12-02  8:25     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02  8:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri 02-12-16 00:22:44, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
> 
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v5
> Hmean    send-64         178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128        351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256        671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024      2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048      5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312      7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096      9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192     15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384    26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64         178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128        351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256        671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024      2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048      5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312      7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096      9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192     15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384    26698.44 (  0.00%)    32806.81 ( 22.88%)
> 
> 1-socket 6 year old machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v4
> Hmean    send-64          87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128        174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256        347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024      1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048      2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312      4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096      5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192      8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384    15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64          86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128        173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256        346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024      1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048      2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312      4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096      5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192      8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384    15042.36 (  0.00%)    19514.33 ( 29.73%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
> 
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
> 
> The observations on sockperf are different.
> 
> 2-socket modern machine
> sockperf-tcp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        93.90 (  0.00%)       92.74 ( -1.23%)
> Hmean    100     1211.02 (  0.00%)     1284.36 (  6.05%)
> Hmean    300     6016.95 (  0.00%)     6149.26 (  2.20%)
> Hmean    500     8846.20 (  0.00%)     8988.84 (  1.61%)
> Hmean    850    12280.71 (  0.00%)    12434.78 (  1.25%)
> Stddev   14         5.32 (  0.00%)        4.79 (  9.88%)
> Stddev   100       35.32 (  0.00%)       74.20 (-110.06%)
> Stddev   300      132.63 (  0.00%)       65.50 ( 50.61%)
> Stddev   500      152.90 (  0.00%)      188.67 (-23.40%)
> Stddev   850      221.46 (  0.00%)      257.61 (-16.32%)
> 
> sockperf-udp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        36.32 (  0.00%)       51.25 ( 41.09%)
> Hmean    100      258.41 (  0.00%)      355.76 ( 37.67%)
> Hmean    300      773.96 (  0.00%)     1054.13 ( 36.20%)
> Hmean    500     1291.07 (  0.00%)     1758.21 ( 36.18%)
> Hmean    850     2137.88 (  0.00%)     2939.52 ( 37.50%)
> Stddev   14         0.75 (  0.00%)        1.21 (-61.36%)
> Stddev   100        9.02 (  0.00%)       11.53 (-27.89%)
> Stddev   300       13.66 (  0.00%)       31.24 (-128.62%)
> Stddev   500       25.01 (  0.00%)       53.44 (-113.67%)
> Stddev   850       37.72 (  0.00%)       70.05 (-85.71%)
> 
> Note that the improvements for TCP are nowhere near as dramatic as for
> netperf; there is a slight loss for small packets and the results are much
> more variable. While it's not presented here, it's known that when running
> sockperf "under load", packet latency is generally lower but not universally
> so. On the other hand, UDP improves performance but, again, is much more
> variable.
> 
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
> 
> hackbench was also tested with both sockets and pipes, and with both
> processes and threads, and the results are interesting in terms of how
> variability is impacted
> 
> 1-socket machine
> hackbench-process-pipes
>                         4.9.0-rc5             4.9.0-rc5
>                           vanilla           highmark-v5
> Amean    1      12.9637 (  0.00%)     13.1807 ( -1.67%)
> Amean    3      13.4770 (  0.00%)     13.6803 ( -1.51%)
> Amean    5      18.5333 (  0.00%)     18.7383 ( -1.11%)
> Amean    7      24.5690 (  0.00%)     23.0550 (  6.16%)
> Amean    12     39.7990 (  0.00%)     36.7207 (  7.73%)
> Amean    16     56.0520 (  0.00%)     48.2890 ( 13.85%)
> Stddev   1       0.3847 (  0.00%)      0.5853 (-52.15%)
> Stddev   3       0.2652 (  0.00%)      0.0295 ( 88.89%)
> Stddev   5       0.5589 (  0.00%)      0.2466 ( 55.87%)
> Stddev   7       0.5310 (  0.00%)      0.6680 (-25.79%)
> Stddev   12      1.0780 (  0.00%)      0.3230 ( 70.04%)
> Stddev   16      2.1138 (  0.00%)      0.6835 ( 67.66%)
> 
> hackbench-process-sockets
> Amean    1       4.8873 (  0.00%)      4.7180 (  3.46%)
> Amean    3      14.1157 (  0.00%)     14.3643 ( -1.76%)
> Amean    5      22.5537 (  0.00%)     23.1380 ( -2.59%)
> Amean    7      30.3743 (  0.00%)     31.1520 ( -2.56%)
> Amean    12     49.1773 (  0.00%)     50.3060 ( -2.30%)
> Amean    16     64.0873 (  0.00%)     66.2633 ( -3.40%)
> Stddev   1       0.2360 (  0.00%)      0.2201 (  6.74%)
> Stddev   3       0.0539 (  0.00%)      0.0780 (-44.72%)
> Stddev   5       0.1463 (  0.00%)      0.1579 ( -7.90%)
> Stddev   7       0.1260 (  0.00%)      0.3091 (-145.31%)
> Stddev   12      0.2169 (  0.00%)      0.4822 (-122.36%)
> Stddev   16      0.0529 (  0.00%)      0.4513 (-753.20%)
> 
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses albeit within the noise with more variability.
> Once again, the workload and CPU get different results.
> 
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
> 
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

I thought I already did, but anyway
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
@ 2016-12-02  8:25     ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02  8:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri 02-12-16 00:22:44, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
> 
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v5
> Hmean    send-64         178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128        351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256        671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024      2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048      5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312      7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096      9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192     15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384    26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64         178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128        351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256        671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024      2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048      5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312      7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096      9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192     15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384    26698.44 (  0.00%)    32806.81 ( 22.88%)
> 
> 1-socket 6 year old machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v4
> Hmean    send-64          87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128        174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256        347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024      1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048      2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312      4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096      5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192      8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384    15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64          86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128        173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256        346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024      1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048      2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312      4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096      5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192      8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384    15042.36 (  0.00%)    19514.33 ( 29.73%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
> 
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
> 
> The observations on sockperf are different.
> 
> 2-socket modern machine
> sockperf-tcp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        93.90 (  0.00%)       92.74 ( -1.23%)
> Hmean    100     1211.02 (  0.00%)     1284.36 (  6.05%)
> Hmean    300     6016.95 (  0.00%)     6149.26 (  2.20%)
> Hmean    500     8846.20 (  0.00%)     8988.84 (  1.61%)
> Hmean    850    12280.71 (  0.00%)    12434.78 (  1.25%)
> Stddev   14         5.32 (  0.00%)        4.79 (  9.88%)
> Stddev   100       35.32 (  0.00%)       74.20 (-110.06%)
> Stddev   300      132.63 (  0.00%)       65.50 ( 50.61%)
> Stddev   500      152.90 (  0.00%)      188.67 (-23.40%)
> Stddev   850      221.46 (  0.00%)      257.61 (-16.32%)
> 
> sockperf-udp-throughput
>                          4.9.0-rc5             4.9.0-rc5
>                            vanilla             hopcpu-v5
> Hmean    14        36.32 (  0.00%)       51.25 ( 41.09%)
> Hmean    100      258.41 (  0.00%)      355.76 ( 37.67%)
> Hmean    300      773.96 (  0.00%)     1054.13 ( 36.20%)
> Hmean    500     1291.07 (  0.00%)     1758.21 ( 36.18%)
> Hmean    850     2137.88 (  0.00%)     2939.52 ( 37.50%)
> Stddev   14         0.75 (  0.00%)        1.21 (-61.36%)
> Stddev   100        9.02 (  0.00%)       11.53 (-27.89%)
> Stddev   300       13.66 (  0.00%)       31.24 (-128.62%)
> Stddev   500       25.01 (  0.00%)       53.44 (-113.67%)
> Stddev   850       37.72 (  0.00%)       70.05 (-85.71%)
> 
> Note that the improvements for TCP are nowhere near as dramatic as netperf;
> there is a slight loss for small packets and it's much more variable. While
> it's not presented here, it's known that when running sockperf "under load",
> packet latency is generally lower but not universally so. On the
> other hand, UDP improves performance but, again, it is much more variable.
> 
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
> 
> hackbench was also tested with both sockets and pipes, and with both processes
> and threads, and the results are interesting in terms of how variability
> is impacted
> 
> 1-socket machine
> hackbench-process-pipes
>                         4.9.0-rc5             4.9.0-rc5
>                           vanilla           highmark-v5
> Amean    1      12.9637 (  0.00%)     13.1807 ( -1.67%)
> Amean    3      13.4770 (  0.00%)     13.6803 ( -1.51%)
> Amean    5      18.5333 (  0.00%)     18.7383 ( -1.11%)
> Amean    7      24.5690 (  0.00%)     23.0550 (  6.16%)
> Amean    12     39.7990 (  0.00%)     36.7207 (  7.73%)
> Amean    16     56.0520 (  0.00%)     48.2890 ( 13.85%)
> Stddev   1       0.3847 (  0.00%)      0.5853 (-52.15%)
> Stddev   3       0.2652 (  0.00%)      0.0295 ( 88.89%)
> Stddev   5       0.5589 (  0.00%)      0.2466 ( 55.87%)
> Stddev   7       0.5310 (  0.00%)      0.6680 (-25.79%)
> Stddev   12      1.0780 (  0.00%)      0.3230 ( 70.04%)
> Stddev   16      2.1138 (  0.00%)      0.6835 ( 67.66%)
> 
> hackbench-process-sockets
> Amean    1       4.8873 (  0.00%)      4.7180 (  3.46%)
> Amean    3      14.1157 (  0.00%)     14.3643 ( -1.76%)
> Amean    5      22.5537 (  0.00%)     23.1380 ( -2.59%)
> Amean    7      30.3743 (  0.00%)     31.1520 ( -2.56%)
> Amean    12     49.1773 (  0.00%)     50.3060 ( -2.30%)
> Amean    16     64.0873 (  0.00%)     66.2633 ( -3.40%)
> Stddev   1       0.2360 (  0.00%)      0.2201 (  6.74%)
> Stddev   3       0.0539 (  0.00%)      0.0780 (-44.72%)
> Stddev   5       0.1463 (  0.00%)      0.1579 ( -7.90%)
> Stddev   7       0.1260 (  0.00%)      0.3091 (-145.31%)
> Stddev   12      0.2169 (  0.00%)      0.4822 (-122.36%)
> Stddev   16      0.0529 (  0.00%)      0.4513 (-753.20%)
> 
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses, albeit within the noise, with more variability.
> Once again, the workload and CPU get different results.
> 
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have a mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
> 
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

I thought I already did but anyway
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  6:03     ` Joonsoo Kim
@ 2016-12-02  9:04       ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02  9:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  			if (unlikely(isolated_pageblocks))
> >  				mt = get_pageblock_migratetype(page);
> >  
> > +			nr_freed += (1 << order);
> > +			count -= (1 << order);
> >  			if (bulkfree_pcp_prepare(page))
> >  				continue;
> >  
> > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > -		} while (--count && --batch_free && !list_empty(list));
> > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > +			trace_mm_page_pcpu_drain(page, order, mt);
> > +		} while (count > 0 && --batch_free && !list_empty(list));
> >  	}
> >  	spin_unlock(&zone->lock);
> > +	pcp->count -= nr_freed;
> >  }
> 
> I guess that this patch would cause the following problems.
> 
> 1. If pcp->batch is too small, high order pages will not be freed
> easily and will survive longer. Think about the following situation.
> 
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
> 
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.
> 

You're relying on the batch count to be 7 where in a lot of cases it's
31. Even if low batch counts are common on another platform or you adjusted
the other counts to be higher values until they equal 30, it would only be for
this one drain that no order-3 pages were freed. It's not a permanent situation.

When or if it gets freed depends on the allocation request stream but the
same applies to the existing caches. If a high-order request arrives, it'll
be used. If all the requests are for the other orders, then eventually
the frees will hit the high watermark enough that the round-robin batch
freeing will free the order-3 entry in the cache.
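
To make the arithmetic concrete, here is a standalone toy model of the
budget accounting (illustration only: the list contents are made up and the
real free_pcppages_bulk()'s batch_free handling is left out):

#include <stdio.h>

#define NR_LISTS 6

/* Toy model, not kernel code: "count -= 1 << order" consumes the drain
 * budget, so a small batch may skip the order-3 list for one drain while
 * a batch of 31 reaches it within the same drain. */
static void drain(int count)
{
	int order[NR_LISTS] = { 0, 0, 0, 1, 2, 3 };	/* order per pindex */
	int pages[NR_LISTS] = { 8, 8, 8, 4, 4, 4 };	/* cached pages per list */
	int pindex = 0, remaining = 36, freed3 = 0;

	while (count > 0 && remaining > 0) {
		if (pages[pindex] > 0) {
			pages[pindex]--;
			remaining--;
			count -= 1 << order[pindex];
			if (order[pindex] == 3)
				freed3++;
		}
		pindex = (pindex + 1) % NR_LISTS;
	}
	printf("order-3 pages freed this drain: %d\n", freed3);
}

int main(void)
{
	drain(7);	/* prints 0: the budget is gone before pindex reaches order-3 */
	drain(31);	/* prints 2: the round-robin wraps around to the order-3 list */
	return 0;
}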

> 2. And, it seems that this logic penalizes high order pages. One free
> of a high order page means 1 << order pages freed rather than just
> one page freed. This logic does round-robin to choose the target page so
> the amount of freed pages will be different by the order. I think that it
> makes some sense because high order pages are less important to cache
> in pcp than lower orders but I'd like to know if it is intended or not.
> If intended, it deserves a comment.
> 

It's intended but I'm not sure what else you want me to explain outside
the code itself in this case. The round-robin nature of the bulk drain
already doesn't attach any special importance to the migrate type of the
list and there is no good reason to assume that high-order pages in the
cache when the high watermark is reached deserve special protection.

> 3. I guess that order-0 file/anon page alloc/free is dominant in many
> workloads. If this case happens, it invalidates the effect of the high order
> cache in pcp since cached high order pages would also be freed to the
> buddy when a burst of order-0 frees happens.
> 

A large burst of order-0 frees will free the high-order cache if it's not
being used but I don't see what your point is or why that is a problem.
It is pretty much guaranteed that there will be workloads that benefit
from protecting the high-order cache (SLUB-intensive alloc/free
intensive workloads) while others suffer (Fault-intensive map/unmap
workloads).

What's there at the moment behaves reasonably on a variety of workloads
across 8 machines.

> > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> >  	struct page *page;
> >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >  
> > -	if (likely(order == 0)) {
> > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> >  		struct per_cpu_pages *pcp;
> >  		struct list_head *list;
> >  
> >  		local_irq_save(flags);
> >  		do {
> > +			unsigned int pindex;
> > +
> > +			pindex = order_to_pindex(migratetype, order);
> >  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -			list = &pcp->lists[migratetype];
> > +			list = &pcp->lists[pindex];
> >  			if (list_empty(list)) {
> > -				pcp->count += rmqueue_bulk(zone, 0,
> > +				int nr_pages = rmqueue_bulk(zone, order,
> >  						pcp->batch, list,
> >  						migratetype, cold);
> 
> Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> * (1 << order) pages and pcp->count can easily overflow pcp->high
> * because list empty here doesn't mean that pcp->count is zero.
> 

Potentially a refill can cause a drain on another list. However, I adjusted
the high watermark in pageset_set_batch to make it unlikely that a single
refill will cause a drain and added a comment about it. I say unlikely
because it's not guaranteed. A carefully created workload could potentially
bring all the order-0 and some of the high-order caches close to the
watermark and then trigger a drain due to a refill of order-3.  The impact
is marginal and in itself does not warrant increasing the high watermark
to guarantee that no single refill can cause a drain on the next free.
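
(As a back-of-the-envelope illustration only, not the patch's exact tuning:
with a batch of 31 and PAGE_ALLOC_COSTLY_ORDER of 3, a single refill of the
order-3 list can add up to 31 * 8 = 248 pages to pcp->count, which is why
the high watermark needs headroom well beyond a single batch.)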

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  6:19       ` Vlastimil Babka
@ 2016-12-02  9:30         ` Hillf Danton
  -1 siblings, 0 replies; 40+ messages in thread
From: Hillf Danton @ 2016-12-02  9:30 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Mel Gorman', 'Andrew Morton'
  Cc: 'Christoph Lameter', 'Michal Hocko',
	'Johannes Weiner', 'Jesper Dangaard Brouer',
	'Linux-MM', 'Linux-Kernel'

On Friday, December 02, 2016 2:19 PM Vlastimil Babka wrote:
> On 12/02/2016 04:47 AM, Hillf Danton wrote:
> > On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
> >> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> >> defer debugging checks of pages allocated from the PCP") will allow the
> >> per-cpu list counter to be out of sync with the per-cpu list contents
> >> if a struct page is corrupted. This patch keeps the accounting in sync.
> >>
> >> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> >> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >> cc: stable@vger.kernel.org [4.7+]
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> >> ---
> >>  mm/page_alloc.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 6de9440e3ae2..777ed59570df 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >>  			unsigned long count, struct list_head *list,
> >>  			int migratetype, bool cold)
> >>  {
> >> -	int i;
> >> +	int i, alloced = 0;
> >>
> >>  	spin_lock(&zone->lock);
> >>  	for (i = 0; i < count; ++i) {
> >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >>  		else
> >>  			list_add_tail(&page->lru, list);
> >>  		list = &page->lru;
> >> +		alloced++;
> >>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
> >>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> >>  					      -(1 << order));
> >>  	}
> >>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> >
> > Now i is a pure index, yes?
> 
> No, even if a page fails the check_pcp_refill() check and is not
> "allocated", it is also no longer a free page, so it's correct to
> subtract it from NR_FREE_PAGES.
> 
Yes, we can allocate the free page next time.

thanks
Hillf

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  8:12     ` Michal Hocko
@ 2016-12-02  9:49       ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02  9:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri, Dec 02, 2016 at 09:12:17AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> > Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> > defer debugging checks of pages allocated from the PCP") will allow the
> > per-cpu list counter to be out of sync with the per-cpu list contents
> > if a struct page is corrupted. This patch keeps the accounting in sync.
> >
> > Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > cc: stable@vger.kernel.org [4.7+]
> 
> I am trying to think about what would happen if we did go out of sync
> and cannot spot a problem. Vlastimil has mentioned something about
> free_pcppages_bulk looping for ever but I cannot see it happening right
> now.

free_pcppages_bulk can loop infinitely if the page count is positive and
there are no pages. While I've only seen this during development, a
corrupted count loops here

                do {
                        batch_free++;
                        if (++pindex == NR_PCP_LISTS)
                                pindex = 0;
                        list = &pcp->lists[pindex];
                } while (list_empty(list));

It would only be seen in a situation where struct page corruption was
detected so it's rare.
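
As a standalone model (not the kernel code) of why that inner scan cannot
terminate once the counter and the lists disagree:

#include <stdio.h>
#include <stdbool.h>

#define NR_PCP_LISTS 6

/* Model only: the real loop has no exit condition once count is positive
 * and every list is empty; the sweep guard below exists purely so this
 * demonstration does not hang. */
int main(void)
{
	bool list_empty[NR_PCP_LISTS] = { true, true, true, true, true, true };
	int pindex = 0, batch_free = 0, sweeps = 0;

	do {
		batch_free++;
		if (++pindex == NR_PCP_LISTS) {
			pindex = 0;
			if (++sweeps == 3) {
				printf("all lists empty: the real loop would spin forever\n");
				return 1;
			}
		}
	} while (list_empty[pindex]);

	printf("found a non-empty list at pindex %d after %d steps\n",
	       pindex, batch_free);
	return 0;
}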

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  9:49       ` Mel Gorman
@ 2016-12-02 10:03         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02 10:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri 02-12-16 09:49:33, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 09:12:17AM +0100, Michal Hocko wrote:
> > On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> > > Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> > > defer debugging checks of pages allocated from the PCP") will allow the
> > > per-cpu list counter to be out of sync with the per-cpu list contents
> > > if a struct page is corrupted. This patch keeps the accounting in sync.
> > >
> > > Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > cc: stable@vger.kernel.org [4.7+]
> > 
> > I am trying to think about what would happen if we did go out of sync
> > and cannot spot a problem. Vlastimil has mentioned something about
> > free_pcppages_bulk looping for ever but I cannot see it happening right
> > now.
> 
> free_pcppages_bulk can loop infinitely if the page count is positive and
> there are no pages. While I've only seen this during development, a
> corrupted count loops here
> 
>                 do {
>                         batch_free++;
>                         if (++pindex == NR_PCP_LISTS)
>                                 pindex = 0;
>                         list = &pcp->lists[pindex];
>                 } while (list_empty(list));
> 
> It would only be seen in a situation where struct page corruption was
> detected so it's rare.

OK, I was apparently sleeping when responding. I focused on the outer
loop and that should just converge. But it is true that this inner loop
can just run away... Could you add that to the changelog please? This
definitely warrants a stable backport.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02  9:30         ` Hillf Danton
@ 2016-12-02 10:04           ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-12-02 10:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Vlastimil Babka', 'Mel Gorman',
	'Andrew Morton', 'Christoph Lameter',
	'Johannes Weiner', 'Jesper Dangaard Brouer',
	'Linux-MM', 'Linux-Kernel'

On Fri 02-12-16 17:30:07, Hillf Danton wrote:
[...]
> > >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> > >>  		else
> > >>  			list_add_tail(&page->lru, list);
> > >>  		list = &page->lru;
> > >> +		alloced++;
> > >>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
> > >>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > >>  					      -(1 << order));
> > >>  	}
> > >>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> > >
> > > Now i is a pure index, yes?
> > 
> > No, even if a page fails the check_pcp_refill() check and is not
> > "allocated", it is also no longer a free page, so it's correct to
> > subtract it from NR_FREE_PAGES.
> > 
> Yes, we can allocate the free page next time.

No we cannot. The page is gone from the free list. We have effectively
leaked it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted
  2016-12-02 10:04           ` Michal Hocko
@ 2016-12-02 11:02             ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-02 11:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, 'Vlastimil Babka', 'Andrew Morton',
	'Christoph Lameter', 'Johannes Weiner',
	'Jesper Dangaard Brouer', 'Linux-MM',
	'Linux-Kernel'

On Fri, Dec 02, 2016 at 11:04:11AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 17:30:07, Hillf Danton wrote:
> [...]
> > > >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> > > >>  		else
> > > >>  			list_add_tail(&page->lru, list);
> > > >>  		list = &page->lru;
> > > >> +		alloced++;
> > > >>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
> > > >>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > > >>  					      -(1 << order));
> > > >>  	}
> > > >>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> > > >
> > > > Now i is a pure index, yes?
> > > 
> > > No, even if a page fails the check_pcp_refill() check and is not
> > > "allocated", it is also no longer a free page, so it's correct to
> > > subtract it from NR_FREE_PAGES.
> > > 
> > Yes, we can allocate the free page next time.
> 
> No we cannot. The page is gone from the free list. We have effectively
> leaked it.

And deliberately so. It's in an unknown state, possibly due to memory
corruption or a use-after-free bug. The machine can continue limping on
with warnings in the kernel log but the VM stops going near the page
itself as much as possible.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  9:04       ` Mel Gorman
@ 2016-12-05  3:06         ` Joonsoo Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Joonsoo Kim @ 2016-12-05  3:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  			if (unlikely(isolated_pageblocks))
> > >  				mt = get_pageblock_migratetype(page);
> > >  
> > > +			nr_freed += (1 << order);
> > > +			count -= (1 << order);
> > >  			if (bulkfree_pcp_prepare(page))
> > >  				continue;
> > >  
> > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > -		} while (--count && --batch_free && !list_empty(list));
> > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > >  	}
> > >  	spin_unlock(&zone->lock);
> > > +	pcp->count -= nr_freed;
> > >  }
> > 
> > > I guess that this patch would cause the following problems.
> > > 
> > > 1. If pcp->batch is too small, high order pages will not be freed
> > > easily and will survive longer. Think about the following situation.
> > 
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> > 
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> > 
> 
> You're relying on the batch count to be 7 where in a lot of cases it's
> 31. Even if low batch counts are common on another platform or you adjusted
> the other counts to be higher values until they equal 30, it would only be for
> this one drain that no order-3 pages were freed. It's not a permanent situation.
> 
> When or if it gets freed depends on the allocation request stream but the
> same applies to the existing caches. If a high-order request arrives, it'll
> be used. If all the requests are for the other orders, then eventually
> the frees will hit the high watermark enough that the round-robin batch
> freeing will free the order-3 entry in the cache.

I know that it isn't a permanent situation and it depends on the workload.
However, it is clearly an unfair situation. We don't have any good reason
to cache higher order freepages longer. Besides, a batch count of 7 means
that it is a small system. In this kind of system, there is no reason
to keep high order freepages in the cache for longer.

The other potential problem is that if we change
PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
doesn't guarantee that free_pcppages_bulk() will work fairly and we
will not notice it easily.

I think that it can be simply solved by maintaining a last pindex in
pcp. How about it?

> 
> > 2. And, it seems that this logic penalizes high order pages. One free
> > of a high order page means 1 << order pages freed rather than just
> > one page freed. This logic does round-robin to choose the target page so
> > the amount of freed pages will be different by the order. I think that it
> > makes some sense because high order pages are less important to cache
> > in pcp than lower orders but I'd like to know if it is intended or not.
> > If intended, it deserves a comment.
> > 
> 
> It's intended but I'm not sure what else you want me to explain outside
> the code itself in this case. The round-robin nature of the bulk drain
> already doesn't attach any special importance to the migrate type of the
> list and there is no good reason to assume that high-order pages in the
> cache when the high watermark is reached deserve special protection.

The non-trivial part is that the round-robin approach penalizes high-order
page caching. We usually think that round-robin is fair but, in this
case, it isn't. Some people will notice that the amount freed per turn is
different but some may not. It is different from the past situation, where the
amount freed per turn was the same even if the migratetype differed. I
think that it deserves a comment but I don't feel strongly about it.

> > 3. I guess that order-0 file/anon page alloc/free is dominant in many
> > workloads. If this case happens, it invalidates the effect of the high order
> > cache in pcp since cached high order pages would also be freed to the
> > buddy when a burst of order-0 frees happens.
> > 
> 
> A large burst of order-0 frees will free the high-order cache if it's not
> being used but I don't see what your point is or why that is a problem.
> It is pretty much guaranteed that there will be workloads that benefit
> from protecting the high-order cache (SLUB-intensive alloc/free
> intensive workloads) while others suffer (Fault-intensive map/unmap
> workloads).
> 
> What's there at the moment behaves reasonably on a variety of workloads
> across 8 machines.

Yes, I see that this patch improves some workloads. What I'd like to say
is that I found some weaknesses and, if they are fixed, we can get a better
result. This patch implements a unified pcp cache for migratetype and
high-order but, if we separate them and manage the number of cached items
separately, we would not have the above problem. Could you explain the
reason not to implement a separate cache for high order?

> 
> > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > >  	struct page *page;
> > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > >  
> > > -	if (likely(order == 0)) {
> > > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > >  		struct per_cpu_pages *pcp;
> > >  		struct list_head *list;
> > >  
> > >  		local_irq_save(flags);
> > >  		do {
> > > +			unsigned int pindex;
> > > +
> > > +			pindex = order_to_pindex(migratetype, order);
> > >  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > -			list = &pcp->lists[migratetype];
> > > +			list = &pcp->lists[pindex];
> > >  			if (list_empty(list)) {
> > > -				pcp->count += rmqueue_bulk(zone, 0,
> > > +				int nr_pages = rmqueue_bulk(zone, order,
> > >  						pcp->batch, list,
> > >  						migratetype, cold);
> > 
> > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > * because list empty here doesn't mean that pcp->count is zero.
> > 
> 
> Potentially a refill can cause a drain on another list. However, I adjusted
> the high watermark in pageset_set_batch to make it unlikely that a single
> refill will cause a drain and added a comment about it. I say unlikely
> because it's not guaranteed. A carefully created workload could potentially
> bring all the order-0 and some of the high-order caches close to the
> watermark and then trigger a drain due to a refill of order-3.  The impact
> is marginal and in itself does not warrant increasing the high watermark
> to guarantee that no single refill can cause a drain on the next free.

Hmm... What makes me wonder is that alloc/free isn't symmetric.
Frees in free_pcppages_bulk() are done until the number of freed pages
reaches the batch, and high order pages are counted as 1 << order. But, in
the refill here, a high order page is counted as a single page rather than
1 << order. Is this asymmetric alloc/free intended? Why do we cache the
same number of high order pages as order-0 pages in one
batch?
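
(Concretely, and only as an illustration: with a batch of 31, a drain whose
budget is consumed entirely by the order-3 list stops after four pages
because 4 * 8 = 32 exceeds 31, while a refill of the order-3 list puts 31
pages, i.e. 31 * 8 = 248 base pages, onto it.)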

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-02  8:21       ` Michal Hocko
@ 2016-12-05  3:10         ` Joonsoo Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Joonsoo Kim @ 2016-12-05  3:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Fri, Dec 02, 2016 at 09:21:08AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
> [...]
> > > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> > >   impossible for the caller to know exactly how many pages were freed.
> > >   Due to the high-order caches, the number of pages drained for a request
> > >   is no longer precise.
> > > 
> > > o The high watermark for per-cpu pages is increased to reduce the probability
> > >   that a single refill causes a drain on the next free.
> [...]
> > I guess that this patch would cause the following problems.
> > 
> > 1. If pcp->batch is too small, high order pages will not be freed
> > easily and will survive longer. Think about the following situation.
> > 
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> > 
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> 
> I guess the second paragraph above in the changelog tries to clarify
> that...

It doesn't perfectly clarify my concern. This is a different problem.

>  
> > 2. And, it seems that this logic penalizes high order pages. One free
> > of a high order page means 1 << order pages freed rather than just
> > one page freed. This logic does round-robin to choose the target page so
> > the amount of freed pages will be different by the order.
> 
> Yes this is indeed possible. The first paragraph above mentions this
> problem.

Yes, it is mentioned briefly but we cannot easily notice that the above
penalty for high order pages is there.

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-05  3:06         ` Joonsoo Kim
@ 2016-12-05  9:57           ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-05  9:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > >  			if (unlikely(isolated_pageblocks))
> > > >  				mt = get_pageblock_migratetype(page);
> > > >  
> > > > +			nr_freed += (1 << order);
> > > > +			count -= (1 << order);
> > > >  			if (bulkfree_pcp_prepare(page))
> > > >  				continue;
> > > >  
> > > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > > -		} while (--count && --batch_free && !list_empty(list));
> > > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > > >  	}
> > > >  	spin_unlock(&zone->lock);
> > > > +	pcp->count -= nr_freed;
> > > >  }
> > > 
> > > I guess that this patch would cause following problems.
> > > 
> > > 1. If pcp->batch is too small, high order page will not be freed
> > > easily and survive longer. Think about following situation.
> > > 
> > > Batch count: 7
> > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > -> order 2...
> > > 
> > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > so order 3 would not be freed.
> > > 
> > 
> > You're relying on the batch count to be 7 where in a lot of cases it's
> > 31. Even if low batch counts are common on another platform or you adjusted
> > the other counts to be higher values until they equal 30, it would be for
> > this drain that no order-3 pages were freed. It's not a permanent situation.
> > 
> > When or if it gets freed depends on the allocation request stream but the
> > same applies to the existing caches. If a high-order request arrives, it'll
> > be used. If all the requests are for the other orders, then eventually
> > the frees will hit the high watermark enough that the round-robin batch
> > freeing fill free the order-3 entry in the cache.
> 
> I know that it isn't a permanent situation and it depends on workload.
> However, it is clearly an unfair situation. We don't have any good reason
> to cache higher order freepage longer. Even, batch count 7 means
> that it is a small system. In this kind of system, there is no reason
> to keep high order freepage longer in the cache.
> 

Without knowing the future allocation request stream, there is no reason
to favour one part of the per-cpu cache over another. To me, it's not
actually clear at all it's an unfair situation, particularly given that the
vanilla code is also unfair -- the vanilla code can artificially preserve
MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
The only deciding factor there was that a fault-intensive workload would mask
overhead of the page allocator due to page zeroing cost, which UNMOVABLE
allocations may or may not require. Even that is vague considering that
page-table allocations are zeroed even if many kernel allocations are not.

> The other potential problem is that if we change
> PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> doesn't guarantee that free_pcppages_bulk() will work fairly and we
> will not notice it easily.
> 

In the event the high-order cache is increased, then the high watermark
would also need to be adjusted to account for that just as this patch
does.

> I think that it can be simply solved by maintaining a last pindex in
> pcp. How about it?
> 

That would rely on the previous allocation stream to drive the freeing
which is slightly related to the fact the per-cpu cache contents are
related to the previous request stream. It's still not guaranteed to be
related to the future request stream.

Adding a new pindex cache adds complexity to the free path without any
guarantee it benefits anything. The use of such a heuristic should be
driven by a workload demonstrating it's a problem. Granted, half of the
cost of a free operation is due to irq enable/disable, but there is no
reason to make it unnecessarily expensive.

> > > 3. I guess that order-0 file/anon page alloc/free is dominent in many
> > > workloads. If this case happen, it invalidates effect of high order
> > > cache in pcp since cached high order pages would be also freed to the
> > > buddy when burst order-0 free happens.
> > > 
> > 
> > A large burst of order-0 frees will free the high-order cache if it's not
> > being used but I don't see what your point is or why that is a problem.
> > It is pretty much guaranteed that there will be workloads that benefit
> > from protecting the high-order cache (SLUB-intensive alloc/free
> > intensive workloads) while others suffer (Fault-intensive map/unmap
> > workloads).
> > 
> > What's there at the moment behaves reasonably on a variety of workloads
> > across 8 machines.
> 
> Yes, I see that this patch improves some workloads. What I like to say
> is that I find some weakness and if it is fixed, we can get better
> result. This patch implement unified pcp cache for migratetype and
> high-order but if we separate them and manage number of cached items
> separately, we would not have above problem. Could you teach me the
> reason not to implement the separate cache for high order?
> 

It's additional complexity and a separate cache that would require separate
batch counts and high watermarks for order-0 and high-order for a problem
that is not demonstrated as being necessary by a workload on any platform.

> > 
> > > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > >  	struct page *page;
> > > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > >  
> > > > -	if (likely(order == 0)) {
> > > > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > > >  		struct per_cpu_pages *pcp;
> > > >  		struct list_head *list;
> > > >  
> > > >  		local_irq_save(flags);
> > > >  		do {
> > > > +			unsigned int pindex;
> > > > +
> > > > +			pindex = order_to_pindex(migratetype, order);
> > > >  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > -			list = &pcp->lists[migratetype];
> > > > +			list = &pcp->lists[pindex];
> > > >  			if (list_empty(list)) {
> > > > -				pcp->count += rmqueue_bulk(zone, 0,
> > > > +				int nr_pages = rmqueue_bulk(zone, order,
> > > >  						pcp->batch, list,
> > > >  						migratetype, cold);
> > > 
> > > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > > * because list empty here doesn't mean that pcp->count is zero.
> > > 
> > 
> > Potentially a refill can cause a drain on another list. However, I adjusted
> > the high watermark in pageset_set_batch to make it unlikely that a single
> > refill will cause a drain and added a comment about it. I say unlikely
> > because it's not guaranteed. A carefully created workload could potentially
> > bring all the order-0 and some of the high-order caches close to the
> > watermark and then trigger a drain due to a refill of order-3.  The impact
> > is marginal and in itself does not warrent increasing the high watermark
> > to guarantee that no single refill can cause a drain on the next free.
> 
> Hmm... What makes me wonder is that alloc/free isn't symmetric.
> Free in free_pcppages_bulk() are done until number of freed pages
> becomes the batch. High order pages are counted as 1 << order.
> But, in refill here, counting high order pages is single one rather than
> 1 << order. This asymmetric alloc/free is intended? Why do we cache same
> number of high order page with the number of order-0 page in one
> batch?
> 

Because on the alloc side, we're batching the number of operations done
under the lock and on the free side, we're concerned with how much memory
is pinned by the per-cpu cache. There are different options that could
be taken such as accounting for the number of list elements instead of
order-0 pages but that will make the size of the per-cpu cache variable
without necessarily being beneficial. Pretty much anything related to the
per-cpu cache is an optimistic heuristic on what is beneficial to cache
and when with the view to reducing operations taken under the zone->lock.
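
To make that asymmetry concrete, a rough user-space sketch (the constants
follow the discussion above; the helper bodies are simplified assumptions,
not the patch's code):

#include <stdio.h>

/* Alloc side: a refill is bounded by the number of operations taken
 * under zone->lock, i.e. pcp->batch list elements, whatever the order. */
static int refill_elements(int batch)
{
	return batch;
}

/* Free side: the drain budget is counted in order-0 pages, so one
 * high-order element consumes 1 << order of it. */
static int elements_drained(int batch, int order)
{
	int count = batch, elements = 0;

	while (count > 0) {
		count -= 1 << order;
		elements++;
	}
	return elements;
}

int main(void)
{
	/* batch = 31, order = 3: one refill adds 31 elements (248 pages),
	 * while one drain of that list removes only 4 elements (32 pages). */
	printf("refill: %d elements\n", refill_elements(31));
	printf("drain:  %d elements\n", elements_drained(31, 3));
	return 0;
}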

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-05  9:57           ` Mel Gorman
@ 2016-12-06  2:43             ` Joonsoo Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Joonsoo Kim @ 2016-12-06  2:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Mon, Dec 05, 2016 at 09:57:39AM +0000, Mel Gorman wrote:
> On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > >  			if (unlikely(isolated_pageblocks))
> > > > >  				mt = get_pageblock_migratetype(page);
> > > > >  
> > > > > +			nr_freed += (1 << order);
> > > > > +			count -= (1 << order);
> > > > >  			if (bulkfree_pcp_prepare(page))
> > > > >  				continue;
> > > > >  
> > > > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > > > -		} while (--count && --batch_free && !list_empty(list));
> > > > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > > > >  	}
> > > > >  	spin_unlock(&zone->lock);
> > > > > +	pcp->count -= nr_freed;
> > > > >  }
> > > > 
> > > > I guess that this patch would cause following problems.
> > > > 
> > > > 1. If pcp->batch is too small, high order page will not be freed
> > > > easily and survive longer. Think about following situation.
> > > > 
> > > > Batch count: 7
> > > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > > -> order 2...
> > > > 
> > > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > > so order 3 would not be freed.
> > > > 
> > > 
> > > You're relying on the batch count to be 7 where in a lot of cases it's
> > > 31. Even if low batch counts are common on another platform or you adjusted
> > > the other counts to be higher values until they equal 30, it would be for
> > > this drain that no order-3 pages were freed. It's not a permanent situation.
> > > 
> > > When or if it gets freed depends on the allocation request stream but the
> > > same applies to the existing caches. If a high-order request arrives, it'll
> > > be used. If all the requests are for the other orders, then eventually
> > > the frees will hit the high watermark enough that the round-robin batch
> > > freeing fill free the order-3 entry in the cache.
> > 
> > I know that it isn't a permanent situation and it depends on workload.
> > However, it is clearly an unfair situation. We don't have any good reason
> > to cache higher order freepage longer. Even, batch count 7 means
> > that it is a small system. In this kind of system, there is no reason
> > to keep high order freepage longer in the cache.
> > 
> 
> Without knowing the future allocation request stream, there is no reason
> to favour one part of the per-cpu cache over another. To me, it's not

What I suggest is exactly that: don't favour one part of the per-cpu cache
over another.

> actually clear at all it's an unfair situation, particularly given that the
> vanilla code is also unfair -- the vanilla code can artificially preserve
> MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> The only deciding factor there was that a fault-intensive workload would mask
> overhead of the page allocator due to page zeroing cost, which UNMOVABLE
> allocations may or may not require. Even that is vague considering that
> page-table allocations are zeroed even if many kernel allocations are not.

"Vanilla works like that" doesn't seem to be reasonable to justify
this change.  Vanilla code works with three lists and it now become
six lists and each list can have different size of page. We need to
think that previous approach will also work fine with current one. I
think that there is a problem although it's not permanent and would be
minor. However, it's better to fix it when it is found.

> > The other potential problem is that if we change
> > PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> > doesn't guarantee that free_pcppages_bulk() will work fairly and we
> > will not notice it easily.
> > 
> 
> In the event the high-order cache is increased, then the high watermark
> would also need to be adjusted to account for that just as this patch
> does.

pcp->high will be adjusted automatically when the high-order cache is
increased by your change. What we miss is pcp->batch, and there is nothing
indicating that the number of high-order caches and pcp->batch
are associated.

> > I think that it can be simply solved by maintaining a last pindex in
> > pcp. How about it?
> > 
> 
> That would rely on the previous allocation stream to drive the freeing
> which is slightly related to the fact the per-cpu cache contents are
> related to the previous request stream. It's still not guaranteed to be
> related to the future request stream.
> 
> Adding a new pindex cache adds complexity to the free path without any
> guarantee it benefits anything. The use of such a heuristic should be

It provides the benefit of preventing high-order pages from surviving
longer in the cache, for any pcp->batch and any PAGE_ALLOC_COSTLY_ORDER
setup.

> driven by a workload demonstrating it's a problem. Granted, half of the
> cost of a free operation is due to irq enable/disable, but there is no
> reason to make it unnecessarily expensive.

I think that it's not that complex. All we need is just one extra
variable in pcp, read and written once in free_pcppages_bulk().

free_pcppages_bulk()
{
        int pindex = pcp->last_pindex

        ...

        pcp->last_pindex = pindex
        return
}
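
Fleshed out a little, still purely as a sketch: it assumes per_cpu_pages
grows a last_pindex field, and the structure/arrays below are stand-ins for
the real free lists (the real loop also frees a small batch per list rather
than a single element):

#include <stdio.h>

#define NR_PCP_LISTS 6

struct pcp_sketch {
	int last_pindex;		/* assumed new field */
	int nr_on_list[NR_PCP_LISTS];	/* elements per pindex */
	int order_of[NR_PCP_LISTS];	/* page order per pindex */
};

static void drain_sketch(struct pcp_sketch *pcp, int count)
{
	int pindex = pcp->last_pindex;
	int empty_scans = 0;

	while (count > 0 && empty_scans < NR_PCP_LISTS) {
		pindex = (pindex + 1) % NR_PCP_LISTS;
		if (!pcp->nr_on_list[pindex]) {
			empty_scans++;
			continue;
		}
		/* Free one element; it costs 1 << order of the budget,
		 * matching the accounting in the patch. */
		pcp->nr_on_list[pindex]--;
		count -= 1 << pcp->order_of[pindex];
		empty_scans = 0;
	}
	/* Remember where this drain stopped so the next one resumes
	 * here instead of always starting from pindex 0. */
	pcp->last_pindex = pindex;
}

int main(void)
{
	struct pcp_sketch pcp = {
		.last_pindex = NR_PCP_LISTS - 1,
		.nr_on_list = { 5, 5, 5, 3, 2, 1 },
		.order_of = { 0, 0, 0, 1, 2, 3 },
	};

	drain_sketch(&pcp, 7);	/* first drain stops at the order-2 list */
	drain_sketch(&pcp, 7);	/* second drain resumes at the order-3 list */
	printf("order-3 elements left: %d\n", pcp.nr_on_list[5]);
	return 0;
}

With a round-robin that always restarts from pindex 0, the second drain in
this example would again spend its whole budget on the lower orders.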

> 
> > > > 3. I guess that order-0 file/anon page alloc/free is dominent in many
> > > > workloads. If this case happen, it invalidates effect of high order
> > > > cache in pcp since cached high order pages would be also freed to the
> > > > buddy when burst order-0 free happens.
> > > > 
> > > 
> > > A large burst of order-0 frees will free the high-order cache if it's not
> > > being used but I don't see what your point is or why that is a problem.
> > > It is pretty much guaranteed that there will be workloads that benefit
> > > from protecting the high-order cache (SLUB-intensive alloc/free
> > > intensive workloads) while others suffer (Fault-intensive map/unmap
> > > workloads).
> > > 
> > > What's there at the moment behaves reasonably on a variety of workloads
> > > across 8 machines.
> > 
> > Yes, I see that this patch improves some workloads. What I like to say
> > is that I find some weakness and if it is fixed, we can get better
> > result. This patch implement unified pcp cache for migratetype and
> > high-order but if we separate them and manage number of cached items
> > separately, we would not have above problem. Could you teach me the
> > reason not to implement the separate cache for high order?
> > 
> 
> It's additional complexity and a separate cache that would require separate
> batch counts and high watermarks for order-0 and high-order for a problem
> that is not demonstrated as being necessary by a workload on any platform.

It is a potential problem of this implementation, so no one can
demonstrate that separating is necessary without testing this patch widely.

The other problem that I can guess at is that caching high-order pages in
the pcp on a highly fragmented system will cause more compaction and a negative
performance impact. It would be far more expensive than the merit of reducing
lock contention, so we need to set up the pcp's high/batch for high-order
pages carefully. Perhaps separating them would be helpful here.

Anyway, nothing is proven, so I won't insist further.

> > > 
> > > > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > > >  	struct page *page;
> > > > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > > >  
> > > > > -	if (likely(order == 0)) {
> > > > > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > > > >  		struct per_cpu_pages *pcp;
> > > > >  		struct list_head *list;
> > > > >  
> > > > >  		local_irq_save(flags);
> > > > >  		do {
> > > > > +			unsigned int pindex;
> > > > > +
> > > > > +			pindex = order_to_pindex(migratetype, order);
> > > > >  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > > -			list = &pcp->lists[migratetype];
> > > > > +			list = &pcp->lists[pindex];
> > > > >  			if (list_empty(list)) {
> > > > > -				pcp->count += rmqueue_bulk(zone, 0,
> > > > > +				int nr_pages = rmqueue_bulk(zone, order,
> > > > >  						pcp->batch, list,
> > > > >  						migratetype, cold);
> > > > 
> > > > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > > > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > > > * because list empty here doesn't mean that pcp->count is zero.
> > > > 
> > > 
> > > Potentially a refill can cause a drain on another list. However, I adjusted
> > > the high watermark in pageset_set_batch to make it unlikely that a single
> > > refill will cause a drain and added a comment about it. I say unlikely
> > > because it's not guaranteed. A carefully created workload could potentially
> > > bring all the order-0 and some of the high-order caches close to the
> > > watermark and then trigger a drain due to a refill of order-3.  The impact
> > > is marginal and in itself does not warrent increasing the high watermark
> > > to guarantee that no single refill can cause a drain on the next free.
> > 
> > Hmm... What makes me wonder is that alloc/free isn't symmetric.
> > Free in free_pcppages_bulk() are done until number of freed pages
> > becomes the batch. High order pages are counted as 1 << order.
> > But, in refill here, counting high order pages is single one rather than
> > 1 << order. This asymmetric alloc/free is intended? Why do we cache same
> > number of high order page with the number of order-0 page in one
> > batch?
> > 
> 
> Because on the alloc side, we're batching the number of operations done
> under the lock and on the free side, we're concerned with how much memory
> is pinned by the per-cpu cache. There are different options that could
> be taken such as accounting for the number of list elements instead of
> order-0 pages but that will make the size of the per-cpu cache variable
> without necessarily being beneficial. Pretty much anything related to the
> per-cpu cache is an optimistic heuristic on what is beneficial to cache
> and when with the view to reducing operations taken under the zone->lock.

As far as I know, the maximum memory pinned by the pcp is controlled by
pcp->high. pcp->batch is all about reducing lock contention, even when it
is used on the free side. If the meaning of pcp->batch differs between the
alloc and free sides, it's better to use another variable. It is really
complicated to reason about and implement different concepts using a single
variable.

And, if the alloc side is allowed to cache pcp->batch elements of each type,
we cannot control how much memory is pinned by the pcp. I know that you
added a comment about it, but breaking the design assumption that
pcp->high is the maximum memory pinned by the pcp does not look good. Is there
any way to fix it?
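
For reference, a worked example of that pinning concern (the values are just
the ones discussed in this thread; pcp->high here is illustrative, not the
patch's actual setting):

#include <stdio.h>

int main(void)
{
	int high = 186;			/* illustrative pcp->high */
	int batch = 31;			/* common pcp->batch */
	int order = 3;			/* PAGE_ALLOC_COSTLY_ORDER */

	/* A refill is triggered by an empty list, not by pcp->count, so
	 * it can fire while pcp->count is already close to pcp->high. */
	int count = high - 1;
	count += batch * (1 << order);	/* one order-3 refill: 248 pages */

	printf("pcp->count = %d vs pcp->high = %d\n", count, high);
	return 0;
}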

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
  2016-12-06  2:43             ` Joonsoo Kim
@ 2016-12-06 13:53               ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2016-12-06 13:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Jesper Dangaard Brouer, Linux-MM, Linux-Kernel

On Tue, Dec 06, 2016 at 11:43:45AM +0900, Joonsoo Kim wrote:
> > actually clear at all it's an unfair situation, particularly given that the
> > vanilla code is also unfair -- the vanilla code can artificially preserve
> > MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> > The only deciding factor there was that a fault-intensive workload would mask
> > overhead of the page allocator due to page zeroing cost, which UNMOVABLE
> > allocations may or may not require. Even that is vague considering that
> > page-table allocations are zeroed even if many kernel allocations are not.
> 
> "Vanilla works like that" doesn't seem to be reasonable to justify
> this change.  Vanilla code works with three lists and it now become
> six lists and each list can have different size of page. We need to
> think that previous approach will also work fine with current one. I
> think that there is a problem although it's not permanent and would be
> minor. However, it's better to fix it when it is found.
> 

This is going in circles. I prototyped the modification which increases
the per-cpu structure slightly and will evaluate. It takes about a day
to run through the full set of tests. If it causes no harm, I'll release
another version.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-12-06 13:53 UTC | newest]

Thread overview: 40+ messages
2016-12-02  0:22 [PATCH 0/2] High-order per-cpu cache v5 Mel Gorman
2016-12-02  0:22 ` Mel Gorman
2016-12-02  0:22 ` [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted Mel Gorman
2016-12-02  0:22   ` Mel Gorman
2016-12-02  3:47   ` Hillf Danton
2016-12-02  3:47     ` Hillf Danton
2016-12-02  6:19     ` Vlastimil Babka
2016-12-02  6:19       ` Vlastimil Babka
2016-12-02  9:30       ` Hillf Danton
2016-12-02  9:30         ` Hillf Danton
2016-12-02 10:04         ` Michal Hocko
2016-12-02 10:04           ` Michal Hocko
2016-12-02 11:02           ` Mel Gorman
2016-12-02 11:02             ` Mel Gorman
2016-12-02  8:12   ` Michal Hocko
2016-12-02  8:12     ` Michal Hocko
2016-12-02  9:49     ` Mel Gorman
2016-12-02  9:49       ` Mel Gorman
2016-12-02 10:03       ` Michal Hocko
2016-12-02 10:03         ` Michal Hocko
2016-12-02  0:22 ` [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5 Mel Gorman
2016-12-02  0:22   ` Mel Gorman
2016-12-02  6:03   ` Joonsoo Kim
2016-12-02  6:03     ` Joonsoo Kim
2016-12-02  8:21     ` Michal Hocko
2016-12-02  8:21       ` Michal Hocko
2016-12-05  3:10       ` Joonsoo Kim
2016-12-05  3:10         ` Joonsoo Kim
2016-12-02  9:04     ` Mel Gorman
2016-12-02  9:04       ` Mel Gorman
2016-12-05  3:06       ` Joonsoo Kim
2016-12-05  3:06         ` Joonsoo Kim
2016-12-05  9:57         ` Mel Gorman
2016-12-05  9:57           ` Mel Gorman
2016-12-06  2:43           ` Joonsoo Kim
2016-12-06  2:43             ` Joonsoo Kim
2016-12-06 13:53             ` Mel Gorman
2016-12-06 13:53               ` Mel Gorman
2016-12-02  8:25   ` Michal Hocko
2016-12-02  8:25     ` Michal Hocko
