linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
@ 2016-11-27 13:19 Mel Gorman
  2016-11-28 11:00 ` Vlastimil Babka
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Mel Gorman @ 2016-11-27 13:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel, Mel Gorman

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concerns have two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some concerns
about the zone lock contention by extending the per-cpu page allocator to
cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage but for
  some workloads, this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
  remaining are high-order caches up to and including
  PAGE_ALLOC_COSTLY_ORDER (the mapping is sketched after this list).

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed.
  Due to the high-order caches, the number of pages drained for a request
  is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
  that a single refill causes a drain on the next free.
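
For illustration, the following standalone sketch mirrors the indexing
helpers added by this patch. MIGRATE_PCPTYPES and PAGE_ALLOC_COSTLY_ORDER
are hard-coded to assumed typical values (3 and 3, giving NR_PCP_LISTS == 6)
and the batch of 31 is only an assumed typical pcp batch; the kernel takes
these from mmzone.h and zone_batchsize() respectively.

  /*
   * Standalone sketch of the new pcp list layout. The two helpers are
   * copied from the patch below; the constants are assumed typical values.
   */
  #include <stdio.h>

  #define MIGRATE_PCPTYPES        3  /* unmovable, movable, reclaimable */
  #define PAGE_ALLOC_COSTLY_ORDER 3
  #define NR_PCP_LISTS            (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

  static unsigned int order_to_pindex(int migratetype, unsigned int order)
  {
          return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
  }

  static unsigned int pindex_to_order(unsigned int pindex)
  {
          return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
  }

  int main(void)
  {
          unsigned int pindex;
          unsigned long batch = 31;  /* assumed typical pcp batch */

          /* pindex 0..2 are the order-0 per-migratetype lists,
           * pindex 3..5 cache orders 1..3 (shared by all migratetypes).
           */
          for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
                  printf("pindex %u caches order %u\n",
                         pindex, pindex_to_order(pindex));

          printf("an order-3 free lands on pindex %u\n",
                 order_to_pindex(1 /* MIGRATE_MOVABLE */, 3));

          /* The raised high watermark from pageset_set_batch() below:
           * old high = 6 * batch = 186 pages,
           * new high = batch * MIGRATE_PCPTYPES +
           *            (batch << PAGE_ALLOC_COSTLY_ORDER) = 341 pages.
           */
          printf("new pcp high watermark: %lu pages\n",
                 batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER));
          return 0;
  }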

The benefit depends on both the workload and the machine as ultimately the
determining factor is whether cache line bouncing or contention on zone->lock
is a problem. The patch was tested on a variety of workloads and machines,
some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v3
Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)

1-socket 6 year old machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v3
Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)

This is somewhat dramatic but it's also not universal. For example, it was
observed on an older HP machine using pcc-cpufreq that there was almost
no difference but pcc-cpufreq is also a known performance hazard.

These quite different results illustrate that the benefit of the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
                         4.9.0-rc5             4.9.0-rc5
                           vanilla             hopcpu-v3
Hmean    14        93.90 (  0.00%)       92.79 ( -1.18%)
Hmean    100     1211.02 (  0.00%)     1286.66 (  6.25%)
Hmean    300     3081.59 (  0.00%)     3347.84 (  8.64%)
Hmean    500     4614.19 (  0.00%)     4953.23 (  7.35%)
Hmean    850     6521.74 (  0.00%)     6951.72 (  6.59%)
Stddev   14         0.89 (  0.00%)        3.24 (-264.75%)
Stddev   100        5.95 (  0.00%)        8.88 (-49.27%)
Stddev   300       11.16 (  0.00%)       28.13 (-151.98%)
Stddev   500       36.32 (  0.00%)       42.07 (-15.84%)
Stddev   850       29.61 (  0.00%)       66.73 (-125.36%)

sockperf-udp-throughput
                         4.9.0-rc5             4.9.0-rc5
                           vanilla             hopcpu-v3
Hmean    14        16.82 (  0.00%)       25.23 ( 50.03%)
Hmean    100      119.91 (  0.00%)      180.63 ( 50.65%)
Hmean    300      358.11 (  0.00%)      539.29 ( 50.59%)
Hmean    500      595.16 (  0.00%)      892.15 ( 49.90%)
Hmean    850      989.44 (  0.00%)     1496.01 ( 51.20%)
Stddev   14         0.05 (  0.00%)        0.10 (-116.02%)
Stddev   100        0.53 (  0.00%)        1.12 (-111.23%)
Stddev   300        1.43 (  0.00%)        1.58 (-10.21%)
Stddev   500        3.93 (  0.00%)        5.14 (-30.95%)
Stddev   850        4.02 (  0.00%)        6.46 (-60.64%)

Note that the improvements for TCP are nowhere near as dramatic as for
netperf, there is a slight loss for small packets and the results are much
more variable. While it's not presented here, it's known that when running
sockperf "under load", packet latency is generally lower but not universally
so. On the other hand, UDP shows improved performance but again is much
more variable.

This highlights that the patch is not necessarily a universal win and is
going to depend heavily on both the workload and the CPU used.

hackbench was also tested with both sockets and pipes, and with both
processes and threads, and the results are interesting in terms of how
variability is impacted.

1-socket machine
hackbench-process-pipes
                        4.9.0-rc5             4.9.0-rc5
                          vanilla        highmark-v1r12
Amean    1      12.9637 (  0.00%)     13.1807 ( -1.67%)
Amean    3      13.4770 (  0.00%)     13.6803 ( -1.51%)
Amean    5      18.5333 (  0.00%)     18.7383 ( -1.11%)
Amean    7      24.5690 (  0.00%)     23.0550 (  6.16%)
Amean    12     39.7990 (  0.00%)     36.7207 (  7.73%)
Amean    16     56.0520 (  0.00%)     48.2890 ( 13.85%)
Stddev   1       0.3847 (  0.00%)      0.5853 (-52.15%)
Stddev   3       0.2652 (  0.00%)      0.0295 ( 88.89%)
Stddev   5       0.5589 (  0.00%)      0.2466 ( 55.87%)
Stddev   7       0.5310 (  0.00%)      0.6680 (-25.79%)
Stddev   12      1.0780 (  0.00%)      0.3230 ( 70.04%)
Stddev   16      2.1138 (  0.00%)      0.6835 ( 67.66%)

hackbench-process-sockets
Amean    1       4.8873 (  0.00%)      4.7180 (  3.46%)
Amean    3      14.1157 (  0.00%)     14.3643 ( -1.76%)
Amean    5      22.5537 (  0.00%)     23.1380 ( -2.59%)
Amean    7      30.3743 (  0.00%)     31.1520 ( -2.56%)
Amean    12     49.1773 (  0.00%)     50.3060 ( -2.30%)
Amean    16     64.0873 (  0.00%)     66.2633 ( -3.40%)
Stddev   1       0.2360 (  0.00%)      0.2201 (  6.74%)
Stddev   3       0.0539 (  0.00%)      0.0780 (-44.72%)
Stddev   5       0.1463 (  0.00%)      0.1579 ( -7.90%)
Stddev   7       0.1260 (  0.00%)      0.3091 (-145.31%)
Stddev   12      0.2169 (  0.00%)      0.4822 (-122.36%)
Stddev   16      0.0529 (  0.00%)      0.4513 (-753.20%)

It's not a universal win for pipes but the differences are within the
noise. What is interesting is that variability shows both gains and losses
in stark contrast to the sockperf results. On the other hand, sockets
generally show small losses albeit within the noise with more variability.
Once again, results differ depending on the workload and CPU.

fsmark was tested with zero-sized files to continually allocate slab objects
but didn't show any differences. This can be explained by the fact that the
workload is only allocating and does not have a mix of allocs/frees that would
benefit from the caching. It was tested to ensure no major harm was done.

While it is recognised that this is a mixed bag of results, the patch
helps a lot more workloads than it hurts and intuitively, avoiding the
zone->lock in some cases is a good thing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  20 +++++++++-
 mm/page_alloc.c        | 105 +++++++++++++++++++++++++++++--------------------
 2 files changed, 82 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..54032ab2f4f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,6 +255,24 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
+/*
+ * One per migratetype for order-0 pages and one per high-order up to
+ * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
+ * allocations to contaminate reclaimable pageblocks if high-order
+ * pages are heavily used.
+ */
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
+
+static inline unsigned int pindex_to_order(unsigned int pindex)
+{
+	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
+}
+
+static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
+{
+	return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
+}
+
 #define min_wmark_pages(z) (z->watermark[WMARK_MIN])
 #define low_wmark_pages(z) (z->watermark[WMARK_LOW])
 #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
@@ -265,7 +283,7 @@ struct per_cpu_pages {
 	int batch;		/* chunk size for buddy add/remove */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
-	struct list_head lists[MIGRATE_PCPTYPES];
+	struct list_head lists[NR_PCP_LISTS];
 };
 
 struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440e3ae2..91dc68c2a717 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
 }
 
 #ifdef CONFIG_DEBUG_VM
-static inline bool free_pcp_prepare(struct page *page)
+static inline bool free_pcp_prepare(struct page *page, unsigned int order)
 {
-	return free_pages_prepare(page, 0, true);
+	return free_pages_prepare(page, order, true);
 }
 
 static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
 	return false;
 }
 #else
-static bool free_pcp_prepare(struct page *page)
+static bool free_pcp_prepare(struct page *page, unsigned int order)
 {
-	return free_pages_prepare(page, 0, false);
+	return free_pages_prepare(page, order, false);
 }
 
 static bool bulkfree_pcp_prepare(struct page *page)
@@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
 static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp)
 {
-	int migratetype = 0;
-	int batch_free = 0;
+	unsigned int pindex = UINT_MAX;	/* Reclaim will start at 0 */
+	unsigned int batch_free = 0;
+	unsigned int nr_freed = 0;
 	unsigned long nr_scanned;
 	bool isolated_pageblocks;
 
@@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
-	while (count) {
+	while (count > 0) {
 		struct page *page;
 		struct list_head *list;
+		unsigned int order;
 
 		/*
 		 * Remove pages from lists in a round-robin fashion. A
 		 * batch_free count is maintained that is incremented when an
-		 * empty list is encountered.  This is so more pages are freed
-		 * off fuller lists instead of spinning excessively around empty
-		 * lists
+		 * empty list is encountered. This is not exact due to
+		 * high-order caches but precision is not required.
 		 */
 		do {
 			batch_free++;
-			if (++migratetype == MIGRATE_PCPTYPES)
-				migratetype = 0;
-			list = &pcp->lists[migratetype];
+			if (++pindex == NR_PCP_LISTS)
+				pindex = 0;
+			list = &pcp->lists[pindex];
 		} while (list_empty(list));
 
 		/* This is the only non-empty list. Free them all. */
-		if (batch_free == MIGRATE_PCPTYPES)
+		if (batch_free == NR_PCP_LISTS)
 			batch_free = count;
 
+		order = pindex_to_order(pindex);
 		do {
 			int mt;	/* migratetype of the to-be-freed page */
 
@@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (bulkfree_pcp_prepare(page))
 				continue;
 
-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
-		} while (--count && --batch_free && !list_empty(list));
+			__free_one_page(page, page_to_pfn(page), zone, order, mt);
+			trace_mm_page_pcpu_drain(page, order, mt);
+			nr_freed += (1 << order);
+			count -= (1 << order);
+		} while (count > 0 && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+	pcp->count -= nr_freed;
 }
 
 static void free_one_page(struct zone *zone,
@@ -2243,10 +2248,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	local_irq_save(flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
-	if (to_drain > 0) {
+	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-		pcp->count -= to_drain;
-	}
 	local_irq_restore(flags);
 }
 #endif
@@ -2268,10 +2271,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	pset = per_cpu_ptr(zone->pageset, cpu);
 
 	pcp = &pset->pcp;
-	if (pcp->count) {
+	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
-		pcp->count = 0;
-	}
 	local_irq_restore(flags);
 }
 
@@ -2403,18 +2404,18 @@ void mark_free_pages(struct zone *zone)
 #endif /* CONFIG_PM */
 
 /*
- * Free a 0-order page
+ * Free a pcp page
  * cold == true ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, bool cold)
+static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
-	int migratetype;
+	int migratetype, pindex;
 
-	if (!free_pcp_prepare(page))
+	if (!free_pcp_prepare(page, order))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
@@ -2431,28 +2432,33 @@ void free_hot_cold_page(struct page *page, bool cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, order, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pindex = order_to_pindex(migratetype, order);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (!cold)
-		list_add(&page->lru, &pcp->lists[migratetype]);
+		list_add(&page->lru, &pcp->lists[pindex]);
 	else
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	pcp->count++;
+		list_add_tail(&page->lru, &pcp->lists[pindex]);
+	pcp->count += 1 << order;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = READ_ONCE(pcp->batch);
 		free_pcppages_bulk(zone, batch, pcp);
-		pcp->count -= batch;
 	}
 
 out:
 	local_irq_restore(flags);
 }
 
+void free_hot_cold_page(struct page *page, bool cold)
+{
+	__free_hot_cold_page(page, cold, 0);
+}
+
 /*
  * Free a list of 0-order pages
  */
@@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 	struct page *page;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 
-	if (likely(order == 0)) {
+	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
 		local_irq_save(flags);
 		do {
+			unsigned int pindex;
+
+			pindex = order_to_pindex(migratetype, order);
 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
-			list = &pcp->lists[migratetype];
+			list = &pcp->lists[pindex];
 			if (list_empty(list)) {
-				pcp->count += rmqueue_bulk(zone, 0,
+				int nr_pages = rmqueue_bulk(zone, order,
 						pcp->batch, list,
 						migratetype, cold);
+				pcp->count += (nr_pages << order);
 				if (unlikely(list_empty(list)))
 					goto failed;
 			}
@@ -2610,7 +2620,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 				page = list_first_entry(list, struct page, lru);
 
 			list_del(&page->lru);
-			pcp->count--;
+			pcp->count -= (1 << order);
 
 		} while (check_new_pcp(page));
 	} else {
@@ -3837,8 +3847,8 @@ EXPORT_SYMBOL(get_zeroed_page);
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
-		if (order == 0)
-			free_hot_cold_page(page, false);
+		if (order <= PAGE_ALLOC_COSTLY_ORDER)
+			__free_hot_cold_page(page, false, order);
 		else
 			__free_pages_ok(page, order);
 	}
@@ -5160,20 +5170,31 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 /* a companion to pageset_set_high() */
 static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
 {
-	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
+	unsigned long high;
+
+	/*
+	 * per-cpu refills occur when a per-cpu list for a migratetype
+	 * or a high-order is depleted even if pages are free overall.
+	 * Tune the high watermark such that it's unlikely, but not
+	 * impossible, that a single refill event will trigger a
+	 * shrink on the next free to the per-cpu list.
+	 */
+	high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
+
+	pageset_update(&p->pcp, high, max(1UL, 1 * batch));
 }
 
 static void pageset_init(struct per_cpu_pageset *p)
 {
 	struct per_cpu_pages *pcp;
-	int migratetype;
+	unsigned int pindex;
 
 	memset(p, 0, sizeof(*p));
 
 	pcp = &p->pcp;
 	pcp->count = 0;
-	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
-		INIT_LIST_HEAD(&pcp->lists[migratetype]);
+	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+		INIT_LIST_HEAD(&pcp->lists[pindex]);
 }
 
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
@ 2016-11-28 11:00 ` Vlastimil Babka
  2016-11-28 11:45   ` Mel Gorman
  2016-11-30  8:55   ` Mel Gorman
  2016-11-28 15:39 ` Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Vlastimil Babka @ 2016-11-28 11:00 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Christoph Lameter, Michal Hocko, Johannes Weiner, Linux-MM, Linux-Kernel

On 11/27/2016 02:19 PM, Mel Gorman wrote:
>
> 2-socket modern machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v3
> Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
>
> 1-socket 6 year old machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v3
> Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)

That looks way much better than the "v1" RFC posting. Was it just 
because you stopped doing the "at first iteration, use migratetype as 
index", and initializing pindex UINT_MAX hits so much quicker, or was 
there something more subtle that I missed? There was no changelog 
between "v1" and "v2".

>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 11:00 ` Vlastimil Babka
@ 2016-11-28 11:45   ` Mel Gorman
  2016-11-30  8:55   ` Mel Gorman
  1 sibling, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2016-11-28 11:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> On 11/27/2016 02:19 PM, Mel Gorman wrote:
> > 
> > 2-socket modern machine
> >                                 4.9.0-rc5             4.9.0-rc5
> >                                   vanilla             hopcpu-v3
> > Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> > Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> > Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> > Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> > Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> > Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> > Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> > Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> > Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> > Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> > Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> > Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> > Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> > Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> > Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> > Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> > Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> > Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
> > 
> > 1-socket 6 year old machine
> >                                 4.9.0-rc5             4.9.0-rc5
> >                                   vanilla             hopcpu-v3
> > Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
> 
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
> 

The array is sized correctly which avoids one useless check. The order-0
lists are always drained first so in some rare cases, only the fast
paths are used. There was a subtle correction in detecting when all of
one list should be drained. In combination, it happened to boost
performance a lot on the two machines I reported on. While 6 other
machines were tested, not all of them saw such a dramatic boost and if
these machines are rebooted and retested every time, the high
performance is not always consistent; it all depends on how often the
fast paths are used.
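
To make the "order-0 lists drained first" point concrete, this is the
list-selection loop from free_pcppages_bulk() in the patch, with comments
added here for illustration:

  unsigned int pindex = UINT_MAX; /* unsigned wrap: the first ++pindex is 0 */
  ...
  do {
          batch_free++;
          /* The scan starts at pindex 0, and indices 0..MIGRATE_PCPTYPES-1
           * are the order-0 per-migratetype lists, so a small drain can be
           * satisfied entirely from them and leave the high-order caches
           * (indices MIGRATE_PCPTYPES..NR_PCP_LISTS-1) untouched.
           */
          if (++pindex == NR_PCP_LISTS)
                  pindex = 0;
          list = &pcp->lists[pindex];
  } while (list_empty(list));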

> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
  2016-11-28 11:00 ` Vlastimil Babka
@ 2016-11-28 15:39 ` Christoph Lameter
  2016-11-28 16:21   ` Mel Gorman
  2016-11-28 19:54 ` Johannes Weiner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2016-11-28 15:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Sun, 27 Nov 2016, Mel Gorman wrote:

>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications

Note that SLUB will only use high order pages when available and fall back
to order 0 if memory is fragmented. This means that the effect of this
patch is going to gradually vanish as memory becomes more and more
fragmented.

I think this patch is beneficial but we need to address long term the
issue of memory fragmentation. That is not only a SLUB issue but an
overall problem since we keep on having to maintain lists of 4k memory
blocks in various subsystems. And as memory increases these lists are
becoming larger and larger and more difficult to manage. Code complexity
increases and fragility too (look at transparent hugepages). Ultimately we
will need a clean way to manage the allocation and freeing of large
physically contiguous pages. Reserving memory at boot time (CMA, giant
pages) is some sort of solution but this all devolves into lots of knobs
that only insiders know how to tune and an overall fragile solution.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 15:39 ` Christoph Lameter
@ 2016-11-28 16:21   ` Mel Gorman
  2016-11-28 16:38     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2016-11-28 16:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, Nov 28, 2016 at 09:39:19AM -0600, Christoph Lameter wrote:
> On Sun, 27 Nov 2016, Mel Gorman wrote:
> 
> >
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
> 
> Note that SLUB will only use high order pages when available and fall back
> to order 0 if memory is fragmented. This means that the effect of this
> patch is going to gradually vanish as memory becomes more and more
> fragmented.
> 

Yes, that's a problem for SLUB with or without this patch. It's always
been the case that SLUB relying on high-order pages for performance is
problematic.

> I think this patch is beneficial but we need to address long term the
> issue of memory fragmentation. That is not only a SLUB issue but an
> overall problem since we keep on having to maintain lists of 4k memory
> blocks in variuos subsystems. And as memory increases these lists are
> becoming larger and larger and more difficult to manage. Code complexity
> increases and fragility too (look at transparent hugepages). Ultimately we
> will need a clean way to manage the allocation and freeing of large
> physically contiguous pages. Reserving memory at booting (CMA, giant
> pages) is some sort of solution but this all devolves into lots of knobs
> that only insiders know how to tune and an overall fragile solution.
> 

While I agree with all of this, it's also a problem independent of this
patch.


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 16:21   ` Mel Gorman
@ 2016-11-28 16:38     ` Christoph Lameter
  2016-11-28 18:47       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2016-11-28 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, 28 Nov 2016, Mel Gorman wrote:

> Yes, that's a problem for SLUB with or without this patch. It's always
> been the case that SLUB relying on high-order pages for performance is
> problematic.

This is a general issue in the kernel. Performance often requires larger
contiguous ranges of memory.


> > that only insiders know how to tune and an overall fragile solution.
> While I agree with all of this, it's also a problem independent of this
> patch.

It is related. The fundamental issue with fragmentation remain and IMHO we
really need to tackle this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 16:38     ` Christoph Lameter
@ 2016-11-28 18:47       ` Mel Gorman
  2016-11-28 18:54         ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2016-11-28 18:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, Nov 28, 2016 at 10:38:58AM -0600, Christoph Lameter wrote:
> > > that only insiders know how to tune and an overall fragile solution.
> > While I agree with all of this, it's also a problem independent of this
> > patch.
> 
> It is related. The fundamental issue with fragmentation remain and IMHO we
> really need to tackle this.
> 

Fragmentation is one issue. Allocation scalability is a separate issue.
This patch is about scaling parallel allocations of small contiguous
ranges. Even if there were fragmentation-related patches up for discussion,
they would not be directly affected by this patch.

If you have a series aimed at parts of the fragmentation problem or how
subsystems can avoid tracking 4K pages in some important cases then by
all means post them.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 18:47       ` Mel Gorman
@ 2016-11-28 18:54         ` Christoph Lameter
  2016-11-28 20:59           ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2016-11-28 18:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, 28 Nov 2016, Mel Gorman wrote:

> If you have a series aimed at parts of the fragmentation problem or how
> subsystems can avoid tracking 4K pages in some important cases then by
> all means post them.

I designed SLUB with defrag methods in mind. We could warm up some old
patchsets that were never merged:

https://lkml.org/lkml/2010/1/29/332

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
  2016-11-28 11:00 ` Vlastimil Babka
  2016-11-28 15:39 ` Christoph Lameter
@ 2016-11-28 19:54 ` Johannes Weiner
  2016-11-30 12:40 ` Jesper Dangaard Brouer
  2016-11-30 13:05 ` Michal Hocko
  4 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2016-11-28 19:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Linux-MM, Linux-Kernel

On Sun, Nov 27, 2016 at 01:19:54PM +0000, Mel Gorman wrote:
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

This seems like a net gain to me, and the patch looks good too.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -255,6 +255,24 @@ enum zone_watermarks {
>  	NR_WMARK
>  };
>  
> +/*
> + * One per migratetype for order-0 pages and one per high-order up to
> + * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
> + * allocations to contaminate reclaimable pageblocks if high-order
> + * pages are heavily used.

I think that should be fine. Higher order allocations rely on being
able to compact movable blocks, not on reclaim freeing contiguous
blocks, so poisoning reclaimable blocks is much less of a concern than
poisoning movable blocks. And I'm not aware of any 0 < order < COSTLY
movable allocations that would put movable blocks into an HO cache.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 18:54         ` Christoph Lameter
@ 2016-11-28 20:59           ` Vlastimil Babka
  0 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2016-11-28 20:59 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Linux-MM, Linux-Kernel

On 11/28/2016 07:54 PM, Christoph Lameter wrote:
> On Mon, 28 Nov 2016, Mel Gorman wrote:
> 
>> If you have a series aimed at parts of the fragmentation problem or how
>> subsystems can avoid tracking 4K pages in some important cases then by
>> all means post them.
> 
> I designed SLUB with defrag methods in mind. We could warm up some old
> patchsets that where never merged:
> 
> https://lkml.org/lkml/2010/1/29/332

Note that some other solutions to the dentry cache problem (perhaps of a
more low-hanging fruit kind) were also discussed at KS/LPC MM panel
session: https://lwn.net/Articles/705758/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-28 11:00 ` Vlastimil Babka
  2016-11-28 11:45   ` Mel Gorman
@ 2016-11-30  8:55   ` Mel Gorman
  1 sibling, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2016-11-30  8:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Johannes Weiner,
	Linux-MM, Linux-Kernel

On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> > 1-socket 6 year old machine
> >                                 4.9.0-rc5             4.9.0-rc5
> >                                   vanilla             hopcpu-v3
> > Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
> 
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
> 

FYI, the LKP test robot reported the following so there is some
independent basis for picking this up.

---8<---

FYI, we noticed a +23.0% improvement of netperf.Throughput_Mbps due to
commit:

commit 79404c5a5c66481aa55c0cae685e49e0f44a0479 ("mm: page_alloc: High-order per-cpu page allocator")
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pagealloc-highorder-percpu-v3r1


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
                   ` (2 preceding siblings ...)
  2016-11-28 19:54 ` Johannes Weiner
@ 2016-11-30 12:40 ` Jesper Dangaard Brouer
  2016-11-30 14:06   ` Mel Gorman
  2016-11-30 13:05 ` Michal Hocko
  4 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 12:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: brouer, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, Paolo Abeni


On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

[...]
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.

This will also help performance of NIC drivers that allocate
higher-order pages for their RX-ring queues (and chop them up for
MTU-sized frames). I do like this patch, even though I'm working on
moving drivers away from allocating these high-order pages.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

[...]
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.

I do like you are using a networking test to benchmark this. Looking at
the results, my initial response is that the improvements are basically
too good to be true.

Can you share how you tested this with netperf and the specific netperf
parameters? 
e.g.
 How do you configure the send/recv sizes?
 Have you pinned netperf and netserver on different CPUs?

For localhost testing, when netperf and netserver run on the same CPU,
you observe half the performance, very intuitively.  When pinning
netperf and netserver (via e.g. option -T 1,2) you observe the most
stable results.  When allowing netperf and netserver to migrate between
CPUs (default setting), the real fun starts and the results become unstable,
because now the CPU scheduler is also being tested, and in my experience
more "fun" memory situations also occur, as I guess we are hopping
between more per-CPU alloc caches (also affecting the SLUB per-CPU usage
pattern).

> 2-socket modern machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v3

The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 contains only
this single change, right?
Netdev/Paolo recently (in net-next) optimized the UDP code path
significantly, and I just want to make sure your results are not
affected by these changes.


> Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
> 
> 1-socket 6 year old machine
>                                 4.9.0-rc5             4.9.0-rc5
>                                   vanilla             hopcpu-v3
> Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
                   ` (3 preceding siblings ...)
  2016-11-30 12:40 ` Jesper Dangaard Brouer
@ 2016-11-30 13:05 ` Michal Hocko
  2016-11-30 14:16   ` Mel Gorman
  4 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-11-30 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel

On Sun 27-11-16 13:19:54, Mel Gorman wrote:
[...]
> @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  	struct page *page;
>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>  
> -	if (likely(order == 0)) {
> +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
>  
>  		local_irq_save(flags);
>  		do {
> +			unsigned int pindex;
> +
> +			pindex = order_to_pindex(migratetype, order);
>  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -			list = &pcp->lists[migratetype];
> +			list = &pcp->lists[pindex];
>  			if (list_empty(list)) {
> -				pcp->count += rmqueue_bulk(zone, 0,
> +				int nr_pages = rmqueue_bulk(zone, order,
>  						pcp->batch, list,
>  						migratetype, cold);
> +				pcp->count += (nr_pages << order);
>  				if (unlikely(list_empty(list)))
>  					goto failed;

just a nit, we can reorder the check and the count update because nobody
could have stolen pages allocated by rmqueue_bulk. I would also consider
nr_pages a bit misleading because we get a number or allocated elements.
Nothing to lose sleep over...

>  			}

But...  Unless I am missing something this effectively means that we do
not exercise high order atomic reserves. Shouldn't we fallback to
the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
path which I am not seeing?

Other than that the patch looks reasonable to me. Keeping some portion
of !costly pages on pcp lists sounds useful from the fragmentation
point of view as well AFAICS, because they would normally be dissolved for
order-0 requests, while right now we push more on reclaim.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 12:40 ` Jesper Dangaard Brouer
@ 2016-11-30 14:06   ` Mel Gorman
  2016-11-30 15:06     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2016-11-30 14:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel, Rick Jones, Paolo Abeni

On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
> 
> On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> [...]
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
> > 
> > o New per-cpu lists are added to cache the high-order pages. This increases
> >   the cache footprint of the per-cpu allocator and overall usage but for
> >   some workloads, this will be offset by reduced contention on zone->lock.
> 
> This will also help performance of NIC driver that allocator
> higher-order pages for their RX-ring queue (and chop it up for MTU).
> I do like this patch, even-though I'm working on moving drivers away
> from allocation these high-order pages.
> 
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 

Thanks.

> [...]
> > This is the result from netperf running UDP_STREAM on localhost. It was
> > selected on the basis that it is slab-intensive and has been the subject
> > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > testing between two physical hosts.
> 
> I do like you are using a networking test to benchmark this. Looking at
> the results, my initial response is that the improvements are basically
> too good to be true.
> 

FWIW, LKP independently measured the boost to be 23% so it's expected
there will be different results depending on exact configuration and CPU.

> Can you share how you tested this with netperf and the specific netperf
> parameters? 

The mmtests config file used is
configs/config-global-dhp__network-netperf-unbound so all details can be
extrapolated or reproduced from that.

> e.g.
>  How do you configure the send/recv sizes?

Static range of sizes specified in the config file.

>  Have you pinned netperf and netserver on different CPUs?
> 

No. While it's possible to do a pinned test which helps stability, it
also tends to be less reflective of what happens in a variety of
workloads so I took the "harder" option.

> For localhost testing, when netperf and netserver run on the same CPU,
> you observer half the performance, very intuitively.  When pinning
> netperf and netserver (via e.g. option -T 1,2) you observe the most
> stable results.  When allowing netperf and netserver to migrate between
> CPUs (default setting), the real fun starts and unstable results,
> because now the CPU scheduler is also being tested, and my experience
> is also more "fun" memory situations occurs, as I guess we are hopping
> between more per CPU alloc caches (also affecting the SLUB per CPU usage
> pattern).
> 

Yes which is another reason why I used an unbound configuration. I didn't
want to get an artificial boost from pinned server/client using the same
per-cpu caches. As a side-effect, it may mean that machines with fewer
CPUs get a greater boost as there are fewer per-cpu caches being used.

> > 2-socket modern machine
> >                                 4.9.0-rc5             4.9.0-rc5
> >                                   vanilla             hopcpu-v3
> 
> The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains
> this single change right?

Yes.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 13:05 ` Michal Hocko
@ 2016-11-30 14:16   ` Mel Gorman
  2016-11-30 14:59     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2016-11-30 14:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel

On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
> On Sun 27-11-16 13:19:54, Mel Gorman wrote:
> [...]
> > @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> >  	struct page *page;
> >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >  
> > -	if (likely(order == 0)) {
> > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> >  		struct per_cpu_pages *pcp;
> >  		struct list_head *list;
> >  
> >  		local_irq_save(flags);
> >  		do {
> > +			unsigned int pindex;
> > +
> > +			pindex = order_to_pindex(migratetype, order);
> >  			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -			list = &pcp->lists[migratetype];
> > +			list = &pcp->lists[pindex];
> >  			if (list_empty(list)) {
> > -				pcp->count += rmqueue_bulk(zone, 0,
> > +				int nr_pages = rmqueue_bulk(zone, order,
> >  						pcp->batch, list,
> >  						migratetype, cold);
> > +				pcp->count += (nr_pages << order);
> >  				if (unlikely(list_empty(list)))
> >  					goto failed;
> 
> just a nit, we can reorder the check and the count update because nobody
> could have stolen pages allocated by rmqueue_bulk.

Ok, it's minor but I can do that.

> I would also consider
> nr_pages a bit misleading because we get a number or allocated elements.
> Nothing to lose sleep over...
> 

I didn't think of a clearer name because in this sort of context, I consider
a high-order page to be a single page.

> >  			}
> 
> But...  Unless I am missing something this effectively means that we do
> not exercise high order atomic reserves. Shouldn't we fallback to
> the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> path which I am not seeing?
> 

Good spot, would this be acceptable to you?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91dc68c2a717..94808f565f74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 				int nr_pages = rmqueue_bulk(zone, order,
 						pcp->batch, list,
 						migratetype, cold);
-				pcp->count += (nr_pages << order);
-				if (unlikely(list_empty(list)))
+				if (unlikely(list_empty(list))) {
+					/*
+					 * Retry high-order atomic allocs
+					 * from the buddy list which may
+					 * use MIGRATE_HIGHATOMIC.
+					 */
+					if (order && (alloc_flags & ALLOC_HARDER))
+						goto try_buddylist;
+
 					goto failed;
+				}
+				pcp->count += (nr_pages << order);
 			}
 
 			if (cold)
@@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 
 		} while (check_new_pcp(page));
 	} else {
+try_buddylist:
 		/*
 		 * We most definitely don't want callers attempting to
 		 * allocate greater than order-1 page units with __GFP_NOFAIL.
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 14:16   ` Mel Gorman
@ 2016-11-30 14:59     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-11-30 14:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel

On Wed 30-11-16 14:16:13, Mel Gorman wrote:
> On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
[...]
> > But...  Unless I am missing something this effectively means that we do
> > not exercise high order atomic reserves. Shouldn't we fall back to
> > the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> > order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> > path which I am not seeing?
> > 
> 
> Good spot, would this be acceptable to you?

It's not a beauty queen, but it works. A more elegant solution would
require more surgery, I guess, which is probably not worth it at this
stage.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91dc68c2a717..94808f565f74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  				int nr_pages = rmqueue_bulk(zone, order,
>  						pcp->batch, list,
>  						migratetype, cold);
> -				pcp->count += (nr_pages << order);
> -				if (unlikely(list_empty(list)))
> +				if (unlikely(list_empty(list))) {
> +					/*
> +					 * Retry high-order atomic allocs
> +					 * from the buddy list which may
> +					 * use MIGRATE_HIGHATOMIC.
> +					 */
> +					if (order && (alloc_flags & ALLOC_HARDER))
> +						goto try_buddylist;
> +
>  					goto failed;
> +				}
> +				pcp->count += (nr_pages << order);
>  			}
>  
>  			if (cold)
> @@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  
>  		} while (check_new_pcp(page));
>  	} else {
> +try_buddylist:
>  		/*
>  		 * We most definitely don't want callers attempting to
>  		 * allocate greater than order-1 page units with __GFP_NOFAIL.
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 14:06   ` Mel Gorman
@ 2016-11-30 15:06     ` Jesper Dangaard Brouer
  2016-11-30 16:35       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 15:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel, Rick Jones, Paolo Abeni,
	brouer

On Wed, 30 Nov 2016 14:06:15 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > [...]  
> > > SLUB has been the default small kernel object allocator for quite some time
> > > but it is not universally used due to performance concerns and a reliance
> > > on high-order pages. The high-order concerns has two major components --
> > > high-order pages are not always available and high-order page allocations
> > > potentially contend on the zone->lock. This patch addresses some concerns
> > > about the zone lock contention by extending the per-cpu page allocator to
> > > cache high-order pages. The patch makes the following modifications
> > > 
> > > o New per-cpu lists are added to cache the high-order pages. This increases
> > >   the cache footprint of the per-cpu allocator and overall usage but for
> > >   some workloads, this will be offset by reduced contention on zone->lock.  
> > 
> > This will also help performance of NIC drivers that allocate
> > higher-order pages for their RX-ring queue (and chop them up for MTU).
> > I do like this patch, even though I'm working on moving drivers away
> > from allocating these high-order pages.
> > 
> > Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >   
> 
> Thanks.
> 
> > [...]  
> > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > selected on the basis that it is slab-intensive and has been the subject
> > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > testing between two physical hosts.  
> > 
> > I do like that you are using a networking test to benchmark this. Looking at
> > the results, my initial response is that the improvements are basically
> > too good to be true.
> >   
> 
> FWIW, LKP independently measured the boost to be 23% so it's expected
> there will be different results depending on exact configuration and CPU.

Yes, I noticed that, nice (that was an SCTP test):
 https://lists.01.org/pipermail/lkp/2016-November/005210.html

It is of course great. It is just strange that I cannot reproduce it on
my high-end box with manual testing. I'll try your test suite and try to
figure out what is wrong with my setup.


> > Can you share how you tested this with netperf and the specific netperf
> > parameters?   
> 
> The mmtests config file used is
> configs/config-global-dhp__network-netperf-unbound so all details can be
> extrapolated or reproduced from that.

I didn't know of mmtests: https://github.com/gormanm/mmtests

It looks nice and quite comprehensive! :-)


> > e.g.
> >  How do you configure the send/recv sizes?  
> 
> Static range of sizes specified in the config file.

I'll figure it out... reading your shell code :-)

export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
 https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72

I see you are using netperf 2.4.5 and setting both the send and recv
size (-- -m and -M), which is fine.

I don't quite get why you are setting the socket recv size (with -- -s
and -S) to such a small number, size + 256.

 SOCKETSIZE_OPT="-s $((SIZE+256)) -S $((SIZE+256))"

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 320 -S 320 -m 64 -M 64 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 384 -S 384 -m 128 -M 128 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
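
To make the derivation explicit, a small illustrative program (not part of
mmtests) that reproduces how the quoted SOCKETSIZE_OPT line appears to
build the -s/-S/-m/-M options from NETPERF_BUFFER_SIZES:

#include <stdio.h>

int main(void)
{
	/* NETPERF_BUFFER_SIZES from the config file quoted above */
	static const int sizes[] = { 64, 128, 256, 1024, 2048, 3312, 4096, 8192, 16384 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("netperf ... -- -s %d -S %d -m %d -M %d\n",
		       sizes[i] + 256, sizes[i] + 256, sizes[i], sizes[i]);
	return 0;
}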
 
> >  Have you pinned netperf and netserver on different CPUs?
> >   
> 
> No. While it's possible to do a pinned test, which helps stability, it
> also tends to be less reflective of what happens in a variety of
> workloads, so I took the "harder" option.

Agree.
 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 15:06     ` Jesper Dangaard Brouer
@ 2016-11-30 16:35       ` Mel Gorman
  2016-12-01 17:34         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2016-11-30 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel, Rick Jones, Paolo Abeni

On Wed, Nov 30, 2016 at 04:06:12PM +0100, Jesper Dangaard Brouer wrote:
> > > [...]  
> > > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > > selected on the basis that it is slab-intensive and has been the subject
> > > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > > testing between two physical hosts.  
> > > 
> > > I do like that you are using a networking test to benchmark this. Looking at
> > > the results, my initial response is that the improvements are basically
> > > too good to be true.
> > >   
> > 
> > FWIW, LKP independently measured the boost to be 23% so it's expected
> > there will be different results depending on exact configuration and CPU.
> 
> Yes, I noticed that, nice (that was an SCTP test):
>  https://lists.01.org/pipermail/lkp/2016-November/005210.html
> 
> It is of course great. It is just strange that I cannot reproduce it on
> my high-end box with manual testing. I'll try your test suite and try to
> figure out what is wrong with my setup.
> 

That would be great. I had seen the boost on multiple machines and LKP
verifying it is helpful. 

> 
> > > Can you share how you tested this with netperf and the specific netperf
> > > parameters?   
> > 
> > The mmtests config file used is
> > configs/config-global-dhp__network-netperf-unbound so all details can be
> > extrapolated or reproduced from that.
> 
> I didn't know of mmtests: https://github.com/gormanm/mmtests
> 
> It looks nice and quite comprehensive! :-)
> 

Thanks.

> > > e.g.
> > >  How do you configure the send/recv sizes?  
> > 
> > Static range of sizes specified in the config file.
> 
> I'll figure it out... reading your shell code :-)
> 
> export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
>  https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72
> 
> I see you are using netperf 2.4.5 and setting both the send and recv
> size (-- -m and -M), which is fine.
> 

Ok.

> I don't quite get why you are setting the socket recv size (with -- -s
> and -S) to such a small number, size + 256.
> 

Maybe I missed something at the time I wrote that but why would it need
to be larger?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-11-30 16:35       ` Mel Gorman
@ 2016-12-01 17:34         ` Jesper Dangaard Brouer
  2016-12-01 22:17           ` Paolo Abeni
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Linux-MM, Linux-Kernel, Rick Jones, Paolo Abeni,
	brouer, netdev, Hannes Frederic Sowa

(Cc. netdev, we might have an issue with Paolo's UDP accounting and
small socket queues)

On Wed, 30 Nov 2016 16:35:20 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> > I don't quite get why you are setting the socket recv size
> > (with -- -s and -S) to such a small number, size + 256.
> >   
> 
> Maybe I missed something at the time I wrote that but why would it
> need to be larger?

Well, to me it is quite obvious that we need some queue to avoid packet
drops.  We have two processes, netperf and netserver, that are sending
packets to each other (UDP_STREAM, mostly netperf -> netserver).
These PIDs are getting scheduled and migrated between CPUs, and thus
do not get executed equally fast, so a queue is needed to absorb the
fluctuations.

The network stack even partly catches your config "mistake" and
increases the socket queue size, so we can handle at least one max frame
(due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
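
As a rough, self-contained illustration of that catch: when SO_RCVBUF is
set, the kernel doubles the requested value and enforces a minimum so at
least one skb fits. The doubling is real behaviour; the minimum used below
is only an assumed placeholder, as the actual SOCK_MIN_RCVBUF constant
varies by kernel version:

#include <stdio.h>

/* Assumed floor for illustration only; not the kernel's exact constant. */
#define ASSUMED_SOCK_MIN_RCVBUF	2304

/* Mimics the effective value after setsockopt(SO_RCVBUF). */
static unsigned int effective_rcvbuf(unsigned int requested)
{
	unsigned int val = requested * 2;

	return val > ASSUMED_SOCK_MIN_RCVBUF ? val : ASSUMED_SOCK_MIN_RCVBUF;
}

int main(void)
{
	/* mmtests passes -S 1280 for the 1024-byte message size */
	printf("requested 1280 -> effective %u\n", effective_rcvbuf(1280));
	return 0;
}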

For localhost testing, a small queue should hopefully not result in
packet drops.  Testing... oops, it does result in packet drops.

Test command extracted from mmtests, UDP_STREAM size 1024:

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
  port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages                
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00     50024301      0    6829.98
   2560           60.00     46133211           6298.72

 Dropped packets: 50024301-46133211=3891090

To get a better drop indication, I run the following command during the
test to get system-wide network counters for the last second, so the
numbers below are per second.

 $ nstat > /dev/null && sleep 1  && nstat
 #kernel
 IpInReceives                    885162             0.0
 IpInDelivers                    885161             0.0
 IpOutRequests                   885162             0.0
 UdpInDatagrams                  776105             0.0
 UdpInErrors                     109056             0.0
 UdpOutDatagrams                 885160             0.0
 UdpRcvbufErrors                 109056             0.0
 IpExtInOctets                   931190476          0.0
 IpExtOutOctets                  931189564          0.0
 IpExtInNoECTPkts                885162             0.0

So, 885Kpps are sent but only 776Kpps delivered, and 109Kpps dropped. Note
that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
happens kernel side in __udp_queue_rcv_skb[1], because the receiving
process didn't empty its queue fast enough, see [2].
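
To illustrate why the drops pile up, a hedged, simplified model of the
receive-queue check referenced in [2]. The truesize value is an assumption
for a 1024-byte datagram over loopback; the real code in net/core/sock.c
is more involved:

#include <stdbool.h>
#include <stdio.h>

/* Simplified: a new skb is rejected when already-queued memory plus its
 * truesize would reach the socket's receive buffer limit. */
static bool would_drop(unsigned int rmem_alloc, unsigned int truesize,
		       unsigned int rcvbuf)
{
	return rmem_alloc + truesize >= rcvbuf;
}

int main(void)
{
	unsigned int truesize = 2304;	/* assumed for a 1024-byte datagram */
	unsigned int rcvbuf = 2560;	/* what -S 1280 ends up as */

	printf("first skb dropped:  %d\n", would_drop(0, truesize, rcvbuf));
	printf("second skb dropped: %d\n",
	       would_drop(truesize, truesize, rcvbuf));
	return 0;
}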

Although upstream changes are coming in this area ([2] is replaced with
__udp_enqueue_schedule_skb), which is what I actually tested with... hmm.

Retesting with kernel 4.7.0-baseline+ ... shows something else.
Paolo, you might want to look into this.  It could also explain why
I've not seen the mentioned speedup from the mm change, as I've been testing
this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895
  AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages                
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00     47248301      0    6450.97
   2560           60.00     47245030           6450.52

Only dropped 47248301-47245030=3271

$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    810566             0.0
IpInDelivers                    810566             0.0
IpOutRequests                   810566             0.0
UdpInDatagrams                  810468             0.0
UdpInErrors                     99                 0.0
UdpOutDatagrams                 810566             0.0
UdpRcvbufErrors                 99                 0.0
IpExtInOctets                   852713328          0.0
IpExtOutOctets                  852713328          0.0
IpExtInNoECTPkts                810563             0.0

And nstat also looks much better, with only 99 drops/sec.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.8#L1454
[2] http://lxr.free-electrons.com/source/net/core/sock.c?v=4.8#L413


Extra: with net-next at 93ba2222550

If I use the netperf default socket queue size, then there is not a single
packet drop:

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1  
   -- -m 1024 -M 1024 -P 15895

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) 
 port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1024   60.00     48485642      0    6619.91
212992           60.00     48485642           6619.91


$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    821723             0.0
IpInDelivers                    821722             0.0
IpOutRequests                   821723             0.0
UdpInDatagrams                  821722             0.0
UdpOutDatagrams                 821722             0.0
IpExtInOctets                   864457856          0.0
IpExtOutOctets                  864458908          0.0
IpExtInNoECTPkts                821729             0.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-12-01 17:34         ` Jesper Dangaard Brouer
@ 2016-12-01 22:17           ` Paolo Abeni
  2016-12-02 15:37             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Abeni @ 2016-12-01 22:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev, Hannes Frederic Sowa

On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
> 
> On Wed, 30 Nov 2016 16:35:20 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >   
> > 
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
> 
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops.  We have two processes, netperf and netserver, that are sending
> packets to each other (UDP_STREAM, mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> do not get executed equally fast, so a queue is needed to absorb the
> fluctuations.
> 
> The network stack even partly catches your config "mistake" and
> increases the socket queue size, so we can handle at least one max frame
> (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> 
> For localhost testing, a small queue should hopefully not result in
> packet drops.  Testing... oops, it does result in packet drops.
> 
> Test command extracted from mmtests, UDP_STREAM size 1024:
> 
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
>    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> 
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
>   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages                
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>    4608    1024   60.00     50024301      0    6829.98
>    2560           60.00     46133211           6298.72
> 
>  Dropped packets: 50024301-46133211=3891090
> 
> To get a better drop indication, I run the following command during the
> test to get system-wide network counters for the last second, so the
> numbers below are per second.
> 
>  $ nstat > /dev/null && sleep 1  && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
> 
> So, 885Kpps are sent but only 776Kpps delivered, and 109Kpps dropped. Note
> that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> process didn't empty its queue fast enough, see [2].
> 
> Although upstream changes are coming in this area ([2] is replaced with
> __udp_enqueue_schedule_skb), which is what I actually tested with... hmm.
> 
> Retesting with kernel 4.7.0-baseline+ ... shows something else.
> Paolo, you might want to look into this.  It could also explain why
> I've not seen the mentioned speedup from the mm change, as I've been testing
> this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that the commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict when checking the rcvbuf.

For very small values of rcvbuf, it allows only a single skb to be enqueued,
while previously we allowed 2 of them to enter the queue, even if the
first one's truesize exceeded rcvbuf, as in your test case.

Can you please try the following patch?

Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
+	int limit;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem && (rmem + size > sk->sk_rcvbuf))
+	limit = size + sk->sk_rcvbuf;
+	if (rmem > limit)
 		goto drop;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
 	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+	if (rmem > limit)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);
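
For clarity, a hedged worked example of what the two checks above accept,
using an assumed truesize of 2304 bytes for a 1024-byte datagram and the
2560-byte rcvbuf from the test; the helper names are mine, not the
kernel's:

#include <stdbool.h>
#include <stdio.h>

/* Check removed by the patch above (too strict for tiny rcvbuf). */
static bool strict_drop(int rmem, int size, int rcvbuf)
{
	return rmem && (rmem + size > rcvbuf);
}

/* Check added by the patch: always leave room for one extra skb. */
static bool fixed_drop(int rmem, int size, int rcvbuf)
{
	return rmem > size + rcvbuf;
}

int main(void)
{
	int size = 2304;	/* assumed truesize of a 1024-byte datagram */
	int rcvbuf = 2560;	/* effective value of -S 1280 */

	/* one skb already queued: the strict check drops the second skb,
	 * the fixed check still accepts it */
	printf("strict: drop=%d  fixed: drop=%d\n",
	       strict_drop(size, size, rcvbuf),
	       fixed_drop(size, size, rcvbuf));
	return 0;
}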

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-12-01 22:17           ` Paolo Abeni
@ 2016-12-02 15:37             ` Jesper Dangaard Brouer
  2016-12-02 15:44               ` Paolo Abeni
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-02 15:37 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev, Hannes Frederic Sowa, brouer

On Thu, 01 Dec 2016 23:17:48 +0100
Paolo Abeni <pabeni@redhat.com> wrote:

> On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > small socket queues)
> > 
> > On Wed, 30 Nov 2016 16:35:20 +0000
> > Mel Gorman <mgorman@techsingularity.net> wrote:
> >   
> > > > I don't quite get why you are setting the socket recv size
> > > > (with -- -s and -S) to such a small number, size + 256.
> > > >     
> > > 
> > > Maybe I missed something at the time I wrote that but why would it
> > > need to be larger?  
> > 
> > Well, to me it is quite obvious that we need some queue to avoid packet
> > drops.  We have two processes, netperf and netserver, that are sending
> > packets to each other (UDP_STREAM, mostly netperf -> netserver).
> > These PIDs are getting scheduled and migrated between CPUs, and thus
> > do not get executed equally fast, so a queue is needed to absorb the
> > fluctuations.
> > 
> > The network stack even partly catches your config "mistake" and
> > increases the socket queue size, so we can handle at least one max frame
> > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > 
> > For localhost testing, a small queue should hopefully not result in
> > packet drops.  Testing... oops, it does result in packet drops.
> > 
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> > 
> >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
> >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > 
> >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> >  Socket  Message  Elapsed      Messages                
> >  Size    Size     Time         Okay Errors   Throughput
> >  bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >    4608    1024   60.00     50024301      0    6829.98
> >    2560           60.00     46133211           6298.72
> > 
> >  Dropped packets: 50024301-46133211=3891090
> > 
> > To get a better drop indication, I run the following command during the
> > test to get system-wide network counters for the last second, so the
> > numbers below are per second.
> > 
> >  $ nstat > /dev/null && sleep 1  && nstat
> >  #kernel
> >  IpInReceives                    885162             0.0
> >  IpInDelivers                    885161             0.0
> >  IpOutRequests                   885162             0.0
> >  UdpInDatagrams                  776105             0.0
> >  UdpInErrors                     109056             0.0
> >  UdpOutDatagrams                 885160             0.0
> >  UdpRcvbufErrors                 109056             0.0
> >  IpExtInOctets                   931190476          0.0
> >  IpExtOutOctets                  931189564          0.0
> >  IpExtInNoECTPkts                885162             0.0
> > 
> > So, 885Kpps are sent but only 776Kpps delivered, and 109Kpps dropped. Note
> > that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > process didn't empty its queue fast enough, see [2].
> > 
> > Although upstream changes are coming in this area ([2] is replaced with
> > __udp_enqueue_schedule_skb), which is what I actually tested with... hmm.
> > 
> > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > Paolo, you might want to look into this.  It could also explain why
> > I've not seen the mentioned speedup from the mm change, as I've been testing
> > this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.
> 
> Thank you for reporting this.
> 
> It seems that the commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict when checking the rcvbuf.
> 
> For very small values of rcvbuf, it allows only a single skb to be enqueued,
> while previously we allowed 2 of them to enter the queue, even if the
> first one's truesize exceeded rcvbuf, as in your test case.
> 
> Can you please try the following patch?

Sure, it looks much better with this patch.


$ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  4608    1024   60.00     50191555      0    6852.82
  2560           60.00     50189872           6852.59

Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec

$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    885417             0.0
IpInDelivers                    885416             0.0
IpOutRequests                   885417             0.0
UdpInDatagrams                  885382             0.0
UdpInErrors                     29                 0.0
UdpOutDatagrams                 885410             0.0
UdpRcvbufErrors                 29                 0.0
IpExtInOctets                   931534428          0.0
IpExtOutOctets                  931533376          0.0
IpExtInNoECTPkts                885488             0.0

 
> Thank you,
> 
> Paolo
> ---
>  net/ipv4/udp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index e1d0bf8..2f5dc92 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	struct sk_buff_head *list = &sk->sk_receive_queue;
>  	int rmem, delta, amt, err = -ENOMEM;
>  	int size = skb->truesize;
> +	int limit;
>  
>  	/* try to avoid the costly atomic add/sub pair when the receive
>  	 * queue is full; always allow at least a packet
>  	 */
>  	rmem = atomic_read(&sk->sk_rmem_alloc);
> -	if (rmem && (rmem + size > sk->sk_rcvbuf))
> +	limit = size + sk->sk_rcvbuf;
> +	if (rmem > limit)
>  		goto drop;
>  
>  	/* we drop only if the receive buf is full and the receive
>  	 * queue contains some other skb
>  	 */
>  	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> -	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
> +	if (rmem > limit)
>  		goto uncharge_drop;
>  
>  	spin_lock(&list->lock);
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
  2016-12-02 15:37             ` Jesper Dangaard Brouer
@ 2016-12-02 15:44               ` Paolo Abeni
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Abeni @ 2016-12-02 15:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev, Hannes Frederic Sowa

On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 23:17:48 +0100
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > > small socket queues)
> > > 
> > > On Wed, 30 Nov 2016 16:35:20 +0000
> > > Mel Gorman <mgorman@techsingularity.net> wrote:
> > >   
> > > > > I don't quite get why you are setting the socket recv size
> > > > > (with -- -s and -S) to such a small number, size + 256.
> > > > >     
> > > > 
> > > > Maybe I missed something at the time I wrote that but why would it
> > > > need to be larger?  
> > > 
> > > Well, to me it is quite obvious that we need some queue to avoid packet
> > > drops.  We have two processes, netperf and netserver, that are sending
> > > packets to each other (UDP_STREAM, mostly netperf -> netserver).
> > > These PIDs are getting scheduled and migrated between CPUs, and thus
> > > do not get executed equally fast, so a queue is needed to absorb the
> > > fluctuations.
> > > 
> > > The network stack even partly catches your config "mistake" and
> > > increases the socket queue size, so we can handle at least one max frame
> > > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > > 
> > > For localhost testing, a small queue should hopefully not result in
> > > packet drops.  Testing... oops, it does result in packet drops.
> > > 
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > > 
> > >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
> > >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > > 
> > >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > >  Socket  Message  Elapsed      Messages                
> > >  Size    Size     Time         Okay Errors   Throughput
> > >  bytes   bytes    secs            #      #   10^6bits/sec
> > > 
> > >    4608    1024   60.00     50024301      0    6829.98
> > >    2560           60.00     46133211           6298.72
> > > 
> > >  Dropped packets: 50024301-46133211=3891090
> > > 
> > > To get a better drop indication, I run the following command during the
> > > test to get system-wide network counters for the last second, so the
> > > numbers below are per second.
> > > 
> > >  $ nstat > /dev/null && sleep 1  && nstat
> > >  #kernel
> > >  IpInReceives                    885162             0.0
> > >  IpInDelivers                    885161             0.0
> > >  IpOutRequests                   885162             0.0
> > >  UdpInDatagrams                  776105             0.0
> > >  UdpInErrors                     109056             0.0
> > >  UdpOutDatagrams                 885160             0.0
> > >  UdpRcvbufErrors                 109056             0.0
> > >  IpExtInOctets                   931190476          0.0
> > >  IpExtOutOctets                  931189564          0.0
> > >  IpExtInNoECTPkts                885162             0.0
> > > 
> > > So, 885Kpps are sent but only 776Kpps delivered, and 109Kpps dropped. Note
> > > that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). This drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > > process didn't empty its queue fast enough, see [2].
> > > 
> > > Although upstream changes are coming in this area ([2] is replaced with
> > > __udp_enqueue_schedule_skb), which is what I actually tested with... hmm.
> > > 
> > > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > > Paolo, you might want to look into this.  It could also explain why
> > > I've not seen the mentioned speedup from the mm change, as I've been testing
> > > this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.
> > 
> > Thank you for reporting this.
> > 
> > It seems that the commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict when checking the rcvbuf.
> > 
> > For very small values of rcvbuf, it allows only a single skb to be enqueued,
> > while previously we allowed 2 of them to enter the queue, even if the
> > first one's truesize exceeded rcvbuf, as in your test case.
> > 
> > Can you please try the following patch?
> 
> Sure, it looks much better with this patch.

Thank you for testing. I'll send a formal patch to David soon.

BTW I see a nice performance improvement compared to 4.7...

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-12-02 15:44 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-27 13:19 [PATCH] mm: page_alloc: High-order per-cpu page allocator v3 Mel Gorman
2016-11-28 11:00 ` Vlastimil Babka
2016-11-28 11:45   ` Mel Gorman
2016-11-30  8:55   ` Mel Gorman
2016-11-28 15:39 ` Christoph Lameter
2016-11-28 16:21   ` Mel Gorman
2016-11-28 16:38     ` Christoph Lameter
2016-11-28 18:47       ` Mel Gorman
2016-11-28 18:54         ` Christoph Lameter
2016-11-28 20:59           ` Vlastimil Babka
2016-11-28 19:54 ` Johannes Weiner
2016-11-30 12:40 ` Jesper Dangaard Brouer
2016-11-30 14:06   ` Mel Gorman
2016-11-30 15:06     ` Jesper Dangaard Brouer
2016-11-30 16:35       ` Mel Gorman
2016-12-01 17:34         ` Jesper Dangaard Brouer
2016-12-01 22:17           ` Paolo Abeni
2016-12-02 15:37             ` Jesper Dangaard Brouer
2016-12-02 15:44               ` Paolo Abeni
2016-11-30 13:05 ` Michal Hocko
2016-11-30 14:16   ` Mel Gorman
2016-11-30 14:59     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).