* [PATCH 0/4] Improve sequential read throughput
@ 2014-06-18  8:23 Mel Gorman
  2014-06-18  8:23 ` [PATCH 1/4] cfq: Increase default value of target_latency Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Mel Gorman @ 2014-06-18  8:23 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Jan Kara, Johannes Weiner, Jens Axboe, Mel Gorman

IO performance since 3.0 has been a mixed bag. In many respects we are
better and in some we are worse, and one of the places we are worse is
sequential read performance, particularly for higher numbers of threads.
This is visible in a number of benchmarks, but tiobench is the one I have
looked at most closely despite its age.

                                      3.16.0-rc1            3.16.0-rc1                 3.0.0
                                         vanilla          patch-series               vanilla
Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      133.84 (  9.81%)      134.59 ( 10.42%)
Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      115.01 ( 12.77%)      122.59 ( 20.20%)
Mean   SeqRead-MB/sec-4          97.42 (  0.00%)      108.40 ( 11.27%)      114.78 ( 17.82%)
Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       97.50 ( 16.92%)      100.14 ( 20.09%)
Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       82.14 ( 19.22%)       81.64 ( 18.50%)

The impact on the other operations is negligible. Note that 3.0-vanilla is
still far better, but bringing the patch series further into line would mean
raising the CFQ target latency even higher and there should be better options
than that. This series is at least a major improvement on 3.16-rc1-vanilla,
so it is worth sending out to a larger audience for comment.

 block/cfq-iosched.c            |   2 +-
 include/linux/mmzone.h         |   9 +++
 include/linux/writeback.h      |   1 +
 include/trace/events/pagemap.h |  16 ++--
 mm/internal.h                  |   1 +
 mm/mm_init.c                   |   5 +-
 mm/page-writeback.c            |  15 ++--
 mm/page_alloc.c                | 176 ++++++++++++++++++++++++++---------------
 mm/swap.c                      |   4 +-
 9 files changed, 144 insertions(+), 85 deletions(-)

-- 
1.8.4.5



* [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-18  8:23 [PATCH 0/4] Improve sequential read throughput Mel Gorman
@ 2014-06-18  8:23 ` Mel Gorman
  2014-06-19 18:38   ` Jeff Moyer
  2014-06-18  8:23 ` [PATCH 2/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2014-06-18  8:23 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Jan Kara, Johannes Weiner, Jens Axboe, Mel Gorman

The existing CFQ default target_latency results in very poor performance
for larger numbers of threads doing sequential reads.  While this can be
easily described as a tuning problem for users, it is one that is tricky
to detect. This patch increases the default on the assumption that people
with access to expensive fast storage also know how to tune their IO scheduler.
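
For illustration only, here is a rough sketch (my own simplified arithmetic
and naming, not the actual CFQ code) of why the target latency matters once
several sequential readers are active: CFQ scales each sync queue's time
slice down so that all busy queues can be serviced within target_latency,
so more readers mean shorter slices and more seeking between files.

unsigned int approx_sync_slice(unsigned int target_latency_ms,
			       unsigned int nr_busy_sync_queues,
			       unsigned int base_slice_ms)
{
	/* with few queues, every reader keeps the full base slice */
	if (nr_busy_sync_queues * base_slice_ms <= target_latency_ms)
		return base_slice_ms;

	/* otherwise slices shrink so one round fits in target_latency */
	return target_latency_ms / nr_busy_sync_queues;
}

With a 100ms base slice, approx_sync_slice(300, 16, 100) gives ~18ms per
reader with the old default versus ~37ms with 600ms, so each stream gets to
transfer roughly twice as much data per seek.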

The following is from tiobench run on a mid-range desktop with a single
spinning disk.

                                      3.16.0-rc1            3.16.0-rc1                 3.0.0
                                         vanilla          cfq600                     vanilla
Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)

As expected, performance increases for larger numbers of threads although
it still falls far short of 3.0-vanilla. A concern with a patch like this is
that it would hurt IO latencies, but the iostat figures still look reasonable:

                  3.16.0-rc1  3.16.0-rc1       3.0.0
                     vanilla   cfq600        vanilla
Mean sda-avgqz        912.29      939.89     1000.70
Mean sda-await       4268.03     4403.99     4887.67
Mean sda-r_await       79.42       80.33      108.53
Mean sda-w_await    13073.49    11038.81    11599.83
Max  sda-avgqz       2194.84     2215.01     2626.78
Max  sda-await      18157.88    17586.08    24971.00
Max  sda-r_await      888.40      874.22     5308.00
Max  sda-w_await   212563.59   190265.33   177698.47

Average read waiting times are barely changed and still short of the
3.0-vanilla result. The worst-case read wait times are also acceptable
and far better than 3.0.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 block/cfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cadc378..34b9d8b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -32,7 +32,7 @@ static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 static int cfq_group_idle = HZ / 125;
-static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static const int cfq_target_latency = HZ * 6/10; /* 600 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
-- 
1.8.4.5



* [PATCH 2/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-06-18  8:23 [PATCH 0/4] Improve sequential read throughput Mel Gorman
  2014-06-18  8:23 ` [PATCH 1/4] cfq: Increase default value of target_latency Mel Gorman
@ 2014-06-18  8:23 ` Mel Gorman
  2014-06-18  8:23 ` [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired Mel Gorman
  2014-06-18  8:23 ` [PATCH 4/4] mm: page_alloc: Reduce cost of dirty zone balancing Mel Gorman
  3 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2014-06-18  8:23 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Jan Kara, Johannes Weiner, Jens Axboe, Mel Gorman

The LRU insertion and activate tracepoints take the PFN as a parameter,
forcing the cost of deriving it onto the caller. Move that work into the
tracepoint fast-assign method so the cost is only incurred when the
tracepoint is active.
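
As a userspace analogue (hypothetical names, not the real tracepoint
machinery), the change amounts to moving the derivation of the traced fields
from the call site into the path that only runs when tracing is enabled:

#include <stdbool.h>
#include <stdio.h>

struct page { unsigned long pfn; };

static bool trace_enabled;	/* stands in for the tracepoint static key */

/* pretend this is the costly derivation (page_to_pfn in the kernel) */
static unsigned long page_to_pfn(const struct page *page)
{
	return page->pfn;
}

/* before: the caller always pays for page_to_pfn(), even with tracing off */
static void trace_lru_activate_old(unsigned long pfn)
{
	if (trace_enabled)
		printf("mm_lru_activate pfn=%lu\n", pfn);
}

/* after: the derivation only happens inside the guarded path */
static void trace_lru_activate_new(const struct page *page)
{
	if (trace_enabled)
		printf("mm_lru_activate pfn=%lu\n", page_to_pfn(page));
}

int main(void)
{
	struct page p = { .pfn = 42 };

	trace_lru_activate_old(page_to_pfn(&p));	/* cost paid unconditionally */
	trace_lru_activate_new(&p);			/* cost skipped while disabled */
	return 0;
}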

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5



* [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired
  2014-06-18  8:23 [PATCH 0/4] Improve sequential read throughput Mel Gorman
  2014-06-18  8:23 ` [PATCH 1/4] cfq: Increase default value of target_latency Mel Gorman
  2014-06-18  8:23 ` [PATCH 2/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
@ 2014-06-18  8:23 ` Mel Gorman
  2014-06-18 20:01   ` Johannes Weiner
  2014-06-18  8:23 ` [PATCH 4/4] mm: page_alloc: Reduce cost of dirty zone balancing Mel Gorman
  3 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2014-06-18  8:23 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Jan Kara, Johannes Weiner, Jens Axboe, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
is made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does not
hold when zones reach their low watermark, as the batches will be consumed
at uneven rates. Allocation failures due to watermark depletion result in
additional zonelist scans for the reset and another watermark check before
hitting the slowpath. On large NUMA machines, the scanning overhead is higher
because zones that are ineligible for the fair zone allocation policy are
still scanned.

This patch makes a number of changes which are all related to each other.
First and foremost, the patch resets the fair zone policy counts only when
all the counters are depleted, avoids scanning remote nodes unnecessarily
and reduces the frequency with which resets are required. Second, when a
zone's fair zone batch counter is expired, the zone is flagged, which has a
lighter cache footprint than accessing the counters. Lastly, if the local
node has only one zone then the fair zone allocation policy is not applied
at all, to reduce overall overhead.
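
To make the intended behaviour easier to follow, this is a simplified sketch
of the reworked fair pass (toy userspace code with invented structures, not
the kernel implementation): the fast path only tests a per-zone flag, and
the batch counters for the whole local node are refilled in one go once
every eligible zone has been flagged as depleted.

#include <stdbool.h>

#define NR_LOCAL_ZONES 2

struct toy_zone {
	long alloc_batch;	/* stands in for NR_ALLOC_BATCH */
	bool fair_depleted;	/* stands in for ZONE_FAIR_DEPLETED */
	bool fair_enabled;	/* stands in for zoneref->fair_enabled */
};

static void charge_alloc_batch(struct toy_zone *zone, int nr_pages)
{
	zone->alloc_batch -= nr_pages;
	if (zone->alloc_batch <= 0)
		zone->fair_depleted = true;	/* flag is cheaper to test later */
}

static void reset_alloc_batches(struct toy_zone *zones, long batch)
{
	for (int i = 0; i < NR_LOCAL_ZONES; i++) {
		zones[i].alloc_batch = batch;
		zones[i].fair_depleted = false;
	}
}

/* one fair round-robin pass over the local zones */
static struct toy_zone *fair_pass(struct toy_zone *zones, long batch)
{
	int eligible = 0, skipped = 0;

	for (int i = 0; i < NR_LOCAL_ZONES; i++) {
		if (!zones[i].fair_enabled)
			break;		/* ineligible zones end the fair pass */
		eligible++;
		if (zones[i].fair_depleted) {
			skipped++;
			continue;
		}
		charge_alloc_batch(&zones[i], 1);
		return &zones[i];
	}

	/* every eligible batch is spent: refill and let the caller rescan */
	if (eligible && eligible == skipped)
		reset_alloc_batches(zones, batch);
	return NULL;
}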

The comparison is tiobench with a data size of 2*RAM on a small single-node
machine and an ext3 filesystem, although it is known that ext4 sees similar
gains. I'm reporting sequential reads only as the other operations are
essentially flat.

                                      3.16.0-rc1            3.16.0-rc1            3.16.0-rc1                 3.0.0
                                         vanilla          cfq600              fairzone                     vanilla
Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      131.68 (  8.04%)      134.59 ( 10.42%)
Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      113.24 ( 11.04%)      122.59 ( 20.20%)
Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      107.43 ( 10.28%)      114.78 ( 17.82%)
Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)       96.81 ( 16.09%)      100.14 ( 20.09%)
Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.88 ( 18.85%)       81.64 ( 18.50%)

Whereas the CFQ patch only helped throughput for higher numbers of threads,
this patch (fairzone) improves performance for all thread counts and brings
it much closer to 3.0-vanilla. Note that performance can be increased further
by tuning CFQ, but the latencies of read operations are then higher, although
from the IO stats they are still acceptable.

                  3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
                     vanilla      cfq600    fairzone     vanilla
Mean sda-avgqz        912.29      939.89      947.90     1000.70
Mean sda-await       4268.03     4403.99     4450.89     4887.67
Mean sda-r_await       79.42       80.33       81.34      108.53
Mean sda-w_await    13073.49    11038.81    13217.25    11599.83
Max  sda-avgqz       2194.84     2215.01     2307.48     2626.78
Max  sda-await      18157.88    17586.08    14189.21    24971.00
Max  sda-r_await      888.40      874.22      800.80     5308.00
Max  sda-w_await   212563.59   190265.33   173295.33   177698.47

The primary concern with this patch is that it'll break the fair zone
allocation policy, but it should still be fine as long as the working set
fits in memory. When the low watermark is constantly hit, the spread is not
as even as before; however, the policy is still in force most of the time.
This is the allocation spread when running tiobench at 80% of memory:

                            3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
                               vanilla      cfq600    fairzone     vanilla
DMA32 allocs                  11099122    11020083     9459921     7698716
Normal allocs                 18823134    18801874    20429838    18787406
Movable allocs                       0           0           0           0

Note that the number of pages allocated from the Normal zone is still
comparable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   7 +++
 mm/mm_init.c           |   5 +-
 mm/page_alloc.c        | 161 ++++++++++++++++++++++++++++++-------------------
 3 files changed, 109 insertions(+), 64 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cbd1b6..e041f63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -529,6 +529,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -566,6 +567,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
@@ -689,6 +695,7 @@ struct zonelist_cache;
 struct zoneref {
 	struct zone *zone;	/* Pointer to actual zone */
 	int zone_idx;		/* zone_idx(zoneref->zone) */
+	bool fair_enabled;	/* eligible for fair zone policy */
 };
 
 /*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4074caf..37b7337 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -54,8 +54,9 @@ void mminit_verify_zonelist(void)
 			/* Iterate the zonelist */
 			for_each_zone_zonelist(zone, z, zonelist, zoneid) {
 #ifdef CONFIG_NUMA
-				printk(KERN_CONT "%d:%s ",
-					zone->node, zone->name);
+				printk(KERN_CONT "%d:%s%s ",
+					zone->node, zone->name,
+					z->fair_enabled ? "(F)" : "");
 #else
 				printk(KERN_CONT "0:%s ", zone->name);
 #endif /* CONFIG_NUMA */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..7614404 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1543,6 +1543,7 @@ int split_free_page(struct page *page)
  */
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
+			struct zoneref *z,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, int migratetype)
 {
@@ -1596,7 +1597,11 @@ again:
 					  get_freepage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (z->fair_enabled) {
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+		if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
+			zone_set_flag(zone, ZONE_FAIR_DEPLETED);
+	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1909,6 +1914,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			(zone->managed_pages >> 2) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1926,8 +1943,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0, nr_fair_eligible = 0, nr_fail_watermark = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1950,11 +1970,15 @@ zonelist_scan:
 		 * time the page has in memory before being reclaimed.
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
-			if (!zone_local(preferred_zone, zone))
-				continue;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+			if (!zone_local(preferred_zone, zone) || !z->fair_enabled)
+				break;
+			nr_fair_eligible++;
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
+			}
 		}
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
@@ -1994,6 +2018,8 @@ zonelist_scan:
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
+			nr_fail_watermark++;
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
@@ -2050,7 +2076,7 @@ zonelist_scan:
 		}
 
 try_this_zone:
-		page = buffered_rmqueue(preferred_zone, zone, order,
+		page = buffered_rmqueue(preferred_zone, z, zone, order,
 						gfp_mask, migratetype);
 		if (page)
 			break;
@@ -2059,13 +2085,7 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2074,8 +2094,36 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	/*
+	 * The first pass spreads allocations fairly within the local node.
+	 * Reset the counters if necessary and recheck the zonelist taking
+	 * the remote nodes and the fact that a batch count might have
+	 * failed due to per-cpu vmstat accounting drift into account. This
+	 * is preferable to entering the slowpath and waking kswapd.
+	 */
+	if (alloc_flags & ALLOC_FAIR) {
+		alloc_flags &= ~ALLOC_FAIR;
+		if (nr_online_nodes > 1)
+			zonelist_rescan = true;
+		if (nr_fail_watermark || nr_fair_eligible == nr_fair_skipped) {
+			zonelist_rescan = true;
+			reset_alloc_batches(preferred_zone);
+		}
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2396,28 +2444,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2718,7 +2744,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
-	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
+	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
 	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
@@ -2749,33 +2775,18 @@ retry_cpuset:
 		goto out;
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
+	if (preferred_zoneref->fair_enabled)
+		alloc_flags |= ALLOC_FAIR;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
@@ -3288,10 +3299,19 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
-static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
+			struct zoneref *zoneref, struct zone *preferred_zone)
 {
+	int zone_type = zone_idx(zone);
+	bool fair_enabled = zone_local(zone, preferred_zone);
+	if (zone_type == 0 &&
+			zone->managed_pages < (pgdat->node_present_pages >> 4))
+		fair_enabled = false;
+
 	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
+	zoneref->zone_idx = zone_type;
+	zoneref->fair_enabled = fair_enabled;
+	return fair_enabled;
 }
 
 /*
@@ -3304,17 +3324,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	struct zone *preferred_zone = NULL;
+	int nr_fair = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			if (!preferred_zone)
+				preferred_zone = zone;
+
+			nr_fair += zoneref_set_zone(pgdat, zone,
+				&zonelist->_zonerefs[nr_zones++],
+				preferred_zone);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
 
+	if (nr_fair <= 1)
+		zonelist->_zonerefs[0].fair_enabled = false;
+
 	return nr_zones;
 }
 
@@ -3511,6 +3540,7 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
 	zonelist->_zonerefs[j].zone = NULL;
 	zonelist->_zonerefs[j].zone_idx = 0;
+	zonelist->_zonerefs[j].fair_enabled = false;
 }
 
 /*
@@ -3539,8 +3569,9 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
-	struct zone *z;
+	struct zone *z, *preferred_zone = NULL;
 	struct zonelist *zonelist;
+	int nr_fair = 0;
 
 	zonelist = &pgdat->node_zonelists[0];
 	pos = 0;
@@ -3548,15 +3579,22 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (!preferred_zone)
+				preferred_zone = z;
 			if (populated_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
+				nr_fair += zoneref_set_zone(pgdat, z,
+					&zonelist->_zonerefs[pos++],
+					preferred_zone);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
 	zonelist->_zonerefs[pos].zone = NULL;
 	zonelist->_zonerefs[pos].zone_idx = 0;
+	zonelist->_zonerefs[pos].fair_enabled = false;
+
+	if (nr_fair <= 1)
+		zonelist->_zonerefs[0].fair_enabled = false;
 }
 
 static int default_zonelist_order(void)
@@ -5681,8 +5719,7 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 
 		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
-				      high_wmark_pages(zone) -
-				      low_wmark_pages(zone) -
+				      (zone->managed_pages >> 2) -
 				      zone_page_state(zone, NR_ALLOC_BATCH));
 
 		setup_zone_migrate_reserve(zone);
-- 
1.8.4.5



* [PATCH 4/4] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-18  8:23 [PATCH 0/4] Improve sequential read throughput Mel Gorman
                   ` (2 preceding siblings ...)
  2014-06-18  8:23 ` [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired Mel Gorman
@ 2014-06-18  8:23 ` Mel Gorman
  3 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2014-06-18  8:23 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Jan Kara, Johannes Weiner, Jens Axboe, Mel Gorman

When allocating a page cache page for writing, the allocator makes an attempt
to distribute dirty pages proportionally between populated zones. The call
to zone_dirty_ok() is more expensive than expected because of the number of
vmstat counters it examines. This patch caches some of that information to
reduce the cost. It means the proportional allocation is based on stale data,
but the heuristic should not need perfectly accurate information. The impact
is smaller than that of the fair zone policy patch but is still visible.
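
For reference, a rough userspace sketch (my own names, not the kernel API) of
the caching scheme: the allocator's dirty check normally compares against a
cached limit, and the limit is only recomputed from the expensive
dirtyable-memory calculation when a zone actually fails the check.

#include <stdbool.h>

struct toy_zone {
	unsigned long dirty_limit_cached;
	unsigned long nr_dirty;	/* NR_FILE_DIRTY + NR_UNSTABLE_NFS + NR_WRITEBACK */
};

/* stand-in for zone_dirty_limit(): pretend this calculation is expensive */
static unsigned long compute_zone_dirty_limit(const struct toy_zone *zone)
{
	(void)zone;
	return 1000;
}

/* cheap check against possibly-stale data */
static bool zone_dirty_ok_cached(const struct toy_zone *zone)
{
	return zone->nr_dirty <= zone->dirty_limit_cached;
}

static bool try_zone_for_write(struct toy_zone *zone)
{
	if (!zone_dirty_ok_cached(zone)) {
		/* refresh the stale limit before skipping this zone */
		zone->dirty_limit_cached = compute_zone_dirty_limit(zone);
		return false;
	}
	return true;
}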

                                      3.16.0-rc1            3.16.0-rc1            3.16.0-rc1            3.16.0-rc1                 3.0.0
                                         vanilla          cfq600              fairzone             lessdirty                     vanilla
Min    SeqRead-MB/sec-1         121.06 (  0.00%)      120.38 ( -0.56%)      130.25 (  7.59%)      132.03 (  9.06%)      134.04 ( 10.72%)
Min    SeqRead-MB/sec-2         100.06 (  0.00%)       99.17 ( -0.89%)      112.73 ( 12.66%)      114.83 ( 14.76%)      120.76 ( 20.69%)
Min    SeqRead-MB/sec-4          97.10 (  0.00%)       99.14 (  2.10%)      107.24 ( 10.44%)      108.06 ( 11.29%)      114.49 ( 17.91%)
Min    SeqRead-MB/sec-8          81.45 (  0.00%)       89.18 (  9.49%)       94.94 ( 16.56%)       96.41 ( 18.37%)       98.04 ( 20.37%)
Min    SeqRead-MB/sec-16         67.39 (  0.00%)       74.52 ( 10.58%)       81.37 ( 20.74%)       78.62 ( 16.66%)       79.49 ( 17.96%)
Min    RandRead-MB/sec-1          1.06 (  0.00%)        1.09 (  2.83%)        1.09 (  2.83%)        1.07 (  0.94%)        1.07 (  0.94%)
Min    RandRead-MB/sec-2          1.28 (  0.00%)        1.27 ( -0.78%)        1.29 (  0.78%)        1.29 (  0.78%)        1.19 ( -7.03%)
Min    RandRead-MB/sec-4          1.55 (  0.00%)        1.44 ( -7.10%)        1.49 ( -3.87%)        1.53 ( -1.29%)        1.47 ( -5.16%)
Min    RandRead-MB/sec-8          1.73 (  0.00%)        1.75 (  1.16%)        1.68 ( -2.89%)        1.70 ( -1.73%)        1.61 ( -6.94%)
Min    RandRead-MB/sec-16         1.76 (  0.00%)        1.83 (  3.98%)        1.86 (  5.68%)        1.77 (  0.57%)        1.73 ( -1.70%)
Min    SeqWrite-MB/sec-1        113.95 (  0.00%)      115.98 (  1.78%)      116.09 (  1.88%)      115.30 (  1.18%)      113.11 ( -0.74%)
Min    SeqWrite-MB/sec-2        103.00 (  0.00%)      103.27 (  0.26%)      104.31 (  1.27%)      104.26 (  1.22%)      103.49 (  0.48%)
Min    SeqWrite-MB/sec-4         98.42 (  0.00%)       98.16 ( -0.26%)       99.17 (  0.76%)       98.69 (  0.27%)       95.08 ( -3.39%)
Min    SeqWrite-MB/sec-8         92.91 (  0.00%)       93.32 (  0.44%)       93.14 (  0.25%)       93.33 (  0.45%)       89.43 ( -3.75%)
Min    SeqWrite-MB/sec-16        85.96 (  0.00%)       86.33 (  0.43%)       86.71 (  0.87%)       86.67 (  0.83%)       83.04 ( -3.40%)
Min    RandWrite-MB/sec-1         1.34 (  0.00%)        1.30 ( -2.99%)        1.34 (  0.00%)        1.32 ( -1.49%)        1.35 (  0.75%)
Min    RandWrite-MB/sec-2         1.40 (  0.00%)        1.30 ( -7.14%)        1.40 (  0.00%)        1.38 ( -1.43%)        1.44 (  2.86%)
Min    RandWrite-MB/sec-4         1.38 (  0.00%)        1.35 ( -2.17%)        1.36 ( -1.45%)        1.38 (  0.00%)        1.37 ( -0.72%)
Min    RandWrite-MB/sec-8         1.34 (  0.00%)        1.35 (  0.75%)        1.33 ( -0.75%)        1.32 ( -1.49%)        1.33 ( -0.75%)
Min    RandWrite-MB/sec-16        1.35 (  0.00%)        1.33 ( -1.48%)        1.33 ( -1.48%)        1.33 ( -1.48%)        1.33 ( -1.48%)
Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      131.68 (  8.04%)      133.84 (  9.81%)      134.59 ( 10.42%)
Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      113.24 ( 11.04%)      115.01 ( 12.77%)      122.59 ( 20.20%)
Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      107.43 ( 10.28%)      108.40 ( 11.27%)      114.78 ( 17.82%)
Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)       96.81 ( 16.09%)       97.50 ( 16.92%)      100.14 ( 20.09%)
Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.88 ( 18.85%)       82.14 ( 19.22%)       81.64 ( 18.50%)

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h    |  2 ++
 include/linux/writeback.h |  1 +
 mm/internal.h             |  1 +
 mm/page-writeback.c       | 15 +++++++++------
 mm/page_alloc.c           | 15 ++++++++++++---
 5 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e041f63..9ec4459 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -399,6 +399,8 @@ struct zone {
 	int			compact_order_failed;
 #endif
 
+	unsigned long		dirty_limit_cached;
+
 	ZONE_PADDING(_pad1_)
 
 	/* Fields commonly accessed by the page reclaim scanner */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..90190d4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -121,6 +121,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirty_limit(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11f..f31e3b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -370,5 +370,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
+#define ALLOC_DIRTY		0x200 /* spread GFP_WRITE allocations */
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 518e2c3..1990e9a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -298,10 +298,9 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  * Returns the maximum number of dirty pages allowed in a zone, based
  * on the zone's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+unsigned long zone_dirty_limit(struct zone *zone)
 {
 	unsigned long zone_memory = zone_dirtyable_memory(zone);
-	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
@@ -310,9 +309,6 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 	else
 		dirty = vm_dirty_ratio * zone_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
-		dirty += dirty / 4;
-
 	return dirty;
 }
 
@@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
  */
 bool zone_dirty_ok(struct zone *zone)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	unsigned long limit = zone->dirty_limit_cached;
+	struct task_struct *tsk = current;
+
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		limit = zone_dirty_limit(zone);
+		zone->dirty_limit_cached = limit;
+		limit += limit / 4;
+	}
 
 	return zone_page_state(zone, NR_FILE_DIRTY) +
 	       zone_page_state(zone, NR_UNSTABLE_NFS) +
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7614404..c0cddae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1941,9 +1941,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
 	int nr_fair_skipped = 0, nr_fair_eligible = 0, nr_fail_watermark = 0;
+	int nr_fail_dirty = 0;
 	bool zonelist_rescan;
 
 zonelist_scan:
@@ -2005,8 +2004,11 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if ((alloc_flags & ALLOC_DIRTY) && !zone_dirty_ok(zone)) {
+			nr_fail_dirty++;
+			zone->dirty_limit_cached = zone_dirty_limit(zone);
 			continue;
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
@@ -2120,6 +2122,11 @@ this_zone_full:
 		}
 	}
 
+	if ((alloc_flags & ALLOC_DIRTY) && nr_fail_dirty) {
+		alloc_flags &= ~ALLOC_DIRTY;
+		zonelist_rescan = true;
+	}
+
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2777,6 +2784,8 @@ retry_cpuset:
 
 	if (preferred_zoneref->fair_enabled)
 		alloc_flags |= ALLOC_FAIR;
+	if (gfp_mask & __GFP_WRITE)
+		alloc_flags |= ALLOC_DIRTY;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
-- 
1.8.4.5



* Re: [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired
  2014-06-18  8:23 ` [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired Mel Gorman
@ 2014-06-18 20:01   ` Johannes Weiner
  2014-06-18 21:57     ` Mel Gorman
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara, Jens Axboe

Hi Mel,

On Wed, Jun 18, 2014 at 09:23:26AM +0100, Mel Gorman wrote:
> The fair zone allocation policy round-robins allocations between zones
> within a node to avoid age inversion problems during reclaim. If the
> first allocation fails, the batch counts is reset and a second attempt
> made before entering the slow path.
> 
> One assumption made with this scheme is that batches expire at roughly the
> same time and the resets each time are justified. This assumption does not
> hold when zones reach their low watermark as the batches will be consumed
> at uneven rates.
>
> Allocation failure due to watermark depletion result in
> additional zonelist scans for the reset and another watermark check before
> hitting the slowpath.

Yes, one consequence of the changes in 3a025760fc15 ("mm: page_alloc:
spill to remote nodes before waking kswapd") is that on single-node
systems we have one useless spill cycle to non-existent remote nodes.

Your patch adds a nr_online_nodes check, but it also does another
zonelist scan if any watermark breaches were detected in the first
cycle, so I don't see how you actually fix this issue?

> On large NUMA machines, the scanning overhead is higher as zones are
> scanned that are ineligible for zone allocation policy.

I'm not sure we can use your fix for that because of zone-order
zonelists, see inline comments below.

> This patch makes a number of changes which are all related to each
> other. First and foremost, the patch resets the fair zone policy counts when
> all the counters are depleted,

You also still reset them if any of the low watermarks are breached,
so the only time we *would* save a reset cycle now is when all
considered watermarks are fine and there are still some non-depleted
batches - but in this case the allocation would have succeeded...?

> avoids scanning remote nodes unnecessarily

I don't see how we are scanning them unnecessarily now.

> and reduces the frequency that resets are required.

How?

It would be good to start with an analysis of the problem(s) and then
propose a solution based on that, otherwise it makes it very hard to
follow your thought process, and especially match these rather broad
statements to the code when you change multiple things at once.

> Second, when the fair zone batch counter is expired, the zone is
> flagged which has a lighter cache footprint than accessing the
> counters. Lastly, if the local node has only one zone then the fair
> zone allocation policy is not applied to reduce overall overhead.

These two are plausible, but they also make the code harder to
understand and their performance impact is not represented in your
test results, so we can't compare cost and value.

> Comparison is tiobench with data size 2*RAM on a small single-node machine
> and on an ext3 filesystem although it is known that ext4 sees similar gains.
> I'm reporting sequental reads only as the other operations are essentially
> flat.
> 
>                                       3.16.0-rc1            3.16.0-rc1            3.16.0-rc1                 3.0.0
>                                          vanilla          cfq600              fairzone                     vanilla
> Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      131.68 (  8.04%)      134.59 ( 10.42%)
> Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      113.24 ( 11.04%)      122.59 ( 20.20%)
> Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      107.43 ( 10.28%)      114.78 ( 17.82%)
> Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)       96.81 ( 16.09%)      100.14 ( 20.09%)
> Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.88 ( 18.85%)       81.64 ( 18.50%)
> 
> Where as the CFQ patch helped throughput for higher number of threads, this
> patch (fairzone) whos performance increases for all thread counts and brings
> performance much closer to 3.0-vanilla. Note that performance can be further
> increased by tuning CFQ but the latencies of read operations are then higher
> but from the IO stats they are still acceptable.
> 
>                   3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
>                      vanilla      cfq600    fairzone     vanilla
> Mean sda-avgqz        912.29      939.89      947.90     1000.70
> Mean sda-await       4268.03     4403.99     4450.89     4887.67
> Mean sda-r_await       79.42       80.33       81.34      108.53
> Mean sda-w_await    13073.49    11038.81    13217.25    11599.83
> Max  sda-avgqz       2194.84     2215.01     2307.48     2626.78
> Max  sda-await      18157.88    17586.08    14189.21    24971.00
> Max  sda-r_await      888.40      874.22      800.80     5308.00
> Max  sda-w_await   212563.59   190265.33   173295.33   177698.47
> 
> The primary concern with this patch is that it'll break the fair zone
> allocation policy but it should be still fine as long as the working set
> fits in memory. When the low watermark is constantly hit and the spread
> is still even as before. However, the policy is still in force most of the
> time. This is the allocation spread when running tiobench at 80% of memory
> 
>                             3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
>                                vanilla      cfq600    fairzone     vanilla
> DMA32 allocs                  11099122    11020083     9459921     7698716
> Normal allocs                 18823134    18801874    20429838    18787406
> Movable allocs                       0           0           0           0
> 
> Note that the number of pages allocated from the Normal zone is still
> comparable.

When you translate them to percentages, it rather looks like fairness
is closer to pre-fairpolicy levels for this workload:

                             3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
                                vanilla      cfq600    fairzone     vanilla
 DMA32 allocs                     37.1%       37.0%       31.6%       29.1%
 Normal allocs                    62.9%       63.0%       68.4%       70.9%
 Movable allocs                      0%          0%          0%          0%

> @@ -1909,6 +1914,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  
>  #endif	/* CONFIG_NUMA */
>  
> +static void reset_alloc_batches(struct zone *preferred_zone)
> +{
> +	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> +
> +	do {
> +		mod_zone_page_state(zone, NR_ALLOC_BATCH,
> +			(zone->managed_pages >> 2) -
> +			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> +		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> +	} while (zone++ != preferred_zone);

What is zone->managed_pages >> 2 based on?

The batch size was picked so that after all zones were used according
to their size, they would also get reclaimed according to their size,
and the cycle would start over.  This ensures that available memory is
fully utilized and page lifetime stays independent of zone placement.

The page allocator depletes the zones to their low watermark, then
kswapd restores them to their high watermark before the reclaim cycle
starts over.  This means that a reclaim cycle is high - low watermark
pages, which is reflected in the current round-robin batch sizes.

Now, we agree that the batches might drift from the actual reclaim
cycle due to per-cpu counter inaccuracies, but it's still a better
match for the reclaim cycle than "quarter zone size"...?
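
Toy numbers (invented) to make that concrete: with a batch of high_wmark -
low_wmark, one full round of batches drains in roughly one deplete-to-low /
refill-to-high reclaim cycle, while managed_pages >> 2 spans many such
cycles.

#include <stdio.h>

int main(void)
{
	unsigned long low_wmark = 16384;	/* pages, invented */
	unsigned long high_wmark = 20480;	/* pages, invented */
	unsigned long managed_pages = 2000000;	/* invented zone size */

	unsigned long old_batch = high_wmark - low_wmark;	/* 4096 */
	unsigned long new_batch = managed_pages >> 2;		/* 500000 */

	printf("old batch: %lu pages, ~1 reclaim cycle\n", old_batch);
	printf("new batch: %lu pages, ~%lu reclaim cycles\n",
	       new_batch, new_batch / old_batch);
	return 0;
}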

> @@ -1926,8 +1943,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
>  	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
>  				(gfp_mask & __GFP_WRITE);
> +	int nr_fair_skipped = 0, nr_fair_eligible = 0, nr_fail_watermark = 0;
> +	bool zonelist_rescan;
>  
>  zonelist_scan:
> +	zonelist_rescan = false;
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
>  	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
> @@ -1950,11 +1970,15 @@ zonelist_scan:
>  		 * time the page has in memory before being reclaimed.
>  		 */
>  		if (alloc_flags & ALLOC_FAIR) {
> -			if (!zone_local(preferred_zone, zone))
> -				continue;
> -			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> +			if (!zone_local(preferred_zone, zone) || !z->fair_enabled)
> +				break;

The reason this was a "continue" rather than a "break" was because
zonelists can be ordered by zone type, where a local zone can show up
after a remote zone.  It might be worth rethinking the usefulness of
zone-order in general, but it probably shouldn't be a silent side
effect of a performance patch.

All in all, I still don't really understand exactly how your changes
work and the changelog doesn't clarify much :( I'm just having a hard
time seeing how you get 10%-20% performance increase for an IO-bound
workload by making the allocator paths a little leaner.  Your results
certainly show that you *are* improving this particular workload, but
I think we should be clear on the mental model and then go from there.

I haven't managed to reproduce it locally yet, will continue to play
around with the parameters.


* Re: [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired
  2014-06-18 20:01   ` Johannes Weiner
@ 2014-06-18 21:57     ` Mel Gorman
  2014-06-19 18:18       ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2014-06-18 21:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara, Jens Axboe

On Wed, Jun 18, 2014 at 04:01:29PM -0400, Johannes Weiner wrote:
> Hi Mel,
> 
> On Wed, Jun 18, 2014 at 09:23:26AM +0100, Mel Gorman wrote:
> > The fair zone allocation policy round-robins allocations between zones
> > within a node to avoid age inversion problems during reclaim. If the
> > first allocation fails, the batch counts is reset and a second attempt
> > made before entering the slow path.
> > 
> > One assumption made with this scheme is that batches expire at roughly the
> > same time and the resets each time are justified. This assumption does not
> > hold when zones reach their low watermark as the batches will be consumed
> > at uneven rates.
> >
> > Allocation failure due to watermark depletion result in
> > additional zonelist scans for the reset and another watermark check before
> > hitting the slowpath.
> 
> Yes, one consequence of the changes in 3a025760fc15 ("mm: page_alloc:
> spill to remote nodes before waking kswapd") is that on single-node
> systems we have one useless spill cycle to non-existent remote nodes.
> 

Yes.

> Your patch adds a nr_online_nodes check, but it also does another
> zonelist scan if any watermark breaches were detected in the first
> cycle, so I don't see how you actually fix this issue?
> 

In the case where any of the watermarks were breached, it was necessary
to rescan the zones that may have met the watermark but did not have
NR_ALLOC_BATCH available. There were not many good choices on that front
that did not end up adding overhead in other places.

> > On large NUMA machines, the scanning overhead is higher as zones are
> > scanned that are ineligible for zone allocation policy.
> 
> I'm not sure we can use your fix for that because of zone-order
> zonelists, see inline comments below.
> 

At one point I had a comment on that but then deleted it again. In the case
of zone ordering the expectation is that low zones are preserved. The fair
zone allocation policy actually breaks that expectation and violates the zone
ordering rules but in a way that involves scanning zones that cannot be used.

> > This patch makes a number of changes which are all related to each
> > other. First and foremost, the patch resets the fair zone policy counts when
> > all the counters are depleted,
> 
> You also still reset them if any of the low watermarks are breached,
> so the only time we *would* save a reset cycle now is when all
> considered watermarks are fine and there are still some non-depleted
> batches - but in this case the allocation would have succeeded...?
> 

I was also taking into account the possibility that NR_ALLOC_BATCH might
have failed due to per-cpu drift but I get your point. I can see if it can
be improved further. It had simply reached the point where the series had
a sufficiently large impact that I released it.

> > avoids scanning remote nodes unnecessarily
> 
> I don't see how we are scanning them unnecessarily now.
> 
> > and reduces the frequency that resets are required.
> 
> How?
> 

By waiting until all the batches are consumed.

> It would be good to start with an analysis of the problem(s) and then
> propose a solution based on that, otherwise it makes it very hard to
> follow your thought process, and especially match these rather broad
> statements to the code when you change multiple things at once.
> 

I'm not sure what you're looking for here. The problem is that there was a
sizable performance hit due to spending too much time in the allocator fast
path. I suspected there was a secondary hit because the cache footprint is
heavier when switching between the zones but profiles were inconclusive.
There were higher numbers of cache misses during the copying of data, and it
could be inferred that this is partially due to a heavier cache footprint in
the page allocator, but profiles are not really suitable for proving that.
The fact is that using vmstat counters increased the cache footprint because
of the number of spills from the per-cpu counters to the zone counter. Of
course the VM already has a lot of these but the fair zone policy added more.

> > Second, when the fair zone batch counter is expired, the zone is
> > flagged which has a lighter cache footprint than accessing the
> > counters. Lastly, if the local node has only one zone then the fair
> > zone allocation policy is not applied to reduce overall overhead.
> 
> These two are plausible, but they also make the code harder to
> understand and their performance impact is not represented in your
> test results, so we can't compare cost and value.
> 

Do you mean that I hadn't posted results for a NUMA machine? They weren't
available at the time I was writing the changelog but I knew from old results
based on earlier iterations of the patch that it made a difference. The
problem with the NUMA machine is that the results are much more variable
due to locality and the fact that automatic NUMA balancing is enabled
on any tests I do to match what I expect a distribution config to look
like. I felt it was self-evident that applying the fair policy to a node
with a single zone was a bad idea.

> > Comparison is tiobench with data size 2*RAM on a small single-node machine
> > and on an ext3 filesystem although it is known that ext4 sees similar gains.
> > I'm reporting sequental reads only as the other operations are essentially
> > flat.
> > 
> >                                       3.16.0-rc1            3.16.0-rc1            3.16.0-rc1                 3.0.0
> >                                          vanilla          cfq600              fairzone                     vanilla
> > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      131.68 (  8.04%)      134.59 ( 10.42%)
> > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      113.24 ( 11.04%)      122.59 ( 20.20%)
> > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      107.43 ( 10.28%)      114.78 ( 17.82%)
> > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)       96.81 ( 16.09%)      100.14 ( 20.09%)
> > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.88 ( 18.85%)       81.64 ( 18.50%)
> > 
> > Where as the CFQ patch helped throughput for higher number of threads, this
> > patch (fairzone) whos performance increases for all thread counts and brings
> > performance much closer to 3.0-vanilla. Note that performance can be further
> > increased by tuning CFQ but the latencies of read operations are then higher
> > but from the IO stats they are still acceptable.
> > 
> >                   3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
> >                      vanilla      cfq600    fairzone     vanilla
> > Mean sda-avgqz        912.29      939.89      947.90     1000.70
> > Mean sda-await       4268.03     4403.99     4450.89     4887.67
> > Mean sda-r_await       79.42       80.33       81.34      108.53
> > Mean sda-w_await    13073.49    11038.81    13217.25    11599.83
> > Max  sda-avgqz       2194.84     2215.01     2307.48     2626.78
> > Max  sda-await      18157.88    17586.08    14189.21    24971.00
> > Max  sda-r_await      888.40      874.22      800.80     5308.00
> > Max  sda-w_await   212563.59   190265.33   173295.33   177698.47
> > 
> > The primary concern with this patch is that it'll break the fair zone
> > allocation policy but it should be still fine as long as the working set
> > fits in memory. When the low watermark is constantly hit and the spread
> > is still even as before. However, the policy is still in force most of the
> > time. This is the allocation spread when running tiobench at 80% of memory
> > 
> >                             3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
> >                                vanilla      cfq600    fairzone     vanilla
> > DMA32 allocs                  11099122    11020083     9459921     7698716
> > Normal allocs                 18823134    18801874    20429838    18787406
> > Movable allocs                       0           0           0           0
> > 
> > Note that the number of pages allocated from the Normal zone is still
> > comparable.
> 
> When you translate them to percentages, it rather looks like fairness
> is closer to pre-fairpolicy levels for this workload:
> 
>                              3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
>                                 vanilla      cfq600    fairzone     vanilla
>  DMA32 allocs                     37.1%       37.0%       31.6%       29.1%
>  Normal allocs                    62.9%       63.0%       68.4%       70.9%
>  Movable allocs                      0%          0%          0%          0%
> 

I can re-examine it. The key problem here is that once the low watermark is
reached we can either adhere to the fair zone policy and stall the allocator
by dropping into the slow path and/or waiting for kswapd to make progress,
or we can break the fair zone allocation policy, make progress now and hope
that reclaim does not cause problems later. That is a bleak choice.

Ideally zones would go away altogether and the LRU lists and allocator paths
would use the same list, with the overhead of additional scanning if pages
from a particular zone were required. That would remove the need for the fair
zone policy entirely. However, this would be a heavy rearchitecting of the
current infrastructure and is not guaranteed to work correctly.

> > @@ -1909,6 +1914,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> >  
> >  #endif	/* CONFIG_NUMA */
> >  
> > +static void reset_alloc_batches(struct zone *preferred_zone)
> > +{
> > +	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> > +
> > +	do {
> > +		mod_zone_page_state(zone, NR_ALLOC_BATCH,
> > +			(zone->managed_pages >> 2) -
> > +			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> > +		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> > +	} while (zone++ != preferred_zone);
> 
> What is zone->managed_pages >> 2 based on?
> 

Magic number that would allow more progress to be made before switching
to the lower zone.

> The batch size was picked so that after all zones were used according
> to their size, they would also get reclaimed according to their size,
> and the cycle would start over.  This ensures that available memory is
> fully utilized and page lifetime stays independent of zone placement.
> 
> The page allocator depletes the zones to their low watermark, then
> kswapd restores them to their high watermark before the reclaim cycle
> starts over.  This means that a reclaim cycle is high - low watermark
> pages, which is reflected in the current round-robin batch sizes.
> 
> Now, we agree that the batches might drift from the actual reclaim
> cycle due to per-cpu counter inaccuracies, but it's still a better
> match for the reclaim cycle than "quarter zone size"...?
> 

Fair enough, I'll restore it. At the time the priority was to minimise
any cache effect from switching zones.

> > @@ -1926,8 +1943,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> >  	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
> >  				(gfp_mask & __GFP_WRITE);
> > +	int nr_fair_skipped = 0, nr_fair_eligible = 0, nr_fail_watermark = 0;
> > +	bool zonelist_rescan;
> >  
> >  zonelist_scan:
> > +	zonelist_rescan = false;
> >  	/*
> >  	 * Scan zonelist, looking for a zone with enough free.
> >  	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
> > @@ -1950,11 +1970,15 @@ zonelist_scan:
> >  		 * time the page has in memory before being reclaimed.
> >  		 */
> >  		if (alloc_flags & ALLOC_FAIR) {
> > -			if (!zone_local(preferred_zone, zone))
> > -				continue;
> > -			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> > +			if (!zone_local(preferred_zone, zone) || !z->fair_enabled)
> > +				break;
> 
> The reason this was a "continue" rather than a "break" was because
> zonelists can be ordered by zone type, where a local zone can show up
> after a remote zone.  It might be worth rethinking the usefulness of
> zone-order in general, but it probably shouldn't be a silent side
> effect of a performance patch.
> 

I know that. Again, zone order is expected to preserve lower zones and
the fair zone allocation policy actually breaks that. This both restores
the expected behaviour of zone-ordered zonelists and reduces overhead.
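
To make the ordering point concrete (the layouts below are illustrative,
not dumped from a real machine):

	/*
	 * Node-ordered zonelist for an allocation on node 0:
	 *
	 *	Node0-Normal, Node0-DMA32, Node1-Normal, Node1-DMA32, ...
	 *
	 * Once a non-local zone is seen, every later entry is also remote,
	 * so "break" and "continue" behave the same for the fair pass.
	 *
	 * Zone-ordered zonelist:
	 *
	 *	Node0-Normal, Node1-Normal, Node0-DMA32, Node1-DMA32, ...
	 *
	 * Here a local lower zone (Node0-DMA32) appears after a remote one,
	 * so "break" stops applying the fair policy to it where "continue"
	 * would not. That matches the lower-zone preservation expected of
	 * zone ordering, but it is a behavioural change worth calling out.
	 */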

> All in all, I still don't really understand exactly how your changes
> work and the changelog doesn't clarify much :( I'm just having a hard
> time seeing how you get 10%-20% performance increase for an IO-bound
> workload by making the allocator paths a little leaner.  Your results
> certainly show that you *are* improving this particular workload, but
> I think we should be clear on the mental model and then go from there.
> 

The mental model is that too much overhead in the allocator fast path
builds up over time. The second part is that switching between the free
lists of different zones, and skipping the preferred zone when its batch
is depleted, has a heavier cache footprint which in turn has additional
follow-on effects. Consider for example that recently freed pages that
are potentially cache hot may be ignored by the fair zone allocation
policy in favour of cache-cold pages in a lower zone. The exact impact
of this depends on the CPU that did the freeing and whether the cache is
shared, so analysing it in the general sense would be prohibitive.

> I haven't managed to reproduce it locally yet, will continue to play
> around with the parameters.

Does that mean you have tried using tiobench and saw no effect or you
haven't run tiobench yet? FWIW, I've seen this impact on multiple machines.

These are the results from a laptop. cfq600 is patch 1 of this series and
fairzone is this patch, so you're looking at the difference between those
kernels. I'm using 3.0 as the baseline to highlight the relative damage
in current kernels. I suspect that in some cases kernels even older than
3.0 would have been better again.

tiobench MB/sec
                                           3.0.0            3.16.0-rc1            3.16.0-rc1            3.16.0-rc1
                                         vanilla               vanilla          cfq600-v2r38        fairzone-v2r38
Mean   SeqRead-MB/sec-1          94.82 (  0.00%)       85.03 (-10.32%)       85.02 (-10.34%)       97.33 (  2.65%)
Mean   SeqRead-MB/sec-2          78.22 (  0.00%)       67.55 (-13.64%)       69.06 (-11.71%)       76.63 ( -2.03%)
Mean   SeqRead-MB/sec-4          63.57 (  0.00%)       59.68 ( -6.11%)       63.61 (  0.07%)       70.91 ( 11.56%)
Mean   SeqRead-MB/sec-8          54.51 (  0.00%)       48.48 (-11.07%)       52.15 ( -4.32%)       55.42 (  1.66%)
Mean   SeqRead-MB/sec-16         46.43 (  0.00%)       44.42 ( -4.34%)       49.74 (  7.13%)       51.45 ( 10.80%)
Mean   RandRead-MB/sec-1          0.84 (  0.00%)        0.84 (  0.40%)        0.85 (  1.99%)        0.85 (  1.20%)
Mean   RandRead-MB/sec-2          1.01 (  0.00%)        0.99 ( -1.65%)        0.97 ( -3.63%)        1.00 ( -0.99%)
Mean   RandRead-MB/sec-4          1.16 (  0.00%)        1.23 (  6.34%)        1.20 (  3.46%)        1.17 (  1.15%)
Mean   RandRead-MB/sec-8          1.32 (  0.00%)        1.38 (  4.28%)        1.34 (  1.51%)        1.33 (  0.76%)
Mean   RandRead-MB/sec-16         1.32 (  0.00%)        1.38 (  4.81%)        1.35 (  2.28%)        1.42 (  7.59%)
Mean   SeqWrite-MB/sec-1         73.99 (  0.00%)       76.59 (  3.52%)       77.41 (  4.63%)       78.74 (  6.43%)
Mean   SeqWrite-MB/sec-2         65.83 (  0.00%)       67.45 (  2.47%)       67.52 (  2.57%)       68.14 (  3.51%)
Mean   SeqWrite-MB/sec-4         60.09 (  0.00%)       61.92 (  3.05%)       62.11 (  3.37%)       62.80 (  4.52%)
Mean   SeqWrite-MB/sec-8         51.98 (  0.00%)       54.18 (  4.23%)       54.11 (  4.10%)       54.62 (  5.07%)
Mean   SeqWrite-MB/sec-16        53.20 (  0.00%)       55.98 (  5.23%)       55.98 (  5.24%)       55.58 (  4.48%)
Mean   RandWrite-MB/sec-1         1.01 (  0.00%)        1.05 (  3.62%)        1.07 (  5.59%)        1.05 (  3.29%)
Mean   RandWrite-MB/sec-2         1.03 (  0.00%)        1.00 ( -3.23%)        1.04 (  0.32%)        1.06 (  2.90%)
Mean   RandWrite-MB/sec-4         0.98 (  0.00%)        1.01 (  3.75%)        0.99 (  1.02%)        1.01 (  3.75%)
Mean   RandWrite-MB/sec-8         0.93 (  0.00%)        0.96 (  3.24%)        0.93 (  0.72%)        0.93 (  0.36%)
Mean   RandWrite-MB/sec-16        0.91 (  0.00%)        0.91 (  0.37%)        0.91 ( -0.37%)        0.90 ( -0.73%)

The following is from a slightly better desktop than the machine used
for the changelog figures.

tiobench MB/sec
                                           3.0.0            3.16.0-rc1            3.16.0-rc1            3.16.0-rc1
                                         vanilla               vanilla          cfq600-v2r38        fairzone-v2r38
Mean   SeqRead-MB/sec-1         136.66 (  0.00%)      125.09 ( -8.47%)      125.97 ( -7.82%)      135.60 ( -0.77%)
Mean   SeqRead-MB/sec-2         110.52 (  0.00%)       96.78 (-12.44%)       98.92 (-10.50%)      102.64 ( -7.14%)
Mean   SeqRead-MB/sec-4          91.20 (  0.00%)       83.55 ( -8.38%)       86.39 ( -5.27%)       90.74 ( -0.50%)
Mean   SeqRead-MB/sec-8          73.85 (  0.00%)       66.97 ( -9.32%)       72.63 ( -1.66%)       75.32 (  1.99%)
Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       82.15 ( -4.97%)       94.01 (  8.75%)       95.35 ( 10.30%)
Mean   RandRead-MB/sec-1          0.94 (  0.00%)        0.92 ( -1.77%)        0.91 ( -3.19%)        0.93 ( -1.42%)
Mean   RandRead-MB/sec-2          1.07 (  0.00%)        1.05 ( -2.48%)        1.08 (  0.62%)        1.06 ( -0.93%)
Mean   RandRead-MB/sec-4          1.27 (  0.00%)        1.36 (  6.54%)        1.31 (  2.88%)        1.32 (  3.66%)
Mean   RandRead-MB/sec-8          1.35 (  0.00%)        1.41 (  4.20%)        1.49 ( 10.37%)        1.37 (  1.48%)
Mean   RandRead-MB/sec-16         1.68 (  0.00%)        1.74 (  3.17%)        1.77 (  4.95%)        1.72 (  2.18%)
Mean   SeqWrite-MB/sec-1        113.16 (  0.00%)      116.46 (  2.92%)      117.04 (  3.43%)      116.89 (  3.30%)
Mean   SeqWrite-MB/sec-2         96.55 (  0.00%)       94.06 ( -2.58%)       93.69 ( -2.96%)       93.84 ( -2.81%)
Mean   SeqWrite-MB/sec-4         81.35 (  0.00%)       81.98 (  0.77%)       81.84 (  0.59%)       81.69 (  0.41%)
Mean   SeqWrite-MB/sec-8         71.93 (  0.00%)       72.61 (  0.94%)       72.44 (  0.70%)       72.33 (  0.55%)
Mean   SeqWrite-MB/sec-16        94.22 (  0.00%)       96.61 (  2.54%)       96.83 (  2.77%)       96.87 (  2.81%)
Mean   RandWrite-MB/sec-1         1.07 (  0.00%)        1.12 (  4.35%)        1.09 (  1.55%)        1.10 (  2.80%)
Mean   RandWrite-MB/sec-2         1.06 (  0.00%)        1.07 (  0.63%)        1.06 (  0.00%)        1.05 ( -1.26%)
Mean   RandWrite-MB/sec-4         1.03 (  0.00%)        1.01 ( -2.58%)        1.03 ( -0.00%)        1.03 ( -0.32%)
Mean   RandWrite-MB/sec-8         0.98 (  0.00%)        0.99 (  1.71%)        0.98 (  0.00%)        0.98 (  0.34%)
Mean   RandWrite-MB/sec-16        1.02 (  0.00%)        1.03 (  1.31%)        1.02 (  0.66%)        1.01 ( -0.66%)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] mm: page_alloc: Reset fair zone allocation policy when batch counts are expired
  2014-06-18 21:57     ` Mel Gorman
@ 2014-06-19 18:18       ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2014-06-19 18:18 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara, Jens Axboe

On Wed, Jun 18, 2014 at 10:57:34PM +0100, Mel Gorman wrote:
> > > On large NUMA machines, the scanning overhead is higher as zones are
> > > scanned that are ineligible for zone allocation policy.
> > 
> > I'm not sure we can use your fix for that because of zone-order
> > zonelists, see inline comments below.
> > 
> 
> At one point I had a comment on that but then deleted it again. In the case
> of zone ordering the expectation is that low zones are preserved. The fair
> zone allocation policy actually breaks that expectation and violates the zone
> ordering rules but in a way that involves scanning zones that cannot be used.

That makes a lot of sense.  It would be great to make this a separate,
documented change, though.

> > It would be good to start with an analysis of the problem(s) and then
> > propose a solution based on that, otherwise it makes it very hard to
> > follow your thought process, and especially match these rather broad
> > statements to the code when you change multiple things at once.
> > 
> 
> I'm not sure what you're looking for here. The problem is that there was a
> sizable performance hit due to spending too much time in the allocator fast
> path. I suspected there was a secondary hit because the cache footprint is
> heavier when switching between the zones, but profiles were inconclusive.
> There was a higher number of cache misses during the copying of data, and it
> could be inferred that this is partially due to a heavier cache footprint in
> the page allocator, but profiles are not really suitable for proving that.
> The fact is that using vmstat counters increased the cache footprint because
> of the number of spills from the per-cpu counter to the zone counter. Of
> course the VM already has a lot of these, but the fair zone policy added more.

I think mainly I'm asking to split these individual changes out.  It's
a single change that almost doubles the implementation size and changes
the behavior in non-obvious ways, and it was hard to find a description
or justification for each change in the changelog.

> > > Second, when the fair zone batch counter is expired, the zone is
> > > flagged which has a lighter cache footprint than accessing the
> > > counters. Lastly, if the local node has only one zone then the fair
> > > zone allocation policy is not applied to reduce overall overhead.
> > 
> > These two are plausible, but they also make the code harder to
> > understand and their performance impact is not represented in your
> > test results, so we can't compare cost and value.
> > 
> 
> Do you mean that I hadn't posted results for a NUMA machine? They weren't
> available at the time I was writing the changelog but I knew from old results
> based on earlier iterations of the patch that it made a difference. The
> problem with the NUMA machine is that the results are much more variable
> due to locality and the fact that automatic NUMA balancing is enabled
> on any tests I do to match what I expect a distribution config to look
> like. I felt it was self-evident that applying the fair policy to a node
> with a single zone was a bad idea.

The single-zone node avoidance does make sense when considered in
isolation, agreed.  The depleted flags come with atomic bit ops and
add an extra conditional in the fast path to avoid a word-sized read,
and writes to that word are batched depending on machine size, so to
me the cost/benefit of it really isn't all that obvious.
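
For reference, my reading of the fast path with the flag is roughly the
following (a sketch only; it assumes a zone_is_fair_depleted() test
helper to go with the zone_clear_flag() in the reset path):

	if (alloc_flags & ALLOC_FAIR) {
		if (!zone_local(preferred_zone, zone))
			break;
		/* The extra conditional: test the flag word first... */
		if (zone_is_fair_depleted(zone)) {
			nr_fair_skipped++;
			continue;
		}
		/* ...to avoid touching the NR_ALLOC_BATCH counter */
		if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) {
			/* atomic bit op when the batch first expires */
			zone_set_flag(zone, ZONE_FAIR_DEPLETED);
			nr_fair_skipped++;
			continue;
		}
	}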

> > > Comparison is tiobench with data size 2*RAM on a small single-node machine
> > > and on an ext3 filesystem although it is known that ext4 sees similar gains.
> > > I'm reporting sequental reads only as the other operations are essentially
> > > flat.
> > > 
> > >                                       3.16.0-rc1            3.16.0-rc1            3.16.0-rc1                 3.0.0
> > >                                          vanilla          cfq600              fairzone                     vanilla
> > > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      131.68 (  8.04%)      134.59 ( 10.42%)
> > > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      113.24 ( 11.04%)      122.59 ( 20.20%)
> > > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      107.43 ( 10.28%)      114.78 ( 17.82%)
> > > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)       96.81 ( 16.09%)      100.14 ( 20.09%)
> > > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.88 ( 18.85%)       81.64 ( 18.50%)
> > > 
> > > Where as the CFQ patch helped throughput for higher number of threads, this
> > > patch (fairzone) whos performance increases for all thread counts and brings
> > > performance much closer to 3.0-vanilla. Note that performance can be further
> > > increased by tuning CFQ but the latencies of read operations are then higher
> > > but from the IO stats they are still acceptable.
> > > 
> > >                   3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
> > >                      vanilla      cfq600    fairzone     vanilla
> > > Mean sda-avgqz        912.29      939.89      947.90     1000.70
> > > Mean sda-await       4268.03     4403.99     4450.89     4887.67
> > > Mean sda-r_await       79.42       80.33       81.34      108.53
> > > Mean sda-w_await    13073.49    11038.81    13217.25    11599.83
> > > Max  sda-avgqz       2194.84     2215.01     2307.48     2626.78
> > > Max  sda-await      18157.88    17586.08    14189.21    24971.00
> > > Max  sda-r_await      888.40      874.22      800.80     5308.00
> > > Max  sda-w_await   212563.59   190265.33   173295.33   177698.47
> > > 
> > > The primary concern with this patch is that it'll break the fair zone
> > > allocation policy but it should be still fine as long as the working set
> > > fits in memory. When the low watermark is constantly hit and the spread
> > > is still even as before. However, the policy is still in force most of the
> > > time. This is the allocation spread when running tiobench at 80% of memory
> > > 
> > >                             3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
> > >                                vanilla      cfq600    fairzone     vanilla
> > > DMA32 allocs                  11099122    11020083     9459921     7698716
> > > Normal allocs                 18823134    18801874    20429838    18787406
> > > Movable allocs                       0           0           0           0
> > > 
> > > Note that the number of pages allocated from the Normal zone is still
> > > comparable.
> > 
> > When you translate them to percentages, it rather looks like fairness
> > is closer to pre-fairpolicy levels for this workload:
> > 
> >                              3.16.0-rc1  3.16.0-rc1  3.16.0-rc1       3.0.0
> >                                 vanilla      cfq600    fairzone     vanilla
> >  DMA32 allocs                     37.1%       37.0%       31.6%       29.1%
> >  Normal allocs                    62.9%       63.0%       68.4%       70.9%
> >  Movable allocs                      0%          0%          0%          0%
> > 
> 
> I can re-examine it again. The key problem here is that once the low
> watermark is reached we can either adhere to the fair zone policy and
> stall the allocator by dropping into the slow path and/or waiting for
> kswapd to make progress, or we can break the fair zone allocation
> policy, make progress now and hope that reclaim does not cause problems
> later. That is a bleak choice.

I'm not sure I follow entirely, but we definitely do rely on kswapd to
make forward progress right now because the assumption is that once we
allocated high - low wmark pages, kswapd will reclaim that same amount
as well.  And we do break fairness if that's not the case, but I think
that's actually good enough for practical purposes.

An alternative would be to increase the reclaim cycle and make the
distance between low and high watermarks bigger.  That would trade
some memory utilization for CPU time in the allocator.
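
For anyone following along, the knob behind that distance is
min_free_kbytes; the relationship is roughly the following (a sketch of
what __setup_per_zone_wmarks() does, not the verbatim code, and the
per-zone share helper is made up):

	/* tmp is this zone's proportional share of min_free_kbytes, in pages */
	unsigned long tmp = zone_share_of_min_free(zone);	/* hypothetical */

	zone->watermark[WMARK_MIN]  = tmp;
	zone->watermark[WMARK_LOW]  = tmp + (tmp >> 2);	/* min + 25% */
	zone->watermark[WMARK_HIGH] = tmp + (tmp >> 1);	/* min + 50% */

	/*
	 * The reclaim cycle is high - low, i.e. about min/4. Raising
	 * min_free_kbytes or spreading low and high further apart would
	 * lengthen the cycle at the cost of more memory kept free.
	 */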

> Ideally zones would go away altogether and the LRU lists and allocator
> paths would use the same list, with the overhead of additional scanning
> if pages from a particular zone were required. That would remove the
> need for the fair zone policy entirely. However, this would be a heavy
> re-architecting of the current infrastructure and is not guaranteed to
> work correctly.

Good lord, yes please!

But not exactly something we can do now and backport into stable ;-)

> > > @@ -1909,6 +1914,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> > >  
> > >  #endif	/* CONFIG_NUMA */
> > >  
> > > +static void reset_alloc_batches(struct zone *preferred_zone)
> > > +{
> > > +	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> > > +
> > > +	do {
> > > +		mod_zone_page_state(zone, NR_ALLOC_BATCH,
> > > +			(zone->managed_pages >> 2) -
> > > +			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> > > +		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> > > +	} while (zone++ != preferred_zone);
> > 
> > What is zone->managed_pages >> 2 based on?
> > 
> 
> Magic number that would allow more progress to be made before switching
> to the lower zone.
>
> > The batch size was picked so that after all zones were used according
> > to their size, they would also get reclaimed according to their size,
> > and the cycle would start over.  This ensures that available memory is
> > fully utilized and page lifetime stays independent of zone placement.
> > 
> > The page allocator depletes the zones to their low watermark, then
> > kswapd restores them to their high watermark before the reclaim cycle
> > starts over.  This means that a reclaim cycle is high - low watermark
> > pages, which is reflected in the current round-robin batch sizes.
> > 
> > Now, we agree that the batches might drift from the actual reclaim
> > cycle due to per-cpu counter inaccuracies, but it's still a better
> > match for the reclaim cycle than "quarter zone size"...?
> > 
> 
> Fair enough, I'll restore it. At the time the priority was to minimise
> any cache effect from switching zones.

Unfortunately, for the given tests, this might be the main impact of
this patch.  Here are my tiobench results with ONLY increasing the
batch size to managed_pages >> 2, without even the CFQ patch, which I
understand accounts for the remaining drop at higher concurrency:

tiobench MB/sec
                                           3.0.0              3.16-rc1              3.16-rc1
                                         vanilla               vanilla            bigbatches
Mean   SeqRead-MB/sec-1         130.39 (  0.00%)      129.69 ( -0.54%)      137.02 (  5.08%)
Mean   SeqRead-MB/sec-2         128.14 (  0.00%)      115.63 ( -9.76%)      127.25 ( -0.69%)
Mean   SeqRead-MB/sec-4         125.06 (  0.00%)      110.03 (-12.02%)      121.35 ( -2.97%)
Mean   SeqRead-MB/sec-8         118.97 (  0.00%)      101.86 (-14.38%)      110.48 ( -7.14%)
Mean   SeqRead-MB/sec-16         96.30 (  0.00%)       86.30 (-10.39%)       94.07 ( -2.32%)

But yeah, they also wreck fairness to prehistoric levels:

                                3.0.0    3.16-rc1    3.16-rc1
                              vanilla     vanilla  bigbatches
Zone normal velocity      15772.202   11346.939   15234.211
Zone dma32 velocity        3102.806    8196.689    3437.191

> > All in all, I still don't really understand exactly how your changes
> > work and the changelog doesn't clarify much :( I'm just having a hard
> > time seeing how you get 10%-20% performance increase for an IO-bound
> > workload by making the allocator paths a little leaner.  Your results
> > certainly show that you *are* improving this particular workload, but
> > I think we should be clear on the mental model and then go from there.
> > 
> 
> The mental model is that too much overhead in the allocator fast path
> builds up over time. The second part is that switching between the free
> lists of different zones, and skipping the preferred zone when its batch
> is depleted, has a heavier cache footprint which in turn has additional
> follow-on effects. Consider for example that recently freed pages that
> are potentially cache hot may be ignored by the fair zone allocation
> policy in favour of cache-cold pages in a lower zone. The exact impact
> of this depends on the CPU that did the freeing and whether the cache is
> shared, so analysing it in the general sense would be prohibitive.

Yes, I think these problems are inherent in a fairness system; it's
just a question of how we find the right balance between fairness and
throughput while preserving a meaningful model of how it is supposed to
work, and I don't think the proposed magic batch sizes qualify.

The expired-flags and skipping single-zone nodes OTOH were immediately
obvious because they optimize while preserving the existing semantics,
although I kind of still want to see these things quantified.

> > I haven't managed to reproduce it locally yet, will continue to play
> > around with the parameters.
> 
> Does that mean you have tried using tiobench and saw no effect or you
> haven't run tiobench yet? FWIW, I've seen this impact on multiple machines.

I tried My Favorite IO Benchmarks first, but they wouldn't yield
anything.  I could reproduce the problem with tiobench and the mmtests
standard configuration.

o bigbatches is increasing the batch to managed_pages >> 2

o bigbatches1node additionally avoids the unfair second remote spill
  pass on a single-node system and goes straight into the slow path,
  but it looks like that optimization drowns in the noise

tiobench MB/sec
                                           3.0.0              3.16-rc1              3.16-rc1              3.16-rc1
                                         vanilla               vanilla            bigbatches       bigbatches1node
Mean   SeqRead-MB/sec-1         130.39 (  0.00%)      129.69 ( -0.54%)      137.02 (  5.08%)      137.19 (  5.21%)
Mean   SeqRead-MB/sec-2         128.14 (  0.00%)      115.63 ( -9.76%)      127.25 ( -0.69%)      127.45 ( -0.53%)
Mean   SeqRead-MB/sec-4         125.06 (  0.00%)      110.03 (-12.02%)      121.35 ( -2.97%)      120.83 ( -3.38%)
Mean   SeqRead-MB/sec-8         118.97 (  0.00%)      101.86 (-14.38%)      110.48 ( -7.14%)      111.06 ( -6.65%)
Mean   SeqRead-MB/sec-16         96.30 (  0.00%)       86.30 (-10.39%)       94.07 ( -2.32%)       94.42 ( -1.96%)
Mean   RandRead-MB/sec-1          1.10 (  0.00%)        1.16 (  4.83%)        1.13 (  2.11%)        1.12 (  1.51%)
Mean   RandRead-MB/sec-2          1.29 (  0.00%)        1.27 ( -1.55%)        1.27 ( -1.81%)        1.27 ( -1.29%)
Mean   RandRead-MB/sec-4          1.51 (  0.00%)        1.48 ( -1.98%)        1.43 ( -5.51%)        1.46 ( -3.30%)
Mean   RandRead-MB/sec-8          1.60 (  0.00%)        1.70 (  6.46%)        1.62 (  1.25%)        1.68 (  5.21%)
Mean   RandRead-MB/sec-16         1.71 (  0.00%)        1.74 (  1.76%)        1.65 ( -3.52%)        1.72 (  0.98%)
Mean   SeqWrite-MB/sec-1        124.36 (  0.00%)      124.53 (  0.14%)      124.48 (  0.09%)      124.41 (  0.04%)
Mean   SeqWrite-MB/sec-2        117.16 (  0.00%)      117.58 (  0.36%)      117.59 (  0.37%)      117.74 (  0.50%)
Mean   SeqWrite-MB/sec-4        112.48 (  0.00%)      113.65 (  1.04%)      113.76 (  1.14%)      113.96 (  1.32%)
Mean   SeqWrite-MB/sec-8        110.40 (  0.00%)      110.76 (  0.33%)      111.28 (  0.80%)      111.65 (  1.14%)
Mean   SeqWrite-MB/sec-16       107.62 (  0.00%)      108.26 (  0.59%)      108.90 (  1.19%)      108.64 (  0.94%)
Mean   RandWrite-MB/sec-1         1.23 (  0.00%)        1.26 (  2.99%)        1.29 (  4.89%)        1.28 (  4.08%)
Mean   RandWrite-MB/sec-2         1.27 (  0.00%)        1.27 ( -0.26%)        1.28 (  0.79%)        1.31 (  3.41%)
Mean   RandWrite-MB/sec-4         1.23 (  0.00%)        1.24 (  0.81%)        1.27 (  3.25%)        1.27 (  3.25%)
Mean   RandWrite-MB/sec-8         1.23 (  0.00%)        1.26 (  2.17%)        1.26 (  2.44%)        1.24 (  0.81%)
Mean   RandWrite-MB/sec-16        1.19 (  0.00%)        1.24 (  4.21%)        1.24 (  4.21%)        1.25 (  5.06%)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-18  8:23 ` [PATCH 1/4] cfq: Increase default value of target_latency Mel Gorman
@ 2014-06-19 18:38   ` Jeff Moyer
  2014-06-19 21:42     ` Dave Chinner
  2014-06-20 11:28     ` Mel Gorman
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff Moyer @ 2014-06-19 18:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara, Johannes Weiner,
	Jens Axboe

Mel Gorman <mgorman@suse.de> writes:

> The existing CFQ default target_latency results in very poor performance
> for larger numbers of threads doing sequential reads.  While this can be
> easily described as a tuning problem for users, it is one that is tricky
> to detect. This patch increases the default on the assumption that people with access
> to expensive fast storage also know how to tune their IO scheduler.
>
> The following is from tiobench run on a mid-range desktop with a single
> spinning disk.
>
>                                       3.16.0-rc1            3.16.0-rc1                 3.0.0
>                                          vanilla          cfq600                     vanilla
> Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
> Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
> Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
> Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
> Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)

Did you test any workloads other than this?  Also, what normal workload
has 8 or more threads doing sequential reads?  (That's an honest
question.)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-19 18:38   ` Jeff Moyer
@ 2014-06-19 21:42     ` Dave Chinner
  2014-06-20 11:30       ` Mel Gorman
  2014-06-20 11:28     ` Mel Gorman
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-06-19 21:42 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Mel Gorman, Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara,
	Johannes Weiner, Jens Axboe

On Thu, Jun 19, 2014 at 02:38:44PM -0400, Jeff Moyer wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > The existing CFQ default target_latency results in very poor performance
> > for larger numbers of threads doing sequential reads.  While this can be
> > easily described as a tuning problem for users, it is one that is tricky
> > to detect. This patch increases the default on the assumption that people with access
> > to expensive fast storage also know how to tune their IO scheduler.
> >
> > The following is from tiobench run on a mid-range desktop with a single
> > spinning disk.
> >
> >                                       3.16.0-rc1            3.16.0-rc1                 3.0.0
> >                                          vanilla          cfq600                     vanilla
> > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
> > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
> > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
> > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
> > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)
> 
> Did you test any workloads other than this?  Also, what normal workload
> has 8 or more threads doing sequential reads?  (That's an honest
> question.)

I'd also suggest that making changes based on the assumption that
people affected by the change know how to tune CFQ is a bad idea.
When CFQ misbehaves, most people just switch to deadline or no-op
because they don't understand how CFQ works, nor what all the
knobs do or which ones to tweak to solve their problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-19 18:38   ` Jeff Moyer
  2014-06-19 21:42     ` Dave Chinner
@ 2014-06-20 11:28     ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2014-06-20 11:28 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara, Johannes Weiner,
	Jens Axboe

On Thu, Jun 19, 2014 at 02:38:44PM -0400, Jeff Moyer wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > The existing CFQ default target_latency results in very poor performance
> > for larger numbers of threads doing sequential reads.  While this can be
> > easily described as a tuning problem for users, it is one that is tricky
> > to detect. This patch increases the default on the assumption that people with access
> > to expensive fast storage also know how to tune their IO scheduler.
> >
> > The following is from tiobench run on a mid-range desktop with a single
> > spinning disk.
> >
> >                                       3.16.0-rc1            3.16.0-rc1                 3.0.0
> >                                          vanilla          cfq600                     vanilla
> > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
> > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
> > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
> > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
> > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)
> 
> Did you test any workloads other than this? 

dd tests were inconclusive due to high variability. The dbench results
had not come through yet, but regression tests there indicate that it
has regressed for high numbers of clients. I know that sequential reads
in benchmarks like bonnie++ have also regressed but I have not
re-verified the results yet.

> Also, what normal workload
> has 8 or more threads doing sequential reads?  (That's an honest
> question.)
> 

File servers, mail servers, streaming media servers with multiple users,
and multi-user systems.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-19 21:42     ` Dave Chinner
@ 2014-06-20 11:30       ` Mel Gorman
  2014-06-21  0:39         ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2014-06-20 11:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Moyer, Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara,
	Johannes Weiner, Jens Axboe

On Fri, Jun 20, 2014 at 07:42:14AM +1000, Dave Chinner wrote:
> On Thu, Jun 19, 2014 at 02:38:44PM -0400, Jeff Moyer wrote:
> > Mel Gorman <mgorman@suse.de> writes:
> > 
> > > The existing CFQ default target_latency results in very poor performance
> > > for larger numbers of threads doing sequential reads.  While this can be
> > > easily described as a tuning problem for users, it is one that is tricky
> > > to detect. This patch increases the default on the assumption that people with access
> > > to expensive fast storage also know how to tune their IO scheduler.
> > >
> > > The following is from tiobench run on a mid-range desktop with a single
> > > spinning disk.
> > >
> > >                                       3.16.0-rc1            3.16.0-rc1                 3.0.0
> > >                                          vanilla          cfq600                     vanilla
> > > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
> > > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
> > > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
> > > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
> > > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)
> > 
> > Did you test any workloads other than this?  Also, what normal workload
> > has 8 or more threads doing sequential reads?  (That's an honest
> > question.)
> 
> I'd also suggest that making changes based on the assumption that
> people affected by the change know how to tune CFQ is a bad idea.
> When CFQ misbehaves, most people just switch to deadline or no-op
> because they don't understand how CFQ works, nor what all the
> knobs do or which ones to tweak to solve their problem....

Ok, that's fair enough. Tuning CFQ is tricky but as it is, the default
performance is not great in comparison to older kernels and it's something
that has varied considerably over time. I'm surprised there have not been
more complaints but maybe I just missed them on the lists.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] cfq: Increase default value of target_latency
  2014-06-20 11:30       ` Mel Gorman
@ 2014-06-21  0:39         ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2014-06-21  0:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jeff Moyer, Linux Kernel, Linux-MM, Linux-FSDevel, Jan Kara,
	Johannes Weiner, Jens Axboe

On Fri, Jun 20, 2014 at 12:30:25PM +0100, Mel Gorman wrote:
> On Fri, Jun 20, 2014 at 07:42:14AM +1000, Dave Chinner wrote:
> > On Thu, Jun 19, 2014 at 02:38:44PM -0400, Jeff Moyer wrote:
> > > Mel Gorman <mgorman@suse.de> writes:
> > > 
> > > > The existing CFQ default target_latency results in very poor performance
> > > > for larger numbers of threads doing sequential reads.  While this can be
> > > > easily described as a tuning problem for users, it is one that is tricky
> > > > to detect. This patch increases the default on the assumption that people with access
> > > > to expensive fast storage also know how to tune their IO scheduler.
> > > >
> > > > The following is from tiobench run on a mid-range desktop with a single
> > > > spinning disk.
> > > >
> > > >                                       3.16.0-rc1            3.16.0-rc1                 3.0.0
> > > >                                          vanilla          cfq600                     vanilla
> > > > Mean   SeqRead-MB/sec-1         121.88 (  0.00%)      121.60 ( -0.23%)      134.59 ( 10.42%)
> > > > Mean   SeqRead-MB/sec-2         101.99 (  0.00%)      102.35 (  0.36%)      122.59 ( 20.20%)
> > > > Mean   SeqRead-MB/sec-4          97.42 (  0.00%)       99.71 (  2.35%)      114.78 ( 17.82%)
> > > > Mean   SeqRead-MB/sec-8          83.39 (  0.00%)       90.39 (  8.39%)      100.14 ( 20.09%)
> > > > Mean   SeqRead-MB/sec-16         68.90 (  0.00%)       77.29 ( 12.18%)       81.64 ( 18.50%)
> > > 
> > > Did you test any workloads other than this?  Also, what normal workload
> > > has 8 or more threads doing sequential reads?  (That's an honest
> > > question.)
> > 
> > I'd also suggest that making changes based on the assumption that
> > people affected by the change know how to tune CFQ is a bad idea.
> > When CFQ misbehaves, most people just switch to deadline or no-op
> > because they don't understand how CFQ works, nor what all the
> > knobs do or which ones to tweak to solve their problem....
> 
> Ok, that's fair enough. Tuning CFQ is tricky but as it is, the default
> performance is not great in comparison to older kernels and it's something
> that has varied considerably over time. I'm surprised there have not been
> more complaints but maybe I just missed them on the lists.

That's because there are widespread recommendations not to use CFQ
if you have any sort of significant storage or IO workload. We
specifically recommend that you don't use CFQ with XFS
because it does not play nicely with correlated multi-process
IO. This is something that happens a lot, even with single threaded
workloads.

e.g. a single fsync can issue dependent IOs from multiple
process contexts - the syscall process for data IO, the allocation
workqueue kworker for btree blocks, the xfsaild to push metadata to
disk to make space available for the allocation transaction, and
then the journal IO from the xfs log workqueue kworker.

That's four IOs, all from different process contexts, all of which
need to be dispatched and completed with minimal latency. With CFQ
adding scheduling and idling delays in the middle of this, it tends
to leave disks idle when they really should be doing work.

We also don't recommend using CFQ when you have hardware RAID with
caches, because the HW RAID does a much, much better job of
optimising and prioritising IO through its cache. Idling is
wrong if the cache has hardware readahead, because most subsequent
read IOs will hit the hardware cache. Hence you could be dispatching
other IO instead of idling, yet still get minimal IO latency across
multiple streams of different read workloads.

Hence people search on CFQ problems, see the "use deadline"
recommendations, change to deadline and see their IO workload going
faster. So they shrug their shoulders, set deadline as the
default, and move on to the next problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
