* [PATCH 0/5] Improve sequential read throughput v4r8
@ 2014-06-30 16:47 Mel Gorman
  2014-06-30 16:48 ` [PATCH 1/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 16:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

Changelog since V3
o Push down kswapd changes to cover the balance gap
o Drop the page distribution patch

Changelog since V2
o Simplify fair zone policy cost reduction
o Drop CFQ patch

Changelog since v1
o Rebase to v3.16-rc2
o Move CFQ patch to end of series where it can be rejected easier if necessary
o Introduce a page-reclaim patch related to kswapd/fairzone interactions
o Rework fast zone policy patch

IO performance since 3.0 has been a mixed bag. In many respects we are
better and in some we are worse and one of those places is sequential
read throughput. This is visible in a number of benchmarks but I looked
at tiobench the closest. This is using ext3 on a mid-range desktop with
the series applied.

                                      3.16.0-rc2                 3.0.0            3.16.0-rc2
                                         vanilla               vanilla         fairzone-v4r5
Min    SeqRead-MB/sec-1         120.92 (  0.00%)      133.65 ( 10.53%)      140.68 ( 16.34%)
Min    SeqRead-MB/sec-2         100.25 (  0.00%)      121.74 ( 21.44%)      118.13 ( 17.84%)
Min    SeqRead-MB/sec-4          96.27 (  0.00%)      113.48 ( 17.88%)      109.84 ( 14.10%)
Min    SeqRead-MB/sec-8          83.55 (  0.00%)       97.87 ( 17.14%)       89.62 (  7.27%)
Min    SeqRead-MB/sec-16         66.77 (  0.00%)       82.59 ( 23.69%)       70.49 (  5.57%)

Overall system CPU usage is reduced

          3.16.0-rc2       3.0.0  3.16.0-rc2
             vanilla     vanilla fairzone-v4
User          390.13      251.45      396.13
System        404.41      295.13      389.61
Elapsed      5412.45     5072.42     5163.49

This series does not fully restore throughput performance to 3.0 levels
but it brings it close for lower thread counts. Higher thread counts are
known to be worse than 3.0 due to CFQ changes but there is no appetite
for changing the defaults there.

 include/linux/mmzone.h         | 207 ++++++++++++++++++++++-------------------
 include/linux/swap.h           |   9 --
 include/trace/events/pagemap.h |  16 ++--
 mm/page_alloc.c                | 126 ++++++++++++++-----------
 mm/swap.c                      |   4 +-
 mm/vmscan.c                    |  46 ++++-----
 mm/vmstat.c                    |   4 +-
 7 files changed, 208 insertions(+), 204 deletions(-)

-- 
1.8.4.5



* [PATCH 1/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
@ 2014-06-30 16:48 ` Mel Gorman
  2014-06-30 16:48 ` [PATCH 2/4] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The LRU insertion and activate tracepoints take a PFN as a parameter,
which forces the page-to-PFN (and page flag) lookups onto the caller even
when the tracepoints are inactive.  Move that work into the tracepoint
fast-assign methods so the cost is only incurred when a tracepoint is
active.
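
For illustration, this is the shape of the pattern being applied. It is a
sketch only: "mm_example" is a made-up event name and the real definitions
are in the diff below.

TRACE_EVENT(mm_example,

	TP_PROTO(struct page *page),

	TP_ARGS(page),

	TP_STRUCT__entry(
		__field(struct page *,	page	)
		__field(unsigned long,	pfn	)
	),

	TP_fast_assign(
		__entry->page	= page;
		/* Computed only when the tracepoint actually fires */
		__entry->pfn	= page_to_pfn(page);
	),

	TP_printk("page=%p pfn=%lu", __entry->page, __entry->pfn)
);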

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5



* [PATCH 2/4] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
  2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
  2014-06-30 16:48 ` [PATCH 1/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
@ 2014-06-30 16:48 ` Mel Gorman
  2014-06-30 16:48 ` [PATCH 3/4] mm: vmscan: Do not reclaim from lower zones if they are balanced Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The arrangement of struct zone has changed over time and now it has reached the
point where there is some inappropriate sharing going on. On x86-64, for
example:

o The zone->node field shares a cache line with the zone lock and zone->node
  is accessed frequently from the page allocator due to the fair zone
  allocation policy.
o span_seqlock is almost never used but shares a cache line with free_area
o Some zone statistics share a cache line with the LRU lock so reclaim-intensive
  and allocator-intensive workloads can bounce the cache line on a stat update

This patch rearranges struct zone to put read-only and read-mostly fields
together and then splits the page allocator intensive fields, the zone
statistics and the page reclaim intensive fields into their own cache
lines. Note that arguably the biggest change is reducing the size of the
lowmem_reserve type (unsigned long to unsigned int). The smaller type should
still be large enough, and shrinking it lets all the fields used by the page
allocator fast path fit in one cache line.

On the test configuration I used, the overall size of struct zone shrank
by one cache line.
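
As a rough sketch of the layout idiom used in the diff below (simplified,
with a made-up struct name; ZONE_PADDING and
____cacheline_internodealigned_in_smp are the existing helpers), the
padding markers push each group of fields onto its own cache line so the
allocator-heavy and reclaim-heavy parts do not false-share:

struct zone_layout_sketch {
	/* Read-mostly fields: watermarks, lowmem_reserve, pageset, ... */
	unsigned long		watermark[NR_WMARK];

	ZONE_PADDING(_pad1_)

	/* Write-intensive page allocator fields */
	spinlock_t		lock;
	struct free_area	free_area[MAX_ORDER];

	ZONE_PADDING(_pad2_)

	/* Write-intensive page reclaim fields */
	spinlock_t		lru_lock;
	struct lruvec		lruvec;

	ZONE_PADDING(_pad3_)

	/* Zone statistics get a cache line of their own */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;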

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h | 201 +++++++++++++++++++++++++------------------------
 mm/page_alloc.c        |  13 ++--
 mm/vmstat.c            |   4 +-
 3 files changed, 113 insertions(+), 105 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cbd1b6..a2f6443 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -324,19 +324,12 @@ enum zone_type {
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
-	/* Fields commonly accessed by the page allocator */
+	/* Read-mostly fields */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
 	/*
-	 * When free pages are below this point, additional steps are taken
-	 * when reading the number of free pages to avoid per-cpu counter
-	 * drift allowing watermarks to be breached
-	 */
-	unsigned long percpu_drift_mark;
-
-	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -344,41 +337,17 @@ struct zone {
 	 * on the higher zones). This array is recalculated at runtime if the
 	 * sysctl_lowmem_reserve_ratio sysctl changes.
 	 */
-	unsigned long		lowmem_reserve[MAX_NR_ZONES];
-
-	/*
-	 * This is a per-zone reserve of pages that should not be
-	 * considered dirtyable memory.
-	 */
-	unsigned long		dirty_balance_reserve;
+	unsigned int lowmem_reserve[MAX_NR_ZONES];
 
+	struct per_cpu_pageset __percpu *pageset;
 #ifdef CONFIG_NUMA
 	int node;
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
 #endif
-	struct per_cpu_pageset __percpu *pageset;
 	/*
-	 * free areas of different sizes
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this zone's LRU.  Maintained by the pageout code.
 	 */
-	spinlock_t		lock;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
-	/* pfn where compaction free scanner should start */
-	unsigned long		compact_cached_free_pfn;
-	/* pfn where async and sync compaction migration scanner should start */
-	unsigned long		compact_cached_migrate_pfn[2];
-#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* see spanned/present_pages for more description */
-	seqlock_t		span_seqlock;
-#endif
-	struct free_area	free_area[MAX_ORDER];
+	unsigned int inactive_ratio;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
@@ -388,74 +357,37 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_COMPACTION
 	/*
-	 * On compaction failure, 1<<compact_defer_shift compactions
-	 * are skipped before trying again. The number attempted since
-	 * last failure is tracked with compact_considered.
+	 * This is a per-zone reserve of pages that should not be
+	 * considered dirtyable memory.
 	 */
-	unsigned int		compact_considered;
-	unsigned int		compact_defer_shift;
-	int			compact_order_failed;
-#endif
-
-	ZONE_PADDING(_pad1_)
-
-	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
-	struct lruvec		lruvec;
-
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
-	unsigned long		pages_scanned;	   /* since last reclaim */
-	unsigned long		flags;		   /* zone flags, see below */
-
-	/* Zone statistics */
-	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
+	unsigned long		dirty_balance_reserve;
 
 	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
 	 */
-	unsigned int inactive_ratio;
-
-
-	ZONE_PADDING(_pad2_)
-	/* Rarely used or read-mostly fields */
+	unsigned long percpu_drift_mark;
 
+#ifdef CONFIG_NUMA
 	/*
-	 * wait_table		-- the array holding the hash table
-	 * wait_table_hash_nr_entries	-- the size of the hash table array
-	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
-	 *
-	 * The purpose of all these is to keep track of the people
-	 * waiting for a page to become available and make them
-	 * runnable again when possible. The trouble is that this
-	 * consumes a lot of space, especially when so few things
-	 * wait on pages at a given time. So instead of using
-	 * per-page waitqueues, we use a waitqueue hash table.
-	 *
-	 * The bucket discipline is to sleep on the same queue when
-	 * colliding and wake all in that wait queue when removing.
-	 * When something wakes, it must check to be sure its page is
-	 * truly available, a la thundering herd. The cost of a
-	 * collision is great, but given the expected load of the
-	 * table, they should be so rare as to be outweighed by the
-	 * benefits from the saved space.
-	 *
-	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
-	 * primary users of these fields, and in mm/page_alloc.c
-	 * free_area_init_core() performs the initialization of them.
+	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	wait_queue_head_t	* wait_table;
-	unsigned long		wait_table_hash_nr_entries;
-	unsigned long		wait_table_bits;
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
+	const char		*name;
 
 	/*
-	 * Discontig memory support fields.
+	 * Number of MIGRATE_RESEVE page block. To maintain for just
+	 * optimization. Protected by zone->lock.
 	 */
+	int			nr_migrate_reserve_block;
+
 	struct pglist_data	*zone_pgdat;
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -504,16 +436,89 @@ struct zone {
 	unsigned long		present_pages;
 	unsigned long		managed_pages;
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/* see spanned/present_pages for more description */
+	seqlock_t		span_seqlock;
+#endif
+
 	/*
-	 * Number of MIGRATE_RESEVE page block. To maintain for just
-	 * optimization. Protected by zone->lock.
+	 * wait_table		-- the array holding the hash table
+	 * wait_table_hash_nr_entries	-- the size of the hash table array
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
+	 *
+	 * The purpose of all these is to keep track of the people
+	 * waiting for a page to become available and make them
+	 * runnable again when possible. The trouble is that this
+	 * consumes a lot of space, especially when so few things
+	 * wait on pages at a given time. So instead of using
+	 * per-page waitqueues, we use a waitqueue hash table.
+	 *
+	 * The bucket discipline is to sleep on the same queue when
+	 * colliding and wake all in that wait queue when removing.
+	 * When something wakes, it must check to be sure its page is
+	 * truly available, a la thundering herd. The cost of a
+	 * collision is great, but given the expected load of the
+	 * table, they should be so rare as to be outweighed by the
+	 * benefits from the saved space.
+	 *
+	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
+	 * primary users of these fields, and in mm/page_alloc.c
+	 * free_area_init_core() performs the initialization of them.
 	 */
-	int			nr_migrate_reserve_block;
+	wait_queue_head_t	*wait_table;
+	unsigned long		wait_table_hash_nr_entries;
+	unsigned long		wait_table_bits;
+
+	ZONE_PADDING(_pad1_)
+
+	/* Write-intensive fields used from the page allocator */
+	spinlock_t		lock;
+
+	/* free areas of different sizes */
+	struct free_area	free_area[MAX_ORDER];
+
+	/* zone flags, see below */
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
+	/* Write-intensive fields used by page reclaim */
+
+	/* Fields commonly accessed by the page reclaim scanner */
+	spinlock_t		lru_lock;
+	struct lruvec		lruvec;
+
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
+	unsigned long		pages_scanned;	   /* since last reclaim */
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where compaction free scanner should start */
+	unsigned long		compact_cached_free_pfn;
+	/* pfn where async and sync compaction migration scanner should start */
+	unsigned long		compact_cached_migrate_pfn[2];
+#endif
 
+#ifdef CONFIG_COMPACTION
 	/*
-	 * rarely used fields:
+	 * On compaction failure, 1<<compact_defer_shift compactions
+	 * are skipped before trying again. The number attempted since
+	 * last failure is tracked with compact_considered.
 	 */
-	const char		*name;
+	unsigned int		compact_considered;
+	unsigned int		compact_defer_shift;
+	int			compact_order_failed;
+#endif
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
+#endif
+
+	ZONE_PADDING(_pad3_)
+	/* Zone statistics */
+	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..ebbdbcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1699,7 +1699,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long lowmem_reserve = z->lowmem_reserve[classzone_idx];
 	int o;
 	long free_cma = 0;
 
@@ -1714,7 +1713,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + lowmem_reserve)
+	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -3245,7 +3244,7 @@ void show_free_areas(unsigned int filter)
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
-			printk(" %lu", zone->lowmem_reserve[i]);
+			printk(" %u", zone->lowmem_reserve[i]);
 		printk("\n");
 	}
 
@@ -5566,7 +5565,7 @@ static void calculate_totalreserve_pages(void)
 	for_each_online_pgdat(pgdat) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			unsigned long max = 0;
+			unsigned int max = 0;
 
 			/* Find valid and maximum lowmem_reserve in the zone */
 			for (j = i; j < MAX_NR_ZONES; j++) {
@@ -5617,6 +5616,7 @@ static void setup_per_zone_lowmem_reserve(void)
 			idx = j;
 			while (idx) {
 				struct zone *lower_zone;
+				unsigned long reserve;
 
 				idx--;
 
@@ -5624,8 +5624,11 @@ static void setup_per_zone_lowmem_reserve(void)
 					sysctl_lowmem_reserve_ratio[idx] = 1;
 
 				lower_zone = pgdat->node_zones + idx;
-				lower_zone->lowmem_reserve[j] = managed_pages /
+				reserve = managed_pages /
 					sysctl_lowmem_reserve_ratio[idx];
+				if (WARN_ON(reserve > UINT_MAX))
+					reserve = UINT_MAX;
+				lower_zone->lowmem_reserve[j] = reserve;
 				managed_pages += lower_zone->managed_pages;
 			}
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49..c6d6fae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1077,10 +1077,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 				zone_page_state(zone, i));
 
 	seq_printf(m,
-		   "\n        protection: (%lu",
+		   "\n        protection: (%u",
 		   zone->lowmem_reserve[0]);
 	for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
-		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+		seq_printf(m, ", %u", zone->lowmem_reserve[i]);
 	seq_printf(m,
 		   ")"
 		   "\n  pagesets");
-- 
1.8.4.5



* [PATCH 3/4] mm: vmscan: Do not reclaim from lower zones if they are balanced
  2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
  2014-06-30 16:48 ` [PATCH 1/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
  2014-06-30 16:48 ` [PATCH 2/4] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
@ 2014-06-30 16:48 ` Mel Gorman
  2014-06-30 16:48 ` [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
  2014-07-01 17:16 ` [PATCH 0/5] Improve sequential read throughput v4r8 Johannes Weiner
  4 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

Historically kswapd scanned zones from DMA->Movable, in the opposite
direction to the page allocator, to avoid the allocator allocating behind
kswapd's direction of progress. The fair zone allocation policy altered
this in a non-obvious manner.

Traditionally, the page allocator prefers to use the highest eligible
zones in order until the low watermarks are reached and then wakes kswapd.
Once kswapd is awake, it scans zones in the opposite direction so the
scanning lists on 64-bit look like this:

Page alloc		Kswapd
----------              ------
Movable			DMA
Normal			DMA32
DMA32			Normal
DMA			Movable

If kswapd scanned in the same direction as the page allocator, it could
end up proportionally reclaiming lower zones that were never used because
the page allocator was always allocating behind the reclaim. This would
work as follows:

	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32
	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32

The introduction of the fair zone allocation policy fundamentally altered
this problem by interleaving between zones until the low watermark is
reached. There are at least two issues with this:

o The page allocator can allocate behind kswapd's progress (kswapd
  scans/reclaims a lower zone and the fair zone allocation policy then
  uses those pages)
o When the low watermark of the higher zone is reached there may be
  recently allocated pages from the lower zone, but as kswapd scans
  dma->highmem up to the highest zone needing balancing it will reclaim
  the lower zone even if it is already balanced.

Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
some percentage will be allocated from Normal and some from DMA32. The
percentage depends on the ratio of the zone sizes and when their watermarks
were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
is unbalanced, only DMA32 will be shrunk. This leads to a difference in
ages between DMA32 and Normal. Relatively young pages are then continually
rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
Some of these pages may be recently read-ahead pages that must then be
re-read from disk, impacting overall performance.

The problem is fundamental to the fact we have per-zone LRU and allocation
policies and ideally we would only have per-node allocation and LRU lists.
This would avoid the need for the fair zone allocation policy but the
low-memory-starvation issue would have to be addressed again from scratch.

Currently kswapd shrinks all zones equally, up to the high watermark plus
a balance gap and the lowmem reserves. This patch removes the balance gap
and the additional reclaim from lower zones that are already balanced, on
the grounds that the fair zone allocation policy will typically be
interleaving allocations between the zones.  This should not break normal
page aging as the proportional allocations due to the fair zone allocation
policy should compensate.
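
Condensed from the diff below, the balance test becomes a bare high
watermark check with no balance gap and no classzone index (the
compaction_suitable() check for order > 0 requests is omitted from this
sketch):

static bool zone_balanced(struct zone *zone, int order)
{
	return zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
				      0, 0);
}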

tiobench was used to evaluate this because it includes a simple
sequential reader, which shows the most obvious regression. It also has
threaded readers that produce reasonably steady figures.

                                      3.16.0-rc2                 3.0.0            3.16.0-rc2
                                         vanilla               vanilla           checklow-v4
Min    SeqRead-MB/sec-1         120.92 (  0.00%)      133.65 ( 10.53%)      140.64 ( 16.31%)
Min    SeqRead-MB/sec-2         100.25 (  0.00%)      121.74 ( 21.44%)      117.67 ( 17.38%)
Min    SeqRead-MB/sec-4          96.27 (  0.00%)      113.48 ( 17.88%)      107.56 ( 11.73%)
Min    SeqRead-MB/sec-8          83.55 (  0.00%)       97.87 ( 17.14%)       88.08 (  5.42%)
Min    SeqRead-MB/sec-16         66.77 (  0.00%)       82.59 ( 23.69%)       71.04 (  6.40%)

There are still regressions for higher numbers of threads but those are
related to changes in the CFQ IO scheduler.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/swap.h |  9 ---------
 mm/vmscan.c          | 46 ++++++++++++++++------------------------------
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bdbee8..1680307 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -165,15 +165,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..3e315c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2294,7 +2294,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 /* Returns true if compaction should go ahead for a high-order request */
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/* Do not consider compaction for orders reclaim is meant to satisfy */
@@ -2307,9 +2307,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+	watermark = high_wmark_pages(zone) + (2UL << sc->order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
 
 	/*
@@ -2816,11 +2814,9 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			  unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx, 0))
+	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0))
 		return false;
 
 	if (IS_ENABLED(CONFIG_COMPACTION) && order &&
@@ -2877,7 +2873,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -2934,7 +2930,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       unsigned long *nr_attempted)
 {
 	int testorder = sc->order;
-	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
@@ -2956,21 +2951,11 @@ static bool kswapd_shrink_zone(struct zone *zone,
 		testorder = 0;
 
 	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-
-	/*
 	 * If there is no low memory pressure or the zone is balanced then no
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
-						balance_gap, classzone_idx))
+	if (!lowmem_pressure && zone_balanced(zone, testorder))
 		return true;
 
 	shrink_zone(zone, sc);
@@ -2993,7 +2978,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, testorder)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
 	}
@@ -3079,7 +3064,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3124,12 +3109,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * at the last zone which needs scanning. We do this because
+		 * the page allocators prefers to work in the opposite
+		 * direction and we want to avoid the page allocator reclaiming
+		 * behind kswapd's direction of progress. Due to the fair zone
+		 * allocation policy interleaving allocations between zones
+		 * we no longer proportionally scan the lower zones if the
+		 * watermarks are ok.
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -3397,7 +3383,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
1.8.4.5



* [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
                   ` (2 preceding siblings ...)
  2014-06-30 16:48 ` [PATCH 3/4] mm: vmscan: Do not reclaim from lower zones if they are balanced Mel Gorman
@ 2014-06-30 16:48 ` Mel Gorman
  2014-06-30 21:14   ` Andrew Morton
  2014-07-01 17:16 ` [PATCH 0/5] Improve sequential read throughput v4r8 Johannes Weiner
  4 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does not
hold when zones reach their low watermark as the batches will be consumed
at uneven rates.  Allocation failure due to watermark depletion results in
additional zonelist scans for the reset and another watermark check before
hitting the slowpath.

This patch makes a number of changes that should reduce the overall cost:

o Abort the fair zone allocation policy once remote zones are encountered
o Use a simpler scan when resetting NR_ALLOC_BATCH
o Use a simple flag to identify depleted zones instead of accessing a
  potentially write-intensive cache line for counters (see the sketch below)
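
A condensed sketch of the resulting fast path in get_page_from_freelist(),
taken from the diff below with the surrounding checks omitted:

	if (alloc_flags & ALLOC_FAIR) {
		/* The fair pass stops at the first remote zone */
		if (!zone_local(preferred_zone, zone))
			break;

		/* Flag test instead of reading the NR_ALLOC_BATCH counter */
		if (zone_is_fair_depleted(zone)) {
			nr_fair_skipped++;
			continue;
		}
	}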

On UMA machines, the effect on overall performance is marginal. The main
impact is on system CPU usage which is small enough on UMA to begin with.
This comparison shows the system CPU usage between vanilla, the previous
patch and this patch.

          3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
             vanilla checklow-v4 fairzone-v4
User          390.13      400.85      396.13
System        404.41      393.60      389.61
Elapsed      5412.45     5166.12     5163.49

There is a small reduction and it appears consistent.

On NUMA machines, the scanning overhead is higher as zones are scanned
that are ineligible for use by the fair zone allocation policy. This patch
fixes the zone-order zonelist policy and reduces the number of zones
scanned by the allocator, leading to an overall reduction in CPU usage.

          3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
             vanilla checklow-v4 fairzone-v4
User          744.05      763.26      778.53
System      70148.60    49331.48    44905.73
Elapsed     28094.08    27476.72    27378.98

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   6 +++
 mm/page_alloc.c        | 113 ++++++++++++++++++++++++++++---------------------
 2 files changed, 70 insertions(+), 49 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2f6443..9e9ddc4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,6 +534,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -571,6 +572,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebbdbcd..3fbe995 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1597,6 +1597,8 @@ again:
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
+		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1908,6 +1910,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		if (!zone_is_fair_depleted(zone))
+			continue;
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1925,8 +1941,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1950,9 +1969,12 @@ zonelist_scan:
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
+				break;
+
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
+			}
 		}
 		/*
 		 * When allocating a page cache page for writing, we
@@ -2058,13 +2080,7 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2073,8 +2089,37 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	/*
+	 * The first pass makes sure allocations are spread fairly within the
+	 * local node.  However, the local node might have free pages left
+	 * after the fairness batches are exhausted, and remote zones haven't
+	 * even been considered yet.  Try once more without fairness, and
+	 * include remote zones now, before entering the slowpath and waking
+	 * kswapd: prefer spilling to a remote zone over swapping locally.
+	 */
+	if (alloc_flags & ALLOC_FAIR) {
+		alloc_flags &= ~ALLOC_FAIR;
+		if (nr_fair_skipped) {
+			zonelist_rescan = true;
+			reset_alloc_batches(preferred_zone);
+		}
+		if (nr_online_nodes > 1)
+			zonelist_rescan = true;
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2395,28 +2440,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2714,6 +2737,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
 	struct zoneref *preferred_zoneref;
+	struct zoneref *second_zoneref;
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
@@ -2748,33 +2772,24 @@ retry_cpuset:
 		goto out;
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
+	/*
+	 * Only apply the fair zone allocation policy if there is more than
+	 * one local eligible zone
+	 */
+	second_zoneref = preferred_zoneref + 1;
+	if (zonelist_zone(second_zoneref) &&
+	    zone_local(preferred_zone, zonelist_zone(second_zoneref)))
+		alloc_flags |= ALLOC_FAIR;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
-- 
1.8.4.5



* Re: [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-30 16:48 ` [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
@ 2014-06-30 21:14   ` Andrew Morton
  2014-06-30 21:51     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-06-30 21:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On Mon, 30 Jun 2014 17:48:03 +0100 Mel Gorman <mgorman@suse.de> wrote:

> The fair zone allocation policy round-robins allocations between zones
> within a node to avoid age inversion problems during reclaim. If the
> first allocation fails, the batch counts are reset and a second attempt
> made before entering the slow path.
> 
> One assumption made with this scheme is that batches expire at roughly the
> same time and the resets each time are justified. This assumption does not
> hold when zones reach their low watermark as the batches will be consumed
> at uneven rates.  Allocation failure due to watermark depletion results in
> additional zonelist scans for the reset and another watermark check before
> hitting the slowpath.
> 
> This patch makes a number of changes that should reduce the overall cost
> 
> o Abort the fair zone allocation policy once remote zones are encountered
> o Use a simpler scan when resetting NR_ALLOC_BATCH
> o Use a simple flag to identify depleted zones instead of accessing a
>   potentially write-intensive cache line for counters
> 
> On UMA machines, the effect on overall performance is marginal. The main
> impact is on system CPU usage which is small enough on UMA to begin with.
> This comparison shows the system CPU usage between vanilla, the previous
> patch and this patch.
> 
>           3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
>              vanilla checklow-v4 fairzone-v4
> User          390.13      400.85      396.13
> System        404.41      393.60      389.61
> Elapsed      5412.45     5166.12     5163.49
> 
> There is a small reduction and it appears consistent.
> 
> On NUMA machines, the scanning overhead is higher as zones are scanned
> that are ineligible for use by the fair zone allocation policy. This patch
> fixes the zone-order zonelist policy and reduces the number of zones
> scanned by the allocator, leading to an overall reduction in CPU usage.
> 
>           3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
>              vanilla checklow-v4 fairzone-v4
> User          744.05      763.26      778.53
> System      70148.60    49331.48    44905.73
> Elapsed     28094.08    27476.72    27378.98

That's a large change in system time.  Does this all include kswapd
activity?



* Re: [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-30 21:14   ` Andrew Morton
@ 2014-06-30 21:51     ` Mel Gorman
  2014-06-30 22:09       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2014-06-30 21:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On Mon, Jun 30, 2014 at 02:14:04PM -0700, Andrew Morton wrote:
> On Mon, 30 Jun 2014 17:48:03 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > The fair zone allocation policy round-robins allocations between zones
> > within a node to avoid age inversion problems during reclaim. If the
> > first allocation fails, the batch counts are reset and a second attempt
> > made before entering the slow path.
> > 
> > One assumption made with this scheme is that batches expire at roughly the
> > same time and the resets each time are justified. This assumption does not
> > hold when zones reach their low watermark as the batches will be consumed
> > at uneven rates.  Allocation failure due to watermark depletion results in
> > additional zonelist scans for the reset and another watermark check before
> > hitting the slowpath.
> > 
> > This patch makes a number of changes that should reduce the overall cost
> > 
> > o Abort the fair zone allocation policy once remote zones are encountered
> > o Use a simpler scan when resetting NR_ALLOC_BATCH
> > o Use a simple flag to identify depleted zones instead of accessing a
> >   potentially write-intensive cache line for counters
> > 
> > On UMA machines, the effect on overall performance is marginal. The main
> > impact is on system CPU usage which is small enough on UMA to begin with.
> > This comparison shows the system CPU usage between vanilla, the previous
> > patch and this patch.
> > 
> >           3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
> >              vanilla checklow-v4 fairzone-v4
> > User          390.13      400.85      396.13
> > System        404.41      393.60      389.61
> > Elapsed      5412.45     5166.12     5163.49
> > 
> > There is a small reduction and it appears consistent.
> > 
> > On NUMA machines, the scanning overhead is higher as zones are scanned
> > that are ineligible for use by the fair zone allocation policy. This patch
> > fixes the zone-order zonelist policy and reduces the number of zones
> > scanned by the allocator, leading to an overall reduction in CPU usage.
> > 
> >           3.16.0-rc2  3.16.0-rc2  3.16.0-rc2
> >              vanilla checklow-v4 fairzone-v4
> > User          744.05      763.26      778.53
> > System      70148.60    49331.48    44905.73
> > Elapsed     28094.08    27476.72    27378.98
> 
> That's a large change in system time.  Does this all include kswapd
> activity?
> 

I don't have a profile to quantify that exactly. It takes 7 hours to
complete a test on that machine in this configuration and it would take
longer with profiling. I was not testing with profiling enabled as that
invalidates performance tests. I'd expect it'd take the guts of two days
to gather full profiles for it and even then it would be masked by remote
access costs and other factors. It'd be worse considering that automatic
NUMA balancing is enabled and I normally test with that turned on.

However, without the kswapd change there are a lot of retries and
reallocations for pages recently reclaimed. For the fairzone patch there
are far fewer scans of unusable zones to find the lower zones. Considering
the number of allocations required there is simply a lot of overhead that
builds up.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-30 21:51     ` Mel Gorman
@ 2014-06-30 22:09       ` Andrew Morton
  2014-07-01  8:02         ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-06-30 22:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On Mon, 30 Jun 2014 22:51:21 +0100 Mel Gorman <mgorman@suse.de> wrote:

> > That's a large change in system time.  Does this all include kswapd
> > activity?
> > 
> 
> I don't have a profile to quantify that exactly. It takes 7 hours to
> complete a test on that machine in this configuration

That's nuts.  Why should measuring this require more than a few minutes?


* Re: [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-30 22:09       ` Andrew Morton
@ 2014-07-01  8:02         ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-07-01  8:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On Mon, Jun 30, 2014 at 03:09:14PM -0700, Andrew Morton wrote:
> On Mon, 30 Jun 2014 22:51:21 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > > That's a large change in system time.  Does this all include kswapd
> > > activity?
> > > 
> > 
> > I don't have a profile to quantify that exactly. It takes 7 hours to
> > complete a test on that machine in this configuration
> 
> That's nuts.  Why should measuring this require more than a few minutes?

That's how long the full test takes to complete for each part of the IO
test. Profiling a subsection of it will miss some parts with no
guarantee the sampled subset is representative. Profiling for smaller
amounts of IO so the test completes quickly does not guarantee that the
sample is representative. Reducing the size of memory of the machine
using any tricks is also not representative etc.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
                   ` (3 preceding siblings ...)
  2014-06-30 16:48 ` [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
@ 2014-07-01 17:16 ` Johannes Weiner
  2014-07-01 18:39   ` Mel Gorman
  4 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2014-07-01 17:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Mon, Jun 30, 2014 at 05:47:59PM +0100, Mel Gorman wrote:
> Changelog since V3
> o Push down kswapd changes to cover the balance gap
> o Drop the page distribution patch
> 
> Changelog since V2
> o Simplify fair zone policy cost reduction
> o Drop CFQ patch
> 
> Changelog since v1
> o Rebase to v3.16-rc2
> o Move CFQ patch to end of series where it can be rejected easier if necessary
> o Introduce a page-reclaim patch related to kswapd/fairzone interactions
> o Rework fast zone policy patch
> 
> IO performance since 3.0 has been a mixed bag. In many respects we are
> better and in some we are worse and one of those places is sequential
> read throughput. This is visible in a number of benchmarks but I looked
> at tiobench the closest. This is using ext3 on a mid-range desktop with
> the series applied.
> 
>                                       3.16.0-rc2                 3.0.0            3.16.0-rc2
>                                          vanilla               vanilla         fairzone-v4r5
> Min    SeqRead-MB/sec-1         120.92 (  0.00%)      133.65 ( 10.53%)      140.68 ( 16.34%)
> Min    SeqRead-MB/sec-2         100.25 (  0.00%)      121.74 ( 21.44%)      118.13 ( 17.84%)
> Min    SeqRead-MB/sec-4          96.27 (  0.00%)      113.48 ( 17.88%)      109.84 ( 14.10%)
> Min    SeqRead-MB/sec-8          83.55 (  0.00%)       97.87 ( 17.14%)       89.62 (  7.27%)
> Min    SeqRead-MB/sec-16         66.77 (  0.00%)       82.59 ( 23.69%)       70.49 (  5.57%)
> 
> Overall system CPU usage is reduced
> 
>           3.16.0-rc2       3.0.0  3.16.0-rc2
>              vanilla     vanilla fairzone-v4
> User          390.13      251.45      396.13
> System        404.41      295.13      389.61
> Elapsed      5412.45     5072.42     5163.49
> 
> This series does not fully restore throughput performance to 3.0 levels
> but it brings it close for lower thread counts. Higher thread counts are
> known to be worse than 3.0 due to CFQ changes but there is no appetite
> for changing the defaults there.

I ran tiobench locally and here are the results:

tiobench MB/sec
                                        3.16-rc1              3.16-rc1
                                                           seqreadv4r8
Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)
Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)
Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)
Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)
Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)
Mean   RandRead-MB/sec-1          1.14 (  0.00%)        1.11 ( -2.35%)
Mean   RandRead-MB/sec-2          1.30 (  0.00%)        1.25 ( -3.85%)
Mean   RandRead-MB/sec-4          1.50 (  0.00%)        1.46 ( -2.23%)
Mean   RandRead-MB/sec-8          1.72 (  0.00%)        1.60 ( -6.96%)
Mean   RandRead-MB/sec-16         1.72 (  0.00%)        1.69 ( -2.13%)

Seqread throughput is up, randread takes a small hit.  But allocation
latency is badly screwed at higher concurrency levels:

tiobench Maximum Latency
                                            3.16-rc1              3.16-rc1
                                                               seqreadv4r8
Mean   SeqRead-MaxLatency-1          77.23 (  0.00%)       57.69 ( 25.30%)
Mean   SeqRead-MaxLatency-2         228.80 (  0.00%)      218.50 (  4.50%)
Mean   SeqRead-MaxLatency-4         329.58 (  0.00%)      325.93 (  1.11%)
Mean   SeqRead-MaxLatency-8         485.13 (  0.00%)      475.35 (  2.02%)
Mean   SeqRead-MaxLatency-16        599.10 (  0.00%)      637.89 ( -6.47%)
Mean   RandRead-MaxLatency-1         66.98 (  0.00%)       18.21 ( 72.81%)
Mean   RandRead-MaxLatency-2        132.88 (  0.00%)      119.61 (  9.98%)
Mean   RandRead-MaxLatency-4        222.95 (  0.00%)      213.82 (  4.10%)
Mean   RandRead-MaxLatency-8        982.99 (  0.00%)     1009.71 ( -2.72%)
Mean   RandRead-MaxLatency-16       515.24 (  0.00%)     1883.82 (-265.62%)
Mean   SeqWrite-MaxLatency-1        239.78 (  0.00%)      233.61 (  2.57%)
Mean   SeqWrite-MaxLatency-2        517.85 (  0.00%)      413.39 ( 20.17%)
Mean   SeqWrite-MaxLatency-4        249.10 (  0.00%)      416.33 (-67.14%)
Mean   SeqWrite-MaxLatency-8        629.31 (  0.00%)      851.62 (-35.33%)
Mean   SeqWrite-MaxLatency-16       987.05 (  0.00%)     1080.92 ( -9.51%)
Mean   RandWrite-MaxLatency-1         0.01 (  0.00%)        0.01 (  0.00%)
Mean   RandWrite-MaxLatency-2         0.02 (  0.00%)        0.02 (  0.00%)
Mean   RandWrite-MaxLatency-4         0.02 (  0.00%)        0.02 (  0.00%)
Mean   RandWrite-MaxLatency-8         1.83 (  0.00%)        1.96 ( -6.73%)
Mean   RandWrite-MaxLatency-16        1.52 (  0.00%)        1.33 ( 12.72%)

Zone fairness is completely gone.  The overall allocation distribution
on this system goes from 40%/60% to 10%/90%, and during the workload
the DMA32 zone is not used *at all*:

                              3.16-rc1    3.16-rc1
                                       seqreadv4r8
Zone normal velocity         11358.492   17996.733
Zone dma32 velocity           8213.852       0.000

Both negative effects stem from kswapd suddenly ignoring the classzone
index while the page allocator respects it: the page allocator will
keep the low wmark + lowmem reserves in DMA32 free, but kswapd won't
reclaim in there until it drops down to the high watermark.  The low
watermark + lowmem reserve is usually bigger than the high watermark,
so you effectively disable kswapd service in DMA32 for user requests.
The zone is then no longer used until it fills with enough kernel
pages to trigger kswapd, or the workload goes into direct reclaim.
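
To put invented numbers on it: if DMA32 had a low watermark of 16M, a high
watermark of 20M and a lowmem reserve of 500M against these requests, the
allocator would stop taking pages from DMA32 once free memory there fell
below roughly 516M, while kswapd would see nothing to do until it fell
below 20M, leaving most of that headroom unused.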

The classzone change is a non-sensical change IMO, and there is no
useful description of it to be found in the changelog.  But for the
given tests it appears to be the only change in the entire series to
make a measurable difference; reverting it gets me back to baseline:

tiobench MB/sec
                                        3.16-rc1              3.16-rc1              3.16-rc1
                                                           seqreadv4r8  seqreadv4r8classzone
Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)      129.72 (  0.05%)
Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)      115.61 ( -0.11%)
Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)      110.15 ( -0.06%)
Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)      102.15 (  0.44%)
Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)       86.63 (  0.20%)

            3.16-rc1    3.16-rc1    3.16-rc1
                     seqreadv4r8  seqreadv4r8classzone
User          272.45      277.17      272.23
System        197.89      186.30      193.73
Elapsed      4589.17     4356.23     4584.57

                              3.16-rc1    3.16-rc1    3.16-rc1
                                        seqreadv4r8  seqreadv4r8classzone
Zone normal velocity         11358.492   17996.733   12695.547
Zone dma32 velocity           8213.852       0.000    6891.421

Please stop making multiple logical changes in a single patch/testing
unit.  This will make it easier to verify them, and hopefully make it
also more obvious if individual changes are underdocumented.  As it
stands, it's hard to impossible to verify the implementation when the
intentions are not fully documented.  Performance results can only do
so much.  They are meant to corroborate the model, not replace it.

And again, if you change the way zone fairness works, please always
include the zone velocity numbers or allocation numbers to show that
your throughput improvements don't just come from completely wrecking
fairness - or in this case from disabling an entire zone.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 17:16 ` [PATCH 0/5] Improve sequential read throughput v4r8 Johannes Weiner
@ 2014-07-01 18:39   ` Mel Gorman
  2014-07-01 20:58     ` Mel Gorman
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Mel Gorman @ 2014-07-01 18:39 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Tue, Jul 01, 2014 at 01:16:11PM -0400, Johannes Weiner wrote:
> On Mon, Jun 30, 2014 at 05:47:59PM +0100, Mel Gorman wrote:
> > Changelog since V3
> > o Push down kwapd changes to cover the balance gap
> > o Drop drop page distribution patch
> > 
> > Changelog since V2
> > o Simply fair zone policy cost reduction
> > o Drop CFQ patch
> > 
> > Changelog since v1
> > o Rebase to v3.16-rc2
> > o Move CFQ patch to end of series where it can be rejected easier if necessary
> > o Introduce page-reclaim related patch related to kswapd/fairzone interactions
> > o Rework fast zone policy patch
> > 
> > IO performance since 3.0 has been a mixed bag. In many respects we are
> > better and in some we are worse and one of those places is sequential
> > read throughput. This is visible in a number of benchmarks but I looked
> > at tiobench the closest. This is using ext3 on a mid-range desktop and
> > the series applied.
> > 
> >                                       3.16.0-rc2                 3.0.0            3.16.0-rc2
> >                                          vanilla               vanilla         fairzone-v4r5
> > Min    SeqRead-MB/sec-1         120.92 (  0.00%)      133.65 ( 10.53%)      140.68 ( 16.34%)
> > Min    SeqRead-MB/sec-2         100.25 (  0.00%)      121.74 ( 21.44%)      118.13 ( 17.84%)
> > Min    SeqRead-MB/sec-4          96.27 (  0.00%)      113.48 ( 17.88%)      109.84 ( 14.10%)
> > Min    SeqRead-MB/sec-8          83.55 (  0.00%)       97.87 ( 17.14%)       89.62 (  7.27%)
> > Min    SeqRead-MB/sec-16         66.77 (  0.00%)       82.59 ( 23.69%)       70.49 (  5.57%)
> > 
> > Overall system CPU usage is reduced
> > 
> >           3.16.0-rc2       3.0.0  3.16.0-rc2
> >              vanilla     vanilla fairzone-v4
> > User          390.13      251.45      396.13
> > System        404.41      295.13      389.61
> > Elapsed      5412.45     5072.42     5163.49
> > 
> > This series does not fully restore throughput performance to 3.0 levels
> > but it brings it close for lower thread counts. Higher thread counts are
> > known to be worse than 3.0 due to CFQ changes but there is no appetite
> > for changing the defaults there.
> 
> I ran tiobench locally and here are the results:
> 
> tiobench MB/sec
>                                         3.16-rc1              3.16-rc1
>                                                            seqreadv4r8
> Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)
> Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)
> Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)
> Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)
> Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)
> Mean   RandRead-MB/sec-1          1.14 (  0.00%)        1.11 ( -2.35%)
> Mean   RandRead-MB/sec-2          1.30 (  0.00%)        1.25 ( -3.85%)
> Mean   RandRead-MB/sec-4          1.50 (  0.00%)        1.46 ( -2.23%)
> Mean   RandRead-MB/sec-8          1.72 (  0.00%)        1.60 ( -6.96%)
> Mean   RandRead-MB/sec-16         1.72 (  0.00%)        1.69 ( -2.13%)
> 
> Seqread throughput is up, randread takes a small hit.  But allocation
> latency is badly screwed at higher concurrency levels:
> 

So the results are roughly similar. You don't state which filesystem it
is, but FWIW, if it's the ext3 filesystem using the ext4 driver then
throughput at higher concurrency levels is also affected by filesystem
fragmentation. That problem is outside the scope of this series.

> tiobench Maximum Latency
>                                             3.16-rc1              3.16-rc1
>                                                                seqreadv4r8
> Mean   SeqRead-MaxLatency-1          77.23 (  0.00%)       57.69 ( 25.30%)
> Mean   SeqRead-MaxLatency-2         228.80 (  0.00%)      218.50 (  4.50%)
> Mean   SeqRead-MaxLatency-4         329.58 (  0.00%)      325.93 (  1.11%)
> Mean   SeqRead-MaxLatency-8         485.13 (  0.00%)      475.35 (  2.02%)
> Mean   SeqRead-MaxLatency-16        599.10 (  0.00%)      637.89 ( -6.47%)
> Mean   RandRead-MaxLatency-1         66.98 (  0.00%)       18.21 ( 72.81%)
> Mean   RandRead-MaxLatency-2        132.88 (  0.00%)      119.61 (  9.98%)
> Mean   RandRead-MaxLatency-4        222.95 (  0.00%)      213.82 (  4.10%)
> Mean   RandRead-MaxLatency-8        982.99 (  0.00%)     1009.71 ( -2.72%)
> Mean   RandRead-MaxLatency-16       515.24 (  0.00%)     1883.82 (-265.62%)
> Mean   SeqWrite-MaxLatency-1        239.78 (  0.00%)      233.61 (  2.57%)
> Mean   SeqWrite-MaxLatency-2        517.85 (  0.00%)      413.39 ( 20.17%)
> Mean   SeqWrite-MaxLatency-4        249.10 (  0.00%)      416.33 (-67.14%)
> Mean   SeqWrite-MaxLatency-8        629.31 (  0.00%)      851.62 (-35.33%)
> Mean   SeqWrite-MaxLatency-16       987.05 (  0.00%)     1080.92 ( -9.51%)
> Mean   RandWrite-MaxLatency-1         0.01 (  0.00%)        0.01 (  0.00%)
> Mean   RandWrite-MaxLatency-2         0.02 (  0.00%)        0.02 (  0.00%)
> Mean   RandWrite-MaxLatency-4         0.02 (  0.00%)        0.02 (  0.00%)
> Mean   RandWrite-MaxLatency-8         1.83 (  0.00%)        1.96 ( -6.73%)
> Mean   RandWrite-MaxLatency-16        1.52 (  0.00%)        1.33 ( 12.72%)
> 
> Zone fairness is completely gone.  The overall allocation distribution
> on this system goes from 40%/60% to 10%/90%, and during the workload
> the DMA32 zone is not used *at all*:
> 

Zone fairness gets effectively disabled when the streaming workload is
using all of physical memory and kswapd is reclaiming behind it anyway.
The allocator uses the preferred zone while reclaim scans behind it. If
you run tiobench with a size that fits within memory then the IO results
themselves are valid, but it should also show that zone allocations are
still spread fairly.

This is from a tiobench configuration that fits within memory.

                            3.16.0-rc2  3.16.0-rc2
                               vanilla fairzone-v4
DMA32 allocs                  10809658    10904632
Normal allocs                 18401594    18342985

In this case there was no reclaim activity.

>                               3.16-rc1    3.16-rc1
>                                        seqreadv4r8
> Zone normal velocity         11358.492   17996.733
> Zone dma32 velocity           8213.852       0.000
> 

This shows that when the IO workload is twice the size of memory, the
allocations stay confined within one zone. Given that this is mostly a
streaming workload and we are discarding behind it, that was of less
concern than the interleaving causing the wrong reclaim decisions to be
made.

> Both negative effects stem from kswapd suddenly ignoring the classzone
> index while the page allocator respects it: the page allocator will
> keep the low wmark + lowmem reserves in DMA32 free, but kswapd won't
> reclaim in there until it drops down to the high watermark.  The low
> watermark + lowmem reserve is usually bigger than the high watermark,
> so you effectively disable kswapd service in DMA32 for user requests.
> The zone is then no longer used until it fills with enough kernel
> pages to trigger kswapd, or the workload goes into direct reclaim.
> 

Yes. If either the classzone index or the balance gap is preserved then
the same regression exists. The interleaving from the allocator, combined
with the ordering of kswapd activity on the lower zones, reclaimed pages
before they were finished with.

> The classzone change is a non-sensical change IMO, and there is no
> useful description of it to be found in the changelog.  But for the
> given tests it appears to be the only change in the entire series to
> make a measurable difference; reverting it gets me back to baseline:
> 
> tiobench MB/sec
>                                         3.16-rc1              3.16-rc1              3.16-rc1
>                                                            seqreadv4r8  seqreadv4r8classzone
> Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)      129.72 (  0.05%)
> Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)      115.61 ( -0.11%)
> Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)      110.15 ( -0.06%)
> Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)      102.15 (  0.44%)
> Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)       86.63 (  0.20%)
> 

That is consistent with my own tests. The single patch that remained was
the logical change.

>             3.16-rc1    3.16-rc1    3.16-rc1
>                      seqreadv4r8seqreadv4r8classzone
> User          272.45      277.17      272.23
> System        197.89      186.30      193.73
> Elapsed      4589.17     4356.23     4584.57
> 
>                               3.16-rc1    3.16-rc1    3.16-rc1
>                                        seqreadv4r8seqreadv4r8classzone
> Zone normal velocity         11358.492   17996.733   12695.547
> Zone dma32 velocity           8213.852       0.000    6891.421
> 
> Please stop making multiple logical changes in a single patch/testing
> unit. 

In this case you would end up with two patches:

Removal of balance gap -- no major difference measured
Removal of classzone_idx -- removes the lowmem reserve

The first patch on its own would have no useful documentation attached,
which is why it was not split out.

> This will make it easier to verify them, and hopefully make it
> also more obvious if individual changes are underdocumented.  As it
> stands, it's hard to impossible to verify the implementation when the
> intentions are not fully documented.  Performance results can only do
> so much.  They are meant to corroborate the model, not replace it.
> 

The fair zone policy itself partially works against the lowmem reserve
idea. The point of the lowmem reserve was to preserve the lower zones
when an upper zone can be used, and the fair zone policy breaks that.
The fair zone policy ignores it and the two were never reconciled. The
dirty page distribution does a different interleaving again and was never
reconciled with the fair zone policy or lowmem reserves either. kswapd
itself was not using the classzone_idx it was actually woken for, although
in this case it may not matter. The end result is that the model is fairly
inconsistent, which makes comparison against it a difficult exercise at
best. About all that was left was the observation that, from a performance
perspective, the fair zone allocation policy is not doing the right thing
for streaming workloads.

> And again, if you change the way zone fairness works, please always
> include the zone velocity numbers or allocation numbers to show that
> your throughput improvements don't just come from completely wrecking
> fairness - or in this case from disabling an entire zone.

The fair zone policy is preserved until such time as the workload is
continually streaming data in and reclaiming it out. The original fair
zone allocation policy patch (81c0a2bb515fd4daae8cab64352877480792b515)
did not describe what workload it measurably benefitted. It noted that
pages can get activated and live longer than they should, which is
completely true, but it did not document why that mattered for streaming
workloads or notice that performance for those workloads got completely
shot.

There is a concern that the pages in the lower zone potentially get
preserved forever. However, the interleaving from the fair zone policy
would hit the low watermark again, and pages up to the high watermark
would still get rotated and reclaimed, so it did not seem like it would
be an issue.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 18:39   ` Mel Gorman
@ 2014-07-01 20:58     ` Mel Gorman
  2014-07-01 21:25     ` Johannes Weiner
  2014-07-01 22:38     ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-07-01 20:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Tue, Jul 01, 2014 at 07:39:15PM +0100, Mel Gorman wrote:
> The fair zone policy itself is partially working against the lowmem
> reserve idea. The point of the lowmem reserve was to preserve the lower
> zones when an upper zone can be used and the fair zone policy breaks
> that. The fair zone policy ignores that and it was never reconciled. The
> dirty page distribution does a different interleaving again and was never
> reconciled with the fair zone policy or lowmem reserves. kswapd itself was
> not using the classzone_idx it actually woken for although in this case
> it may not matter. The end result is that the model is fairly inconsistent
> which makes comparison against it a difficult exercise at best. About all
> that was left was that from a performance perspective that the fair zone
> allocation policy is not doing the right thing for streaming workloads.
> 

The inevitable feedback will be to reconcile those differences, so I
redid the series and queued it for testing. The patch list currently
looks like this:

mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
mm: page_alloc: Add ALLOC_DIRTY for dirty page distribution
mm: page_alloc: Only apply the fair zone allocation policy if it's eligible
mm: page_alloc: Only apply either the fair zone or dirty page distribution policy, not both
mm: page_alloc: Reduce cost of the fair zone allocation policy
mm: page_alloc: Reconcile lowmem reserves with fair zone allocation policy
mm: vmscan: Fix oddities with classzone and zone balancing
mm: vmscan: Reconcile balance gap lowmem reclaim with fair zone allocation policy
mm: vmscan: Remove classzone considerations from kswapd decisions

It takes about 13 hours to test ext3 on the small machine and 3 days on
the larger machine. The test could be accelerated by reducing either the
iterations or the memory size of the machine, but that would distort the
results too badly.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 18:39   ` Mel Gorman
  2014-07-01 20:58     ` Mel Gorman
@ 2014-07-01 21:25     ` Johannes Weiner
  2014-07-02 15:44       ` Johannes Weiner
  2014-07-01 22:38     ` Dave Chinner
  2 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2014-07-01 21:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Tue, Jul 01, 2014 at 07:39:15PM +0100, Mel Gorman wrote:
> On Tue, Jul 01, 2014 at 01:16:11PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 30, 2014 at 05:47:59PM +0100, Mel Gorman wrote:
> > > Changelog since V3
> > > o Push down kwapd changes to cover the balance gap
> > > o Drop drop page distribution patch
> > > 
> > > Changelog since V2
> > > o Simply fair zone policy cost reduction
> > > o Drop CFQ patch
> > > 
> > > Changelog since v1
> > > o Rebase to v3.16-rc2
> > > o Move CFQ patch to end of series where it can be rejected easier if necessary
> > > o Introduce page-reclaim related patch related to kswapd/fairzone interactions
> > > o Rework fast zone policy patch
> > > 
> > > IO performance since 3.0 has been a mixed bag. In many respects we are
> > > better and in some we are worse and one of those places is sequential
> > > read throughput. This is visible in a number of benchmarks but I looked
> > > at tiobench the closest. This is using ext3 on a mid-range desktop and
> > > the series applied.
> > > 
> > >                                       3.16.0-rc2                 3.0.0            3.16.0-rc2
> > >                                          vanilla               vanilla         fairzone-v4r5
> > > Min    SeqRead-MB/sec-1         120.92 (  0.00%)      133.65 ( 10.53%)      140.68 ( 16.34%)
> > > Min    SeqRead-MB/sec-2         100.25 (  0.00%)      121.74 ( 21.44%)      118.13 ( 17.84%)
> > > Min    SeqRead-MB/sec-4          96.27 (  0.00%)      113.48 ( 17.88%)      109.84 ( 14.10%)
> > > Min    SeqRead-MB/sec-8          83.55 (  0.00%)       97.87 ( 17.14%)       89.62 (  7.27%)
> > > Min    SeqRead-MB/sec-16         66.77 (  0.00%)       82.59 ( 23.69%)       70.49 (  5.57%)
> > > 
> > > Overall system CPU usage is reduced
> > > 
> > >           3.16.0-rc2       3.0.0  3.16.0-rc2
> > >              vanilla     vanilla fairzone-v4
> > > User          390.13      251.45      396.13
> > > System        404.41      295.13      389.61
> > > Elapsed      5412.45     5072.42     5163.49
> > > 
> > > This series does not fully restore throughput performance to 3.0 levels
> > > but it brings it close for lower thread counts. Higher thread counts are
> > > known to be worse than 3.0 due to CFQ changes but there is no appetite
> > > for changing the defaults there.
> > 
> > I ran tiobench locally and here are the results:
> > 
> > tiobench MB/sec
> >                                         3.16-rc1              3.16-rc1
> >                                                            seqreadv4r8
> > Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)
> > Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)
> > Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)
> > Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)
> > Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)
> > Mean   RandRead-MB/sec-1          1.14 (  0.00%)        1.11 ( -2.35%)
> > Mean   RandRead-MB/sec-2          1.30 (  0.00%)        1.25 ( -3.85%)
> > Mean   RandRead-MB/sec-4          1.50 (  0.00%)        1.46 ( -2.23%)
> > Mean   RandRead-MB/sec-8          1.72 (  0.00%)        1.60 ( -6.96%)
> > Mean   RandRead-MB/sec-16         1.72 (  0.00%)        1.69 ( -2.13%)
> > 
> > Seqread throughput is up, randread takes a small hit.  But allocation
> > latency is badly screwed at higher concurrency levels:
> > 
> 
> So the results are roughly similar. You don't state which filesystem it is
> but FWIW if it's the ext3 filesystem using the ext4 driver then throughput
> at higher levels is also affected by filesystem fragmentation. The problem
> was outside the scope of the series.

It's an ext4 filesystem.

> > tiobench Maximum Latency
> >                                             3.16-rc1              3.16-rc1
> >                                                                seqreadv4r8
> > Mean   SeqRead-MaxLatency-1          77.23 (  0.00%)       57.69 ( 25.30%)
> > Mean   SeqRead-MaxLatency-2         228.80 (  0.00%)      218.50 (  4.50%)
> > Mean   SeqRead-MaxLatency-4         329.58 (  0.00%)      325.93 (  1.11%)
> > Mean   SeqRead-MaxLatency-8         485.13 (  0.00%)      475.35 (  2.02%)
> > Mean   SeqRead-MaxLatency-16        599.10 (  0.00%)      637.89 ( -6.47%)
> > Mean   RandRead-MaxLatency-1         66.98 (  0.00%)       18.21 ( 72.81%)
> > Mean   RandRead-MaxLatency-2        132.88 (  0.00%)      119.61 (  9.98%)
> > Mean   RandRead-MaxLatency-4        222.95 (  0.00%)      213.82 (  4.10%)
> > Mean   RandRead-MaxLatency-8        982.99 (  0.00%)     1009.71 ( -2.72%)
> > Mean   RandRead-MaxLatency-16       515.24 (  0.00%)     1883.82 (-265.62%)
> > Mean   SeqWrite-MaxLatency-1        239.78 (  0.00%)      233.61 (  2.57%)
> > Mean   SeqWrite-MaxLatency-2        517.85 (  0.00%)      413.39 ( 20.17%)
> > Mean   SeqWrite-MaxLatency-4        249.10 (  0.00%)      416.33 (-67.14%)
> > Mean   SeqWrite-MaxLatency-8        629.31 (  0.00%)      851.62 (-35.33%)
> > Mean   SeqWrite-MaxLatency-16       987.05 (  0.00%)     1080.92 ( -9.51%)
> > Mean   RandWrite-MaxLatency-1         0.01 (  0.00%)        0.01 (  0.00%)
> > Mean   RandWrite-MaxLatency-2         0.02 (  0.00%)        0.02 (  0.00%)
> > Mean   RandWrite-MaxLatency-4         0.02 (  0.00%)        0.02 (  0.00%)
> > Mean   RandWrite-MaxLatency-8         1.83 (  0.00%)        1.96 ( -6.73%)
> > Mean   RandWrite-MaxLatency-16        1.52 (  0.00%)        1.33 ( 12.72%)
> > 
> > Zone fairness is completely gone.  The overall allocation distribution
> > on this system goes from 40%/60% to 10%/90%, and during the workload
> > the DMA32 zone is not used *at all*:
> > 
> 
> The zone fairness gets effectively disabled when the streaming is using all
> of physical memory and reclaiming behind anyway as kswapd. The allocator is
> using the preferred zone while reclaim scans behind it. If you run tiobench
> with a size that fits within memory then the IO results themselves are
> valid but it should show that the zone allocation is still spread fairly.
> 
> This is from a tiobench configuration that fits within memory.
> 
>                             3.16.0-rc2  3.16.0-rc2
>                                vanilla fairzone-v4
> DMA32 allocs                  10809658    10904632
> Normal allocs                 18401594    18342985
> 
> In this case there was no reclaim activity.
> 
> >                               3.16-rc1    3.16-rc1
> >                                        seqreadv4r8
> > Zone normal velocity         11358.492   17996.733
> > Zone dma32 velocity           8213.852       0.000
> > 
> 
> Showing that when the IO workload is twice memory that it stays confined
> within one zone. Considering that this is a streaming workload for the
> most part and we're discarding behind it was of less concern considering
> that interleaving results in the wrong reclaim decisions being made.

How can we tell a streaming workload from a thrashing one?  The VM can
only recognize multiple accesses within an LRU cycle, and you just cut
the LRU cycle in half.

Workingset adaptiveness is back in the toilet with your changes; you
can verify that easily by trying to cache one file slightly bigger
than memory through sequential reads, then another file of the same
size.  The second file never gets cached because it's thrashing in the
Normal zone while the unused file-1 gunk in DMA32 never gets the boot.

This is a correctness issue, which means that the other side of the
20% improvement in tiobench is a regression that scales with runtime
of file-2 access.
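
For reference, checking residency after such a run can be done with a
quick mincore() hack along these lines (a sketch with minimal error
handling, not a polished tool; run it against file-1 and file-2 after
the reads):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[1 << 20];
	struct stat st;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}

	/* sequential read to populate (or thrash) the page cache */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	long page = sysconf(_SC_PAGESIZE);
	size_t pages = (st.st_size + page - 1) / page;
	unsigned char *vec = malloc(pages);
	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

	if (!vec || map == MAP_FAILED || mincore(map, st.st_size, vec) < 0) {
		perror("mincore");
		return 1;
	}

	size_t resident = 0;
	for (size_t i = 0; i < pages; i++)
		resident += vec[i] & 1;

	printf("%zu of %zu pages resident (%.1f%%)\n",
	       resident, pages, 100.0 * resident / pages);
	return 0;
}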

> > Both negative effects stem from kswapd suddenly ignoring the classzone
> > index while the page allocator respects it: the page allocator will
> > keep the low wmark + lowmem reserves in DMA32 free, but kswapd won't
> > reclaim in there until it drops down to the high watermark.  The low
> > watermark + lowmem reserve is usually bigger than the high watermark,
> > so you effectively disable kswapd service in DMA32 for user requests.
> > The zone is then no longer used until it fills with enough kernel
> > pages to trigger kswapd, or the workload goes into direct reclaim.
> > 
> 
> Yes. If the classzone index was preserved or the balance gap then the same
> regression exists. The interleaving from the allocator and ordering of kswapd
> activity on the lower zones reclaimed pages before they were finished with.

Is "readahead pages getting trashed before they are used" the main
explanation for this particular regression?

> > The classzone change is a non-sensical change IMO, and there is no
> > useful description of it to be found in the changelog.  But for the
> > given tests it appears to be the only change in the entire series to
> > make a measurable difference; reverting it gets me back to baseline:
> > 
> > tiobench MB/sec
> >                                         3.16-rc1              3.16-rc1              3.16-rc1
> >                                                            seqreadv4r8  seqreadv4r8classzone
> > Mean   SeqRead-MB/sec-1         129.66 (  0.00%)      156.16 ( 20.44%)      129.72 (  0.05%)
> > Mean   SeqRead-MB/sec-2         115.74 (  0.00%)      138.50 ( 19.66%)      115.61 ( -0.11%)
> > Mean   SeqRead-MB/sec-4         110.21 (  0.00%)      127.08 ( 15.31%)      110.15 ( -0.06%)
> > Mean   SeqRead-MB/sec-8         101.70 (  0.00%)      108.47 (  6.65%)      102.15 (  0.44%)
> > Mean   SeqRead-MB/sec-16         86.45 (  0.00%)       91.57 (  5.92%)       86.63 (  0.20%)
> > 
> 
> That is consistent with my own tests. The single patch that remained was
> the logical change.
> 
> >             3.16-rc1    3.16-rc1    3.16-rc1
> >                      seqreadv4r8seqreadv4r8classzone
> > User          272.45      277.17      272.23
> > System        197.89      186.30      193.73
> > Elapsed      4589.17     4356.23     4584.57
> > 
> >                               3.16-rc1    3.16-rc1    3.16-rc1
> >                                        seqreadv4r8seqreadv4r8classzone
> > Zone normal velocity         11358.492   17996.733   12695.547
> > Zone dma32 velocity           8213.852       0.000    6891.421
> > 
> > Please stop making multiple logical changes in a single patch/testing
> > unit. 
> 
> In this case you would end up with two patches
> 
> Removal of balance gap -- no major difference measured
> Removal of classzone_idx -- removes the lowmem reserve
> 
> The first patch on its own would have no useful documentation attached
> which is why it was not split out.

The balance gap is a self-contained concept that was introduced for a
specific reason, which I'm sure has nothing to do with lowmem
reserves.  If that reason doesn't exist anymore, removing it can be a
separate change that documents when and how the gap became obsolete.

Sure, there is only one motivation why you are actually removing both
things, which can be mentioned in the cover letter, but they are still
different logical changes in the reclaim/placement model.

> > This will make it easier to verify them, and hopefully make it
> > also more obvious if individual changes are underdocumented.  As it
> > stands, it's hard to impossible to verify the implementation when the
> > intentions are not fully documented.  Performance results can only do
> > so much.  They are meant to corroborate the model, not replace it.
> > 
> 
> The fair zone policy itself is partially working against the lowmem
> reserve idea. The point of the lowmem reserve was to preserve the lower
> zones when an upper zone can be used and the fair zone policy breaks
> that.

The allocator always filled all zones minus their lowmem reserves
before initiating reclaim; the fair policy just makes sure they fill
up at the same rate, but it still respects the chunks reserved for
non-user allocations, the lowmem reserves.

It's arguable whether there was an intentional best-effort mechanism
that preferred higher zones, in addition to the lowmem reserves, but
the upsides are not clear to me (nobody complained about lowmem
allocation problems after the change) and it doesn't integrate into
our multi-zone LRU aging model, so I got rid of it.

So no, I don't see conflicting concepts here.
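
To spell out what "fill up at the same rate" means, here is a toy model
of the proportional batches (illustrative only, not the mm/page_alloc.c
implementation; the zone sizes and allocation count are invented):

#include <stdio.h>

#define NR_ZONES 2

int main(void)
{
	const char *name[NR_ZONES] = { "Normal", "DMA32" };
	const long size[NR_ZONES] = { 3072, 1024 };	/* relative zone sizes */
	long batch[NR_ZONES] = { 0, 0 };
	long allocated[NR_ZONES] = { 0, 0 };

	for (long page = 0; page < 400000; page++) {
		int z;

		/* prefer the highest zone that still has fair quota left */
		for (z = 0; z < NR_ZONES; z++)
			if (batch[z] > 0)
				break;

		/* all quotas spent: refill them in proportion to zone size */
		if (z == NR_ZONES) {
			for (z = 0; z < NR_ZONES; z++)
				batch[z] = size[z];
			z = 0;
		}

		batch[z]--;
		allocated[z]++;
	}

	/* prints a ~3:1 split: both zones fill at the same relative rate */
	for (int z = 0; z < NR_ZONES; z++)
		printf("%-6s allocations: %ld\n", name[z], allocated[z]);
	return 0;
}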

> The fair zone policy ignores that and it was never reconciled. The
> dirty page distribution does a different interleaving again and was never
> reconciled with the fair zone policy or lowmem reserves.

A write will pick the zone that meets the watermark (low+reserve) and
has available fair quota - they should be roughly in sync - and hasn't
exhausted its zone dirty limit.  If there is no eligible zone, reclaim
is preferred over breaching the dirty limit (rather have clean cache
reclaimed than the reclaimer running into dirty cache), so we enter
the slowpath.  The only reason why the slowpath ultimately ignores the
dirty limit (after waking kswapd) is because the flushers are not NUMA
aware, but this is explicitly documented.  But kswapd is awake at
that point, so any setbacks in fairness should be temporary.
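
As a sketch of that placement decision (the struct and numbers are
invented for illustration; this is not the actual code path):

#include <stdbool.h>
#include <stdio.h>

struct zone_state {
	long free_pages, low_wmark, lowmem_reserve;
	long fair_quota;		/* remaining fair allocation batch */
	long dirty_pages, dirty_limit;	/* this zone's share of the dirty limit */
};

static bool zone_eligible_for_write(const struct zone_state *z)
{
	return z->free_pages > z->low_wmark + z->lowmem_reserve &&
	       z->fair_quota > 0 &&
	       z->dirty_pages < z->dirty_limit;
}

int main(void)
{
	/* watermark and fair quota are fine, but the dirty limit is hit */
	struct zone_state normal = {
		.free_pages = 50000, .low_wmark = 10000, .lowmem_reserve = 0,
		.fair_quota = 512,
		.dirty_pages = 30000, .dirty_limit = 30000,
	};

	/*
	 * Not eligible: the allocator tries the next zone, and if no zone
	 * qualifies it enters the slowpath and wakes kswapd rather than
	 * breach the dirty limit.
	 */
	printf("eligible for dirtying write: %d\n",
	       zone_eligible_for_write(&normal));
	return 0;
}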

> kswapd itself was
> not using the classzone_idx it actually woken for although in this case
> it may not matter. The end result is that the model is fairly inconsistent
> which makes comparison against it a difficult exercise at best. About all
> that was left was that from a performance perspective that the fair zone
> allocation policy is not doing the right thing for streaming workloads.

But your changes are not doing the right thing for in-core workloads
and working set changes where predictable aging matters, plus they
take away whatever consistency we have in the placement model.  It's
not a good trade-off.

> > And again, if you change the way zone fairness works, please always
> > include the zone velocity numbers or allocation numbers to show that
> > your throughput improvements don't just come from completely wrecking
> > fairness - or in this case from disabling an entire zone.
> 
> The fair zone policy is preserved until such time as the workload is
> continually streaming data in and reclaiming out. The original fair zone
> allocation policy patch (81c0a2bb515fd4daae8cab64352877480792b515) did not
> describe what workload it measurably benefitted. It noted that pages can
> get activated and live longer than they should which is completely true
> but did not document why that mattered for streaming workloads or notice
> that performance for those workloads got completely shot.

It's also still not clear what's causing the regression.  Your first
theory was allocator overhead, but it couldn't get profiled, and
reducing the overhead and pointless zonelist walks significantly
didn't make a real difference in the end result.  The second theory of
lower zones being scanned excessively when they are balanced turned
out to not even match the code.  Lastly, the balance gap was suspected
as the reason for unfavorable lower zone reclaim, but removing it
didn't help, either.

These explanations make no sense.  If pages of a streaming writer have
enough time in memory to not thrash with a single zone, the fair
policy should make even MORE time in memory available to them and not
thrash them.  The fair policy is a necessity for multi-zone aging to
make any sense and having predictable reclaim and activation behavior.
That's why it's obviously not meant to benefit streaming workloads,
but it shouldn't harm them, either.  Certainly not 20%.  If streaming
pages thrash, something is up, the solution isn't to just disable the
second zone or otherwise work around the issue.

> There is a concern that the pages on the lower zone potentially get preserved
> forever. However, the interleaving from the fair zone policy would reach
> the low watermark again and pages up to the high watermark would still
> get rotated and reclaimed so it did not seem like it would be an issue.

There is no interleaving because the page allocator recognizes the
lowmem reserve and doesn't reach the now much lower kswapd trigger
point.  The zone is full from a page allocator point of view, and
balanced from a kswapd point of view.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 18:39   ` Mel Gorman
  2014-07-01 20:58     ` Mel Gorman
  2014-07-01 21:25     ` Johannes Weiner
@ 2014-07-01 22:38     ` Dave Chinner
  2014-07-01 23:09       ` Mel Gorman
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2014-07-01 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Tue, Jul 01, 2014 at 07:39:15PM +0100, Mel Gorman wrote:
> On Tue, Jul 01, 2014 at 01:16:11PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 30, 2014 at 05:47:59PM +0100, Mel Gorman wrote:
> > Seqread throughput is up, randread takes a small hit.  But allocation
> > latency is badly screwed at higher concurrency levels:
> 
> So the results are roughly similar. You don't state which filesystem it is
> but FWIW if it's the ext3 filesystem using the ext4 driver then throughput
> at higher levels is also affected by filesystem fragmentation. The problem
> was outside the scope of the series.

I'd suggest you're both going wrong at the "using ext3" point.

Use ext4 or XFS for your performance measurements because that's
what everyone is using on their systems these days. Not to mention
they don't have all the crappy allocation artifacts that ext3 has,
nor the throughput limitations caused by the ext3 journal, and so
on.

Fundamentally, ext3 performance is simply not a relevant performance
metric anymore - it's a legacy filesystem in maintenance mode and
has been for a few years now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 22:38     ` Dave Chinner
@ 2014-07-01 23:09       ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-07-01 23:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 02, 2014 at 08:38:17AM +1000, Dave Chinner wrote:
> On Tue, Jul 01, 2014 at 07:39:15PM +0100, Mel Gorman wrote:
> > On Tue, Jul 01, 2014 at 01:16:11PM -0400, Johannes Weiner wrote:
> > > On Mon, Jun 30, 2014 at 05:47:59PM +0100, Mel Gorman wrote:
> > > Seqread throughput is up, randread takes a small hit.  But allocation
> > > latency is badly screwed at higher concurrency levels:
> > 
> > So the results are roughly similar. You don't state which filesystem it is
> > but FWIW if it's the ext3 filesystem using the ext4 driver then throughput
> > at higher levels is also affected by filesystem fragmentation. The problem
> > was outside the scope of the series.
> 
> I'd suggest you're both going wrong that the "using ext3" point.
> 
> Use ext4 or XFS for your performance measurements because that's
> what everyone is using for the systems these days. iNot to mention
> they don'thave all the crappy allocation artifacts that ext3 has,
> nor the throughput limitations caused by the ext3 journal, and so
> on.
> 
> Fundamentally, ext3 performance is simply not a relevant performance
> metric anymore - it's a legacy filesystem in maintenance mode and
> has been for a few years now...
> 

The problem crosses filesystems. ext3 is simply the first in the queue
because by and large it behaved the worst.  Covering the rest of them
simply takes more time, with different results as you might expect. Here
are the xfs results from the smaller of the machines, as it managed to
get that far before it was reset:

                                      3.16.0-rc2                 3.0.0            3.16.0-rc2
                                         vanilla               vanilla           fairzone-v4
Min    SeqRead-MB/sec-1          92.69 (  0.00%)       99.68 (  7.54%)      104.47 ( 12.71%)
Min    SeqRead-MB/sec-2         106.81 (  0.00%)      123.43 ( 15.56%)      123.24 ( 15.38%)
Min    SeqRead-MB/sec-4         101.89 (  0.00%)      113.78 ( 11.67%)      116.85 ( 14.68%)
Min    SeqRead-MB/sec-8          95.31 (  0.00%)       91.40 ( -4.10%)      101.68 (  6.68%)
Min    SeqRead-MB/sec-16         81.84 (  0.00%)       88.53 (  8.17%)       86.63 (  5.85%)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-01 21:25     ` Johannes Weiner
@ 2014-07-02 15:44       ` Johannes Weiner
  2014-07-02 15:53         ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2014-07-02 15:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Tue, Jul 01, 2014 at 05:25:38PM -0400, Johannes Weiner wrote:
> These explanations make no sense.  If pages of a streaming writer have
> enough time in memory to not thrash with a single zone, the fair
> policy should make even MORE time in memory available to them and not
> thrash them.  The fair policy is a necessity for multi-zone aging to
> make any sense and having predictable reclaim and activation behavior.
> That's why it's obviously not meant to benefit streaming workloads,
> but it shouldn't harm them, either.  Certainly not 20%.  If streaming
> pages thrash, something is up, the solution isn't to just disable the
> second zone or otherwise work around the issue.

Hey, funny story.

I tried reproducing this with an isolated tester just to be sure,
stealing tiobench's do_read_test(), but I could not reproduce the results.

I compared the original fair policy commit with its parent, I compared
a current vanilla kernel to a crude #ifdef'd policy disabling, and I
compared vanilla to your patch series - every kernel yields 132MB/s.

Then I realized, 132MB/s is the disk limit anyway - how the hell did I
get 150MB/s peak speeds for sequential cold cache IO with seqreadv4?

So I looked at the tiobench source code and it turns out, it's not
cold cache at all: it first does the write test, then the read test on
the same file!

The file is bigger than memory, so you would expect the last X percent
of the file to be cached after the seq write and the subsequent seq
read to push the tail out before getting to it - standard working set
bigger than memory behavior.

But without fairness, a chunk from the beginning of the file gets
stuck in the DMA32 zone and never pushed out while writing, so when
the reader comes along, it gets random parts from cache!

All patches that showed "major improvements" ruined fairness and led
to non-linear caching of the test file during the write, and the read
speedups came from the file being partially served from cache.

Sequential IO is fine.  This benchmark needs a whack over the head.
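
FWIW, the isolated tester boils down to something like this (a rough
sketch, not the exact program I ran; it assumes the file has already
been written back, so POSIX_FADV_DONTNEED can actually drop its pages):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[1 << 20];
	struct timespec t0, t1;
	long long bytes = 0;
	ssize_t ret;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	/* drop this file's clean pages so the read really hits the disk */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while ((ret = read(fd, buf, sizeof(buf))) > 0)
		bytes += ret;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f MB/sec\n", bytes / secs / (1 << 20));
	return ret < 0 ? 1 : 0;
}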

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve sequential read throughput v4r8
  2014-07-02 15:44       ` Johannes Weiner
@ 2014-07-02 15:53         ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2014-07-02 15:53 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 02, 2014 at 11:44:39AM -0400, Johannes Weiner wrote:
> On Tue, Jul 01, 2014 at 05:25:38PM -0400, Johannes Weiner wrote:
> > These explanations make no sense.  If pages of a streaming writer have
> > enough time in memory to not thrash with a single zone, the fair
> > policy should make even MORE time in memory available to them and not
> > thrash them.  The fair policy is a necessity for multi-zone aging to
> > make any sense and having predictable reclaim and activation behavior.
> > That's why it's obviously not meant to benefit streaming workloads,
> > but it shouldn't harm them, either.  Certainly not 20%.  If streaming
> > pages thrash, something is up, the solution isn't to just disable the
> > second zone or otherwise work around the issue.
> 
> Hey, funny story.
> 
> I tried reproducing this with an isolated tester just to be sure,
> stealing tiobench's do_read_test(), but I wouldn't get any results.
> 
> I compared the original fair policy commit with its parent, I compared
> a current vanilla kernel to a crude #ifdef'd policy disabling, and I
> compared vanilla to your patch series - every kernel yields 132MB/s.
> 
> Then I realized, 132MB/s is the disk limit anyway - how the hell did I
> get 150MB/s peak speeds for sequential cold cache IO with seqreadv4?
> 
> So I looked at the tiobench source code and it turns out, it's not
> cold cache at all: it first does the write test, then the read test on
> the same file!
> 
> The file is bigger than memory, so you would expect the last X percent
> of the file to be cached after the seq write and the subsequent seq
> read to push the tail out before getting to it - standard working set
> bigger than memory behavior.
> 
> But without fairness, a chunk from the beginning of the file gets
> stuck in the DMA32 zone and never pushed out while writing, so when
> the reader comes along, it gets random parts from cache!
> 
> All patches that showed "major improvements" ruined fairness and led
> to non-linear caching of the test file during the write, and the read
> speedups came from the file being partially served from cache.
> 

Well, shit. Yes, artificially preserving part of the file in cache would
give an apparent boost but would be the completely wrong thing to do.

Andrew, please drop the series from mmotm. I'll pick apart whatever is
left that makes sense and resubmit what's left over if necessary.

Thanks

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-07-02 15:53 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-30 16:47 [PATCH 0/5] Improve sequential read throughput v4r8 Mel Gorman
2014-06-30 16:48 ` [PATCH 1/4] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
2014-06-30 16:48 ` [PATCH 2/4] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
2014-06-30 16:48 ` [PATCH 3/4] mm: vmscan: Do not reclaim from lower zones if they are balanced Mel Gorman
2014-06-30 16:48 ` [PATCH 4/4] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
2014-06-30 21:14   ` Andrew Morton
2014-06-30 21:51     ` Mel Gorman
2014-06-30 22:09       ` Andrew Morton
2014-07-01  8:02         ` Mel Gorman
2014-07-01 17:16 ` [PATCH 0/5] Improve sequential read throughput v4r8 Johannes Weiner
2014-07-01 18:39   ` Mel Gorman
2014-07-01 20:58     ` Mel Gorman
2014-07-01 21:25     ` Johannes Weiner
2014-07-02 15:44       ` Johannes Weiner
2014-07-02 15:53         ` Mel Gorman
2014-07-01 22:38     ` Dave Chinner
2014-07-01 23:09       ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).