* [PATCH 0/6] Improve sequential read throughput v2
From: Mel Gorman @ 2014-06-25  7:58 UTC
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

Changelog since v1
o Rebase to v3.16-rc2
o Move CFQ patch to the end of the series where it can be rejected more easily if necessary
o Introduce a page-reclaim patch addressing kswapd/fair-zone interactions
o Rework fast zone policy patch

IO performance since 3.0 has been a mixed bag. In many respects we are
better, in some we are worse, and one of those places is sequential read
throughput. This is visible in a number of benchmarks but I looked at
tiobench the closest. The figures below use ext3 on a mid-range desktop
and compare against 3.0.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla                cfq600               vanilla
Min    SeqRead-MB/sec-1         120.96 (  0.00%)      140.43 ( 16.10%)      134.04 ( 10.81%)
Min    SeqRead-MB/sec-2         100.73 (  0.00%)      118.18 ( 17.32%)      120.76 ( 19.88%)
Min    SeqRead-MB/sec-4          96.05 (  0.00%)      110.84 ( 15.40%)      114.49 ( 19.20%)
Min    SeqRead-MB/sec-8          82.46 (  0.00%)       92.40 ( 12.05%)       98.04 ( 18.89%)
Min    SeqRead-MB/sec-16         66.37 (  0.00%)       76.68 ( 15.53%)       79.49 ( 19.77%)

This series does not fully restore throughput to 3.0 levels but it brings
it acceptably close. While throughput for higher numbers of threads is
still lower, it can be tuned by increasing CFQ's target_latency or by
disabling low_latency, giving higher overall throughput at the cost of
latency and IO fairness.
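
For reference, the cfq600 kernel in the results above is the series with
the CFQ patch applied. A minimal sketch of what such a tuning change could
look like, assuming it is a bump of the default target latency in
block/cfq-iosched.c from 300ms to 600ms (the identifier and old value are
assumptions, not taken from this posting):

	/* hypothetical: double the default CFQ target latency (300ms -> 600ms) */
	static const int cfq_target_latency = HZ * 6/10;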

This series is ordered in ascending likelihood of causing controversy so
that a partial series can still potentially be merged even if parts of it
are NAKed (e.g. the CFQ change). For reference, here is the series without
the CFQ patch at the end.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla             lessdirty               vanilla
Min    SeqRead-MB/sec-1         120.96 (  0.00%)      141.04 ( 16.60%)      134.04 ( 10.81%)
Min    SeqRead-MB/sec-2         100.73 (  0.00%)      116.26 ( 15.42%)      120.76 ( 19.88%)
Min    SeqRead-MB/sec-4          96.05 (  0.00%)      109.52 ( 14.02%)      114.49 ( 19.20%)
Min    SeqRead-MB/sec-8          82.46 (  0.00%)       88.60 (  7.45%)       98.04 ( 18.89%)
Min    SeqRead-MB/sec-16         66.37 (  0.00%)       69.87 (  5.27%)       79.49 ( 19.77%)


 block/cfq-iosched.c            |   2 +-
 include/linux/mmzone.h         | 210 ++++++++++++++++++++++-------------------
 include/linux/writeback.h      |   1 +
 include/trace/events/pagemap.h |  16 ++--
 mm/internal.h                  |   1 +
 mm/mm_init.c                   |   5 +-
 mm/page-writeback.c            |  15 +--
 mm/page_alloc.c                | 206 ++++++++++++++++++++++++++--------------
 mm/swap.c                      |   4 +-
 mm/vmscan.c                    |  16 ++--
 mm/vmstat.c                    |   4 +-
 11 files changed, 285 insertions(+), 195 deletions(-)

-- 
1.8.4.5


* [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
From: Mel Gorman @ 2014-06-25  7:58 UTC
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

The LRU insertion and activate tracepoints take a PFN as a parameter,
forcing the lookup overhead onto the caller.  Move that work into the
tracepoint fast-assign method so the cost is only incurred when the
tracepoint is active.
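
A minimal sketch of the pattern with a made-up tracepoint (not part of this
patch): the PFN is derived inside TP_fast_assign(), which only runs when the
tracepoint is active, instead of being computed by every caller.

	TRACE_EVENT(mm_example_page,

		TP_PROTO(struct page *page),

		TP_ARGS(page),

		TP_STRUCT__entry(
			__field(unsigned long, pfn)
		),

		TP_fast_assign(
			/* only evaluated when the event is enabled */
			__entry->pfn = page_to_pfn(page);
		),

		TP_printk("pfn=%lu", __entry->pfn)
	);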

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5


* [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
From: Mel Gorman @ 2014-06-25  7:58 UTC
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

The arrangement of struct zone has changed over time and it has now reached
the point where there is some inappropriate cache line sharing going on.
On x86-64 for example

o The zone->node field shares a cache line with the zone lock, and zone->node
  is accessed frequently from the page allocator due to the fair zone
  allocation policy.
o span_seqlock is almost never used but shares a line with free_area.
o Some zone statistics share a cache line with the LRU lock so reclaim-intensive
  and allocator-intensive workloads can bounce the cache line on a stat update.

This patch rearranges struct zone to put read-only and read-mostly fields
together and then splits the page allocator intensive fields, the zone
statistics and the page reclaim intensive fields into their own cache
lines. Arguably the biggest change is reducing the size of the
lowmem_reserve type. It should still be large enough, and shrinking it
lets the fields used by the page allocator fast path fit in one cache line.

On the test configuration I used, the overall size of struct zone shrank
by one cache line.
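
A minimal sketch of the layout principle with a made-up structure (not the
actual struct zone layout below): fields are grouped by access pattern and
the groups are separated with ZONE_PADDING so that write-intensive locks do
not share a cache line with read-mostly data.

	struct example_zone {
		/* Read-mostly fields used in the allocator fast path */
		unsigned long		watermark[NR_WMARK];
		unsigned int		lowmem_reserve[MAX_NR_ZONES];

		ZONE_PADDING(_pad1_)

		/* Write-intensive fields used by the page allocator */
		spinlock_t		lock;

		ZONE_PADDING(_pad2_)

		/* Write-intensive fields used by page reclaim */
		spinlock_t		lru_lock;
	} ____cacheline_internodealigned_in_smp;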

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h | 201 +++++++++++++++++++++++++------------------------
 mm/page_alloc.c        |  13 ++--
 mm/vmstat.c            |   4 +-
 3 files changed, 113 insertions(+), 105 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cbd1b6..a2f6443 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -324,19 +324,12 @@ enum zone_type {
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
-	/* Fields commonly accessed by the page allocator */
+	/* Read-mostly fields */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
 	/*
-	 * When free pages are below this point, additional steps are taken
-	 * when reading the number of free pages to avoid per-cpu counter
-	 * drift allowing watermarks to be breached
-	 */
-	unsigned long percpu_drift_mark;
-
-	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -344,41 +337,17 @@ struct zone {
 	 * on the higher zones). This array is recalculated at runtime if the
 	 * sysctl_lowmem_reserve_ratio sysctl changes.
 	 */
-	unsigned long		lowmem_reserve[MAX_NR_ZONES];
-
-	/*
-	 * This is a per-zone reserve of pages that should not be
-	 * considered dirtyable memory.
-	 */
-	unsigned long		dirty_balance_reserve;
+	unsigned int lowmem_reserve[MAX_NR_ZONES];
 
+	struct per_cpu_pageset __percpu *pageset;
 #ifdef CONFIG_NUMA
 	int node;
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
 #endif
-	struct per_cpu_pageset __percpu *pageset;
 	/*
-	 * free areas of different sizes
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this zone's LRU.  Maintained by the pageout code.
 	 */
-	spinlock_t		lock;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
-	/* pfn where compaction free scanner should start */
-	unsigned long		compact_cached_free_pfn;
-	/* pfn where async and sync compaction migration scanner should start */
-	unsigned long		compact_cached_migrate_pfn[2];
-#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* see spanned/present_pages for more description */
-	seqlock_t		span_seqlock;
-#endif
-	struct free_area	free_area[MAX_ORDER];
+	unsigned int inactive_ratio;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
@@ -388,74 +357,37 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_COMPACTION
 	/*
-	 * On compaction failure, 1<<compact_defer_shift compactions
-	 * are skipped before trying again. The number attempted since
-	 * last failure is tracked with compact_considered.
+	 * This is a per-zone reserve of pages that should not be
+	 * considered dirtyable memory.
 	 */
-	unsigned int		compact_considered;
-	unsigned int		compact_defer_shift;
-	int			compact_order_failed;
-#endif
-
-	ZONE_PADDING(_pad1_)
-
-	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
-	struct lruvec		lruvec;
-
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
-	unsigned long		pages_scanned;	   /* since last reclaim */
-	unsigned long		flags;		   /* zone flags, see below */
-
-	/* Zone statistics */
-	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
+	unsigned long		dirty_balance_reserve;
 
 	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
 	 */
-	unsigned int inactive_ratio;
-
-
-	ZONE_PADDING(_pad2_)
-	/* Rarely used or read-mostly fields */
+	unsigned long percpu_drift_mark;
 
+#ifdef CONFIG_NUMA
 	/*
-	 * wait_table		-- the array holding the hash table
-	 * wait_table_hash_nr_entries	-- the size of the hash table array
-	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
-	 *
-	 * The purpose of all these is to keep track of the people
-	 * waiting for a page to become available and make them
-	 * runnable again when possible. The trouble is that this
-	 * consumes a lot of space, especially when so few things
-	 * wait on pages at a given time. So instead of using
-	 * per-page waitqueues, we use a waitqueue hash table.
-	 *
-	 * The bucket discipline is to sleep on the same queue when
-	 * colliding and wake all in that wait queue when removing.
-	 * When something wakes, it must check to be sure its page is
-	 * truly available, a la thundering herd. The cost of a
-	 * collision is great, but given the expected load of the
-	 * table, they should be so rare as to be outweighed by the
-	 * benefits from the saved space.
-	 *
-	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
-	 * primary users of these fields, and in mm/page_alloc.c
-	 * free_area_init_core() performs the initialization of them.
+	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	wait_queue_head_t	* wait_table;
-	unsigned long		wait_table_hash_nr_entries;
-	unsigned long		wait_table_bits;
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
+	const char		*name;
 
 	/*
-	 * Discontig memory support fields.
+	 * Number of MIGRATE_RESEVE page block. To maintain for just
+	 * optimization. Protected by zone->lock.
 	 */
+	int			nr_migrate_reserve_block;
+
 	struct pglist_data	*zone_pgdat;
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -504,16 +436,89 @@ struct zone {
 	unsigned long		present_pages;
 	unsigned long		managed_pages;
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/* see spanned/present_pages for more description */
+	seqlock_t		span_seqlock;
+#endif
+
 	/*
-	 * Number of MIGRATE_RESEVE page block. To maintain for just
-	 * optimization. Protected by zone->lock.
+	 * wait_table		-- the array holding the hash table
+	 * wait_table_hash_nr_entries	-- the size of the hash table array
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
+	 *
+	 * The purpose of all these is to keep track of the people
+	 * waiting for a page to become available and make them
+	 * runnable again when possible. The trouble is that this
+	 * consumes a lot of space, especially when so few things
+	 * wait on pages at a given time. So instead of using
+	 * per-page waitqueues, we use a waitqueue hash table.
+	 *
+	 * The bucket discipline is to sleep on the same queue when
+	 * colliding and wake all in that wait queue when removing.
+	 * When something wakes, it must check to be sure its page is
+	 * truly available, a la thundering herd. The cost of a
+	 * collision is great, but given the expected load of the
+	 * table, they should be so rare as to be outweighed by the
+	 * benefits from the saved space.
+	 *
+	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
+	 * primary users of these fields, and in mm/page_alloc.c
+	 * free_area_init_core() performs the initialization of them.
 	 */
-	int			nr_migrate_reserve_block;
+	wait_queue_head_t	*wait_table;
+	unsigned long		wait_table_hash_nr_entries;
+	unsigned long		wait_table_bits;
+
+	ZONE_PADDING(_pad1_)
+
+	/* Write-intensive fields used from the page allocator */
+	spinlock_t		lock;
+
+	/* free areas of different sizes */
+	struct free_area	free_area[MAX_ORDER];
+
+	/* zone flags, see below */
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
+	/* Write-intensive fields used by page reclaim */
+
+	/* Fields commonly accessed by the page reclaim scanner */
+	spinlock_t		lru_lock;
+	struct lruvec		lruvec;
+
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
+	unsigned long		pages_scanned;	   /* since last reclaim */
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where compaction free scanner should start */
+	unsigned long		compact_cached_free_pfn;
+	/* pfn where async and sync compaction migration scanner should start */
+	unsigned long		compact_cached_migrate_pfn[2];
+#endif
 
+#ifdef CONFIG_COMPACTION
 	/*
-	 * rarely used fields:
+	 * On compaction failure, 1<<compact_defer_shift compactions
+	 * are skipped before trying again. The number attempted since
+	 * last failure is tracked with compact_considered.
 	 */
-	const char		*name;
+	unsigned int		compact_considered;
+	unsigned int		compact_defer_shift;
+	int			compact_order_failed;
+#endif
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
+#endif
+
+	ZONE_PADDING(_pad3_)
+	/* Zone statistics */
+	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..ebbdbcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1699,7 +1699,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long lowmem_reserve = z->lowmem_reserve[classzone_idx];
 	int o;
 	long free_cma = 0;
 
@@ -1714,7 +1713,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + lowmem_reserve)
+	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -3245,7 +3244,7 @@ void show_free_areas(unsigned int filter)
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
-			printk(" %lu", zone->lowmem_reserve[i]);
+			printk(" %u", zone->lowmem_reserve[i]);
 		printk("\n");
 	}
 
@@ -5566,7 +5565,7 @@ static void calculate_totalreserve_pages(void)
 	for_each_online_pgdat(pgdat) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			unsigned long max = 0;
+			unsigned int max = 0;
 
 			/* Find valid and maximum lowmem_reserve in the zone */
 			for (j = i; j < MAX_NR_ZONES; j++) {
@@ -5617,6 +5616,7 @@ static void setup_per_zone_lowmem_reserve(void)
 			idx = j;
 			while (idx) {
 				struct zone *lower_zone;
+				unsigned long reserve;
 
 				idx--;
 
@@ -5624,8 +5624,11 @@ static void setup_per_zone_lowmem_reserve(void)
 					sysctl_lowmem_reserve_ratio[idx] = 1;
 
 				lower_zone = pgdat->node_zones + idx;
-				lower_zone->lowmem_reserve[j] = managed_pages /
+				reserve = managed_pages /
 					sysctl_lowmem_reserve_ratio[idx];
+				if (WARN_ON(reserve > UINT_MAX))
+					reserve = UINT_MAX;
+				lower_zone->lowmem_reserve[j] = reserve;
 				managed_pages += lower_zone->managed_pages;
 			}
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49..c6d6fae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1077,10 +1077,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 				zone_page_state(zone, i));
 
 	seq_printf(m,
-		   "\n        protection: (%lu",
+		   "\n        protection: (%u",
 		   zone->lowmem_reserve[0]);
 	for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
-		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+		seq_printf(m, ", %u", zone->lowmem_reserve[i]);
 	seq_printf(m,
 		   ")"
 		   "\n  pagesets");
-- 
1.8.4.5


* [PATCH 3/6] mm: vmscan: Do not reclaim from lower zones if they are balanced
From: Mel Gorman @ 2014-06-25  7:58 UTC
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

Historically kswapd scanned from DMA->Movable, in the opposite direction
to the page allocator, to avoid the allocator allocating behind kswapd's
direction of progress. The fair zone allocation policy altered this in a
non-obvious manner.

Traditionally, the page allocator prefers to use the highest eligible zone
until its watermark is depleted, then wakes kswapd and moves on to the next
zone. kswapd scans zones in the opposite direction, so the scan order on
64-bit looks like this:

Page alloc		Kswapd
----------              ------
Movable			DMA
Normal			DMA32
DMA32			Normal
DMA			Movable

If kswapd scanned in the same direction as the page allocator then it is
possible that kswapd would proportionally reclaim the lower zones that
were never used, as the page allocator was always allocating behind the
reclaim. This would work as follows

	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32
	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32

The introduction of the fair zone allocation policy fundamentally altered
this problem by interleaving between zones until the low watermark is
reached. There are at least two issues with this

o The page allocator can allocate behind kswapd's progress (kswapd
  scans/reclaims a lower zone and the fair zone allocation policy then uses
  those pages)
o When the low watermark of the high zone is reached there may be recently
  allocated pages from the lower zone, but as kswapd scans DMA->highmem up
  to the highest zone needing balancing, it will reclaim the lower zone
  even if it was balanced.

The problem is fundamental to the fact we have per-zone LRU and allocation
policies and ideally we would only have per-node allocation and LRU lists.
This would avoid the need for the fair zone allocation policy but the
low-memory-starvation issue would have to be addressed again from scratch.

This patch will only scan/reclaim from lower zones if they have not
reached their watermark. This should not break the normal page aging
as the proportional allocations due to the fair zone allocation policy
should compensate.
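
An illustrative sketch (not measured output) of kswapd's dma->highmem walk
after this change when the lower zones already meet their watermarks:

	kswapd checks DMA		balanced -> skipped
	kswapd checks DMA32		balanced -> skipped
	kswapd reclaims Normal		below watermark -> scanned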

tiobench was used to evaluate this because it includes a simple sequential
reader, which is where the regression is most obvious. It also has threaded
readers that produce reasonably steady figures.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla        checklow-v2r14               vanilla
Min    SeqRead-MB/sec-1         120.96 (  0.00%)      140.63 ( 16.26%)      134.04 ( 10.81%)
Min    SeqRead-MB/sec-2         100.73 (  0.00%)      117.95 ( 17.10%)      120.76 ( 19.88%)
Min    SeqRead-MB/sec-4          96.05 (  0.00%)      109.54 ( 14.04%)      114.49 ( 19.20%)
Min    SeqRead-MB/sec-8          82.46 (  0.00%)       88.22 (  6.99%)       98.04 ( 18.89%)
Min    SeqRead-MB/sec-16         66.37 (  0.00%)       69.14 (  4.17%)       79.49 ( 19.77%)
Mean   RandWrite-MB/sec-16        1.34 (  0.00%)        1.34 (  0.00%)        1.34 (  0.25%)

Note that this patch makes a large performance difference for lower thread
counts and brings performance closer to the 3.0 figures. It was also tested
with xfs and shows similar gains, although I don't have 3.0 figures to
compare against. There are still regressions for higher numbers of threads
but those are related to changes in the CFQ IO scheduler.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..40c3af8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3124,12 +3124,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * at the last zone which needs scanning. We do this because
+		 * the page allocators prefers to work in the opposite
+		 * direction and we want to avoid the page allocator reclaiming
+		 * behind kswapd's direction of progress. Due to the fair zone
+		 * allocation policy interleaving allocations between zones
+		 * we no longer proportionally scan the lower zones if the
+		 * watermarks are ok.
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -3152,6 +3153,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							&nr_soft_scanned);
 			sc.nr_reclaimed += nr_soft_reclaimed;
 
+			if (zone_balanced(zone, order, 0, 0))
+				continue;
+
 			/*
 			 * There should be no need to raise the scanning
 			 * priority if enough pages are already being scanned
-- 
1.8.4.5


* [PATCH 4/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
From: Mel Gorman @ 2014-06-25  7:58 UTC
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation attempt fails, the batch counts are reset and a second
attempt is made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and that the resets are therefore justified. This assumption does
not hold when zones reach their low watermark, as the batches will then be
consumed at uneven rates.  Allocation failure due to watermark depletion
results in additional zonelist scans for the reset and another watermark
check before hitting the slowpath.

This patch makes a number of changes that should reduce the overall cost

o Do not apply the fair zone policy to small zones such as DMA
o Abort the fair zone allocation policy once remote or small zones are
  encountered
o Use a simpler scan when resetting NR_ALLOC_BATCH
o Use a simple flag to identify depleted zones instead of accessing a
  potentially write-intensive cache line for counters
o Track zones that met the watermark but failed the NR_ALLOC_BATCH check
  to avoid rescanning the zonelist when the counters are reset

On UMA machines, the effect is marginal. Even judging from system CPU
usage it's small for the tiobench test

          3.16.0-rc2  3.16.0-rc2
            checklow    fairzone
User          396.24      396.23
System        395.23      391.50
Elapsed      5182.65     5165.49

And the number of pages allocated from each zone is comparable

                            3.16.0-rc2  3.16.0-rc2
                              checklow    fairzone
DMA allocs                           0           0
DMA32 allocs                   7374217     7920241
Normal allocs                999277551   996568115

On NUMA machines, the scanning overhead is higher as zones are scanned
that are ineligible for the fair zone allocation policy. This patch fixes
the zone-order zonelist policy and reduces the number of zones scanned by
the allocator.
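
A minimal sketch of the depleted-zone check (helper names follow the diff
below; the surrounding loop is heavily simplified): the fast path tests a
zone flag instead of reading the NR_ALLOC_BATCH counter, which lives on a
write-intensive cache line.

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		if ((alloc_flags & ALLOC_FAIR) && zone_is_fair_depleted(zone)) {
			/* cheap test_bit() instead of a vmstat counter read */
			nr_fair_skipped++;
			continue;
		}
		/* ... watermark checks and buffered_rmqueue() ... */
	}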

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   7 ++
 mm/mm_init.c           |   5 +-
 mm/page_alloc.c        | 179 ++++++++++++++++++++++++++++++++-----------------
 3 files changed, 128 insertions(+), 63 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2f6443..afb08be 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,6 +534,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -571,6 +572,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
@@ -694,6 +700,7 @@ struct zonelist_cache;
 struct zoneref {
 	struct zone *zone;	/* Pointer to actual zone */
 	int zone_idx;		/* zone_idx(zoneref->zone) */
+	bool fair_enabled;	/* eligible for fair zone policy */
 };
 
 /*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4074caf..37b7337 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -54,8 +54,9 @@ void mminit_verify_zonelist(void)
 			/* Iterate the zonelist */
 			for_each_zone_zonelist(zone, z, zonelist, zoneid) {
 #ifdef CONFIG_NUMA
-				printk(KERN_CONT "%d:%s ",
-					zone->node, zone->name);
+				printk(KERN_CONT "%d:%s%s ",
+					zone->node, zone->name,
+					z->fair_enabled ? "(F)" : "");
 #else
 				printk(KERN_CONT "0:%s ", zone->name);
 #endif /* CONFIG_NUMA */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebbdbcd..e954bf1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1541,10 +1541,11 @@ int split_free_page(struct page *page)
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static inline
+static
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			bool acct_fair)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1596,7 +1597,11 @@ again:
 					  get_freepage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (acct_fair) {
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+		if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
+			zone_set_flag(zone, ZONE_FAIR_DEPLETED);
+	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1908,6 +1913,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		if (!zone_is_fair_depleted(zone))
+			continue;
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1925,8 +1944,12 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	struct zone *fair_zone = NULL;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1949,11 +1972,15 @@ zonelist_scan:
 		 * time the page has in memory before being reclaimed.
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
-			if (!zone_local(preferred_zone, zone))
-				continue;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+			if (!zone_local(preferred_zone, zone) || !z->fair_enabled)
+				break;
+
+			if (fair_zone && zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
+			}
 		}
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
@@ -2049,22 +2076,27 @@ zonelist_scan:
 		}
 
 try_this_zone:
+		if (alloc_flags & ALLOC_FAIR) {
+			if (zone_is_fair_depleted(zone)) {
+				if (!fair_zone)
+					fair_zone = zone;
+				nr_fair_skipped++;
+				continue;
+			}
+		}
+
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype,
+						z->fair_enabled);
 		if (page)
 			break;
+
 this_zone_full:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2073,8 +2105,39 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	if (alloc_flags & ALLOC_FAIR) {
+		if (nr_fair_skipped)
+			reset_alloc_batches(preferred_zone);
+
+		/*
+		 * If there was a zone with a depleted fair policy batch but
+		 * that made the watermark then use it now instead of stalling
+		 * waiting on kswapd to make progress
+		 */
+		if (fair_zone)
+			return buffered_rmqueue(preferred_zone, fair_zone,
+					order, gfp_mask, migratetype, true);
+
+		/* Recheck remote nodes if there are any */
+		if (nr_online_nodes > 1) {
+			alloc_flags &= ~ALLOC_FAIR;
+			zonelist_rescan = true;
+		}
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2395,28 +2458,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2748,33 +2789,18 @@ retry_cpuset:
 		goto out;
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
+	if (preferred_zoneref->fair_enabled)
+		alloc_flags |= ALLOC_FAIR;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
@@ -3287,10 +3313,19 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
-static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
+			struct zoneref *zoneref, struct zone *preferred_zone)
 {
+	int zone_type = zone_idx(zone);
+	bool fair_enabled = zone_local(zone, preferred_zone);
+	if (zone_type == 0 &&
+			zone->managed_pages < (pgdat->node_present_pages >> 4))
+		fair_enabled = false;
+
 	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
+	zoneref->zone_idx = zone_type;
+	zoneref->fair_enabled = fair_enabled;
+	return fair_enabled;
 }
 
 /*
@@ -3303,17 +3338,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	struct zone *preferred_zone = NULL;
+	int nr_fair = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			if (!preferred_zone)
+				preferred_zone = zone;
+
+			nr_fair += zoneref_set_zone(pgdat, zone,
+				&zonelist->_zonerefs[nr_zones++],
+				preferred_zone);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
 
+	if (nr_fair <= 1)
+		zonelist->_zonerefs[0].fair_enabled = false;
+
 	return nr_zones;
 }
 
@@ -3510,6 +3554,7 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
 	zonelist->_zonerefs[j].zone = NULL;
 	zonelist->_zonerefs[j].zone_idx = 0;
+	zonelist->_zonerefs[j].fair_enabled = false;
 }
 
 /*
@@ -3538,8 +3583,9 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
-	struct zone *z;
+	struct zone *z, *preferred_zone = NULL;
 	struct zonelist *zonelist;
+	int nr_fair = 0;
 
 	zonelist = &pgdat->node_zonelists[0];
 	pos = 0;
@@ -3547,15 +3593,26 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (!preferred_zone)
+				preferred_zone = z;
 			if (populated_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
+				nr_fair += zoneref_set_zone(pgdat, z,
+					&zonelist->_zonerefs[pos++],
+					preferred_zone);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
 	zonelist->_zonerefs[pos].zone = NULL;
 	zonelist->_zonerefs[pos].zone_idx = 0;
+	zonelist->_zonerefs[pos].fair_enabled = false;
+
+	/*
+	 * For this policy, the fair zone allocation policy is disabled as the
+	 * stated priority is to preserve lower zones, not balance them fairly.
+	 */
+	if (nr_fair == 1 || nr_online_nodes > 1)
+		zonelist->_zonerefs[0].fair_enabled = false;
 }
 
 static int default_zonelist_order(void)
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-25  7:58 ` Mel Gorman
@ 2014-06-25  7:58   ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-25  7:58 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

When allocating a page cache page for writing, the allocator attempts to
distribute dirty pages proportionally between populated zones. The call
to zone_dirty_ok is more expensive than expected because of the number of
vmstat counters it examines. This patch caches some of that information to
reduce the cost. It means the proportional allocation is based on stale
data, but the heuristic does not need perfectly accurate information. As
before, the benefit is marginal but cumulative and depends on the size of
the machine. For a very small machine the effect is visible in system CPU
time for a tiobench test, but it will vary between workloads.

          3.16.0-rc2  3.16.0-rc2
            fairzone   lessdirty
User          393.76      389.03
System        391.50      388.64
Elapsed      5182.86     5186.28

Even though the benefit is small, it's generally worthwhile reducing overhead
in the page allocator, as it has a tendency to creep up over time.
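
As a purely illustrative aside, the caching pattern described above boils
down to something like the following userspace sketch. The names, numbers
and the absence of locking are invented for illustration; it is not the
kernel code, only the shape of a fast path that reads a cached limit while
a slower path recomputes and refreshes it.

/*
 * Minimal sketch of the caching idea: the common path reads a cached
 * per-zone dirty limit, a rarer path recomputes it and refreshes the
 * cache. All names and values are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_zone {
	unsigned long dirtyable_pages;		/* stand-in for zone_dirtyable_memory() */
	unsigned long nr_dirty;			/* dirty + unstable + writeback */
	unsigned long dirty_limit_cached;
};

static unsigned long toy_dirty_ratio = 20;	/* like vm_dirty_ratio, in percent */

static unsigned long toy_zone_dirty_limit(struct toy_zone *z)
{
	/* proportional limit, mirroring the ratio calculation */
	return toy_dirty_ratio * z->dirtyable_pages / 100;
}

static bool toy_zone_dirty_ok(struct toy_zone *z, bool recompute)
{
	unsigned long limit = z->dirty_limit_cached;

	if (recompute) {
		/* slow path: recompute and refresh the cached value */
		limit = toy_zone_dirty_limit(z);
		z->dirty_limit_cached = limit;
	}
	return z->nr_dirty <= limit;
}

int main(void)
{
	struct toy_zone z = { .dirtyable_pages = 100000, .nr_dirty = 15000 };

	toy_zone_dirty_ok(&z, true);		/* prime the cache once */
	printf("cached limit: %lu, dirty ok: %d\n",
	       z.dirty_limit_cached, (int)toy_zone_dirty_ok(&z, false));
	return 0;
}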

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h    |  2 ++
 include/linux/writeback.h |  1 +
 mm/internal.h             |  1 +
 mm/page-writeback.c       | 15 +++++++++------
 mm/page_alloc.c           | 16 ++++++++++++----
 5 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index afb08be..cab5137 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -480,6 +480,8 @@ struct zone {
 	/* zone flags, see below */
 	unsigned long		flags;
 
+	unsigned long		dirty_limit_cached;
+
 	ZONE_PADDING(_pad2_)
 
 	/* Write-intensive fields used by page reclaim */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..90190d4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -121,6 +121,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirty_limit(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11f..f31e3b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -370,5 +370,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
+#define ALLOC_DIRTY		0x200 /* spread GFP_WRITE allocations */
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 518e2c3..1990e9a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -298,10 +298,9 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  * Returns the maximum number of dirty pages allowed in a zone, based
  * on the zone's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+unsigned long zone_dirty_limit(struct zone *zone)
 {
 	unsigned long zone_memory = zone_dirtyable_memory(zone);
-	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
@@ -310,9 +309,6 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 	else
 		dirty = vm_dirty_ratio * zone_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
-		dirty += dirty / 4;
-
 	return dirty;
 }
 
@@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
  */
 bool zone_dirty_ok(struct zone *zone)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	unsigned long limit = zone->dirty_limit_cached;
+	struct task_struct *tsk = current;
+
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		limit = zone_dirty_limit(zone);
+		zone->dirty_limit_cached = limit;
+		limit += limit / 4;
+	}
 
 	return zone_page_state(zone, NR_FILE_DIRTY) +
 	       zone_page_state(zone, NR_UNSTABLE_NFS) +
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e954bf1..e1b17b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1942,9 +1942,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
-	int nr_fair_skipped = 0;
+	int nr_fair_skipped = 0, nr_fail_dirty = 0;
 	struct zone *fair_zone = NULL;
 	bool zonelist_rescan;
 
@@ -2007,8 +2005,11 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if ((alloc_flags & ALLOC_DIRTY) && !zone_dirty_ok(zone)) {
+			nr_fail_dirty++;
+			zone->dirty_limit_cached = zone_dirty_limit(zone);
 			continue;
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
@@ -2134,6 +2135,11 @@ this_zone_full:
 		zonelist_rescan = true;
 	}
 
+	if ((alloc_flags & ALLOC_DIRTY) && nr_fail_dirty) {
+		alloc_flags &= ~ALLOC_DIRTY;
+		zonelist_rescan = true;
+	}
+
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2791,6 +2797,8 @@ retry_cpuset:
 
 	if (preferred_zoneref->fair_enabled)
 		alloc_flags |= ALLOC_FAIR;
+	if (gfp_mask & __GFP_WRITE)
+		alloc_flags |= ALLOC_DIRTY;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-25  7:58 ` Mel Gorman
@ 2014-06-25  7:58   ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-25  7:58 UTC (permalink / raw)
  To: Linux Kernel, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jens Axboe, Jeff Moyer, Dave Chinner, Mel Gorman

The existing CFQ default target_latency results in very poor performance
for larger numbers of threads doing sequential reads. While this can be
easily described as a tuning problem for users, it is one that is tricky
to detect. This patch updates the default to benefit smaller machines.
Dave Chinner points out that it is dangerous to assume that people know
how to tune their IO scheduler. Jeff Moyer asked which workloads even
care about threaded readers, but it is reasonable to assume that file,
media, database and multi-user servers all experience large sequential
reads from multiple sources at the same time.
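
For anyone who does want to tune this by hand rather than rely on the
default, a minimal sketch of setting the value at runtime from C is below.
The sysfs path, the millisecond units and the device name are assumptions
about how CFQ exposes the tunable on a given system, so treat this as a
sketch rather than a recipe.

/*
 * Sketch: set CFQ's target_latency for one device at runtime.
 * Assumes CFQ is the active scheduler for sda, that the tunable is
 * exposed at the path below in milliseconds, and that the caller has
 * permission to write it. Roughly equivalent to
 * "echo 600 > /sys/block/sda/queue/iosched/target_latency".
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/iosched/target_latency";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "600\n");	/* 600ms, matching the new default below */
	fclose(f);
	return 0;
}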

It's a bit depressing to note how much slower this relatively simple case
is in comparison to 3.0. The following is from tiobench on a mid-range
desktop using ext3 as the test filesystem, although it is known that other
filesystems experience similar trouble.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                 lessdirty                cfq600                     vanilla
Min    SeqRead-MB/sec-1         140.79 (  0.00%)      140.43 ( -0.26%)      134.04 ( -4.79%)
Min    SeqRead-MB/sec-2         118.08 (  0.00%)      118.18 (  0.08%)      120.76 (  2.27%)
Min    SeqRead-MB/sec-4         108.47 (  0.00%)      110.84 (  2.18%)      114.49 (  5.55%)
Min    SeqRead-MB/sec-8          87.20 (  0.00%)       92.40 (  5.96%)       98.04 ( 12.43%)
Min    SeqRead-MB/sec-16         68.98 (  0.00%)       76.68 ( 11.16%)       79.49 ( 15.24%)

The full series including this patch brings performance within an acceptable
distance of 3.0.0-vanilla, bearing in mind that read latencies and fairness
are now generally better at some cost to overall throughput.

Here is the very high-level view of the iostats

                  3.16.0-rc2  3.16.0-rc2       3.0.0
                   lessdirty      cfq600     vanilla
Mean sda-avgqusz      935.48      957.28     1000.70
Mean sda-avgrqsz      575.27      579.85      600.71
Mean sda-await       4405.00     4471.12     4887.67
Mean sda-r_await       82.43       87.95      108.53
Mean sda-w_await    13272.23    10783.67    11599.83
Mean sda-rrqm          14.12       10.14       19.68
Mean sda-wrqm        1631.24     1744.00    11999.46
Max  sda-avgqusz     2179.79     2238.95     2626.78
Max  sda-avgrqsz     1021.03     1021.97     1024.00
Max  sda-await      15007.79    13600.51    24971.00
Max  sda-r_await      897.78      893.09     5308.00
Max  sda-w_await   207814.40   179483.79   177698.47
Max  sda-rrqm          68.40       45.60       73.30
Max  sda-wrqm       19544.00    19619.20    58058.40

The await figures are generally fine: average wait times are still acceptable
and the worst-case read wait times are tolerable. Queue sizes and request
sizes also look reasonable. It's worth noting that the iostats are generally
*far* better than they were on 3.0.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 block/cfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cadc378..876ae44 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -32,7 +32,7 @@ static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 static int cfq_group_idle = HZ / 125;
-static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static const int cfq_target_latency = HZ * 6/10; /* 600 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm: vmscan: Do not reclaim from lower zones if they are balanced
  2014-06-25  7:58   ` Mel Gorman
@ 2014-06-25 23:32     ` Andrew Morton
  -1 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2014-06-25 23:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Jeff Moyer, Dave Chinner

On Wed, 25 Jun 2014 08:58:46 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Historically kswapd scanned from DMA->Movable in the opposite direction
> to the page allocator to avoid allocating behind kswapd direction of
> progress. The fair zone allocation policy altered this in a non-obvious
> manner.
> 
> Traditionally, the page allocator prefers to use the highest eligible zone
> until the watermark is depleted, woke kswapd and moved onto the next zone.
> kswapd scans zones in the opposite direction so the scanning lists on
> 64-bit look like this;
> 
> ...
>
> Note that this patch makes a large performance difference for lower
> numbers of threads and brings performance closer to 3.0 figures. It was
> also tested against xfs and there are similar gains although I don't have
> 3.0 figures to compare against. There are still regressions for higher
> number of threads but this is related to changes in the CFQ IO scheduler.
> 

Why did this patch make a difference to sequential read performance? 
IOW, by what means was/is reclaim interfering with sequential reads?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-25  7:58   ` Mel Gorman
@ 2014-06-25 23:35     ` Andrew Morton
  -1 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2014-06-25 23:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Jeff Moyer, Dave Chinner

On Wed, 25 Jun 2014 08:58:48 +0100 Mel Gorman <mgorman@suse.de> wrote:

> @@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
>   */
>  bool zone_dirty_ok(struct zone *zone)
>  {
> -	unsigned long limit = zone_dirty_limit(zone);
> +	unsigned long limit = zone->dirty_limit_cached;
> +	struct task_struct *tsk = current;
> +
> +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> +		limit = zone_dirty_limit(zone);
> +		zone->dirty_limit_cached = limit;
> +		limit += limit / 4;
> +	}

Could we get a comment in here explaining what we're doing and why
PF_LESS_THROTTLE and rt_task control whether we do it?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-25 23:35     ` Andrew Morton
@ 2014-06-26  8:43       ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-26  8:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Jeff Moyer, Dave Chinner

On Wed, Jun 25, 2014 at 04:35:28PM -0700, Andrew Morton wrote:
> On Wed, 25 Jun 2014 08:58:48 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > @@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
> >   */
> >  bool zone_dirty_ok(struct zone *zone)
> >  {
> > -	unsigned long limit = zone_dirty_limit(zone);
> > +	unsigned long limit = zone->dirty_limit_cached;
> > +	struct task_struct *tsk = current;
> > +
> > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > +		limit = zone_dirty_limit(zone);
> > +		zone->dirty_limit_cached = limit;
> > +		limit += limit / 4;
> > +	}
> 
> Could we get a comment in here explaining what we're doing and why
> PF_LESS_THROTTLE and rt_task control whether we do it?
> 

        /*
         * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
         * and real-time tasks to prioritise their allocations.
         * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
         * blocking tasks that can clean pages.
         */

That's fairly weak though. It would also seem reasonable to just delete
this check and allow PF_LESS_THROTTLE and rt_tasks to fall into the slow
path if dirty pages are already fairly distributed between zones.
Johannes, any objection to that limit raising logic being deleted?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm: vmscan: Do not reclaim from lower zones if they are balanced
  2014-06-25 23:32     ` Andrew Morton
@ 2014-06-26 10:17       ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-26 10:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Jeff Moyer, Dave Chinner

On Wed, Jun 25, 2014 at 04:32:50PM -0700, Andrew Morton wrote:
> On Wed, 25 Jun 2014 08:58:46 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > Historically kswapd scanned from DMA->Movable in the opposite direction
> > to the page allocator to avoid allocating behind kswapd direction of
> > progress. The fair zone allocation policy altered this in a non-obvious
> > manner.
> > 
> > Traditionally, the page allocator prefers to use the highest eligible zone
> > until the watermark is depleted, woke kswapd and moved onto the next zone.
> > kswapd scans zones in the opposite direction so the scanning lists on
> > 64-bit look like this;
> > 
> > ...
> >
> > Note that this patch makes a large performance difference for lower
> > numbers of threads and brings performance closer to 3.0 figures. It was
> > also tested against xfs and there are similar gains although I don't have
> > 3.0 figures to compare against. There are still regressions for higher
> > number of threads but this is related to changes in the CFQ IO scheduler.
> > 
> 
> Why did this patch make a difference to sequential read performance? 
> IOW, by what means was/is reclaim interfering with sequential reads?
> 

The fair zone allocation policy interleaves allocations between Normal and
DMA32. Kswapd reclaims from DMA->Highest, where Highest is an unbalanced
zone, so kswapd will reclaim from DMA32 even if DMA32 is balanced when
Normal is below its watermarks.

Let N = high_wmark(Normal) + high_wmark(DMA32)

Of the last N allocations, some percentage will have been allocated from
Normal and some from DMA32. The percentage depends on the ratio of the zone
sizes and on when their watermarks were hit. If Normal is unbalanced, DMA32
will be shrunk by kswapd; if DMA32 is unbalanced, only DMA32 will be shrunk.
This leads to a difference in page ages between DMA32 and Normal. Relatively
young pages are then continually rotated and reclaimed from DMA32 due to the
higher zone being unbalanced.
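
As a purely illustrative example with invented watermark values (the real
split depends on the zone sizes and on when the watermarks were hit, as
described above):

#include <stdio.h>

int main(void)
{
	/* hypothetical watermarks in pages, not from any real machine */
	unsigned long wmark_normal = 96000;	/* high_wmark(Normal) */
	unsigned long wmark_dma32  = 24000;	/* high_wmark(DMA32)  */
	unsigned long N = wmark_normal + wmark_dma32;

	printf("N = %lu pages\n", N);
	printf("rough share from Normal: %.0f%%\n", 100.0 * wmark_normal / N);
	printf("rough share from DMA32:  %.0f%%\n", 100.0 * wmark_dma32 / N);

	/*
	 * With a split like this, the newest DMA32 pages are only tens of
	 * thousands of allocations old when kswapd shrinks DMA32 because
	 * Normal dropped below its watermark, i.e. recently-used pagecache
	 * can be reclaimed from the lower zone.
	 */
	return 0;
}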

A debugging patch showed that some PageReadahead pages are reaching the
end of the LRU. The number is not very high, but it is there. Monitoring
nr_free_pages on a per-zone basis shows that there is constant reclaim
from the lower zones even when the watermarks should be ok. The iostats
showed that without the patch more pages are being read.

I believe the difference in sequential read performance is because relatively
young pages recently readahead are being reclaimed from the lower zones.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-26  8:43       ` Mel Gorman
  (?)
@ 2014-06-26 14:37       ` Johannes Weiner
  2014-06-26 14:56           ` Mel Gorman
  -1 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2014-06-26 14:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel, Jens Axboe,
	Jeff Moyer, Dave Chinner

On Thu, Jun 26, 2014 at 09:43:14AM +0100, Mel Gorman wrote:
> On Wed, Jun 25, 2014 at 04:35:28PM -0700, Andrew Morton wrote:
> > On Wed, 25 Jun 2014 08:58:48 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > @@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
> > >   */
> > >  bool zone_dirty_ok(struct zone *zone)
> > >  {
> > > -	unsigned long limit = zone_dirty_limit(zone);
> > > +	unsigned long limit = zone->dirty_limit_cached;
> > > +	struct task_struct *tsk = current;
> > > +
> > > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > > +		limit = zone_dirty_limit(zone);
> > > +		zone->dirty_limit_cached = limit;
> > > +		limit += limit / 4;
> > > +	}
> > 
> > Could we get a comment in here explaining what we're doing and why
> > PF_LESS_THROTTLE and rt_task control whether we do it?
> > 
> 
>         /*
>          * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
>          * and real-time tasks to prioritise their allocations.
>          * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
>          * blocking tasks that can clean pages.
>          */
> 
> That's fairly weak though. It would also seem reasonable to just delete
> this check and allow PF_LESS_THROTTLE and rt_tasks to fall into the slow
> path if dirty pages are already fairly distributed between zones.
> Johannes, any objection to that limit raising logic being deleted?

I copied that over from global_dirty_limits() such that the big
picture and the per-zone picture have the same view - otherwise these
tasks fall back to first fit zone allocations before global limits
start throttling dirtiers and waking up the flushers.  This increases
the probability of reclaim running into dirty pages.
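
For context, the fast path under discussion, with the suggested comment
folded in, would read roughly as follows. The closing comparison is
paraphrased from mainline page-writeback.c of that era rather than copied
from this patch, so treat the whole thing as a sketch:

	bool zone_dirty_ok(struct zone *zone)
	{
		unsigned long limit = zone->dirty_limit_cached;
		struct task_struct *tsk = current;

		if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
			/*
			 * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE
			 * (ie. nfsd) and real-time tasks to prioritise their
			 * allocations.  PF_LESS_THROTTLE tasks may be cleaning
			 * memory and rt tasks may be blocking tasks that can
			 * clean pages.
			 */
			limit = zone_dirty_limit(zone);
			zone->dirty_limit_cached = limit;
			limit += limit / 4;
		}

		/* Paraphrased tail: is the zone still below its dirty limit? */
		return zone_page_state(zone, NR_FILE_DIRTY) +
		       zone_page_state(zone, NR_UNSTABLE_NFS) +
		       zone_page_state(zone, NR_WRITEBACK) <= limit;
	}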

Would you remove it from global_dirty_limits() as well?

On that note, I don't really understand why global_dirty_limits()
raises the *background* limit for less-throttle/rt tasks, shouldn't it
only raise the dirty limit?  Sure, the throttle point is somewhere
between the two limits, but we don't really want to defer waking up
the flushers for them.
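
The raise being questioned is, roughly, the following excerpt (paraphrased
from mm/page-writeback.c of that era with the limit calculation elided; it
is not part of this series):

	void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
	{
		unsigned long background, dirty;
		struct task_struct *tsk = current;

		/* ... background and dirty derived from the vm.dirty_* sysctls ... */

		if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
			background += background / 4;	/* the background raise in question */
			dirty += dirty / 4;
		}
		*pbackground = background;
		*pdirty = dirty;
	}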

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-26 14:37       ` Johannes Weiner
@ 2014-06-26 14:56           ` Mel Gorman
  0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-26 14:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel, Jens Axboe,
	Jeff Moyer, Dave Chinner

On Thu, Jun 26, 2014 at 10:37:38AM -0400, Johannes Weiner wrote:
> On Thu, Jun 26, 2014 at 09:43:14AM +0100, Mel Gorman wrote:
> > On Wed, Jun 25, 2014 at 04:35:28PM -0700, Andrew Morton wrote:
> > > On Wed, 25 Jun 2014 08:58:48 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > @@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
> > > >   */
> > > >  bool zone_dirty_ok(struct zone *zone)
> > > >  {
> > > > -	unsigned long limit = zone_dirty_limit(zone);
> > > > +	unsigned long limit = zone->dirty_limit_cached;
> > > > +	struct task_struct *tsk = current;
> > > > +
> > > > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > > > +		limit = zone_dirty_limit(zone);
> > > > +		zone->dirty_limit_cached = limit;
> > > > +		limit += limit / 4;
> > > > +	}
> > > 
> > > Could we get a comment in here explaining what we're doing and why
> > > PF_LESS_THROTTLE and rt_task control whether we do it?
> > > 
> > 
> >         /*
> >          * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
> >          * and real-time tasks to prioritise their allocations.
> >          * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
> >          * blocking tasks that can clean pages.
> >          */
> > 
> > That's fairly weak though. It would also seem reasonable to just delete
> > this check and allow PF_LESS_THROTTLE and rt_tasks to fall into the slow
> > path if dirty pages are already fairly distributed between zones.
> > Johannes, any objection to that limit raising logic being deleted?
> 
> I copied that over from global_dirty_limits() such that the big
> picture and the per-zone picture have the same view - otherwise these
> tasks fall back to first fit zone allocations before global limits
> start throttling dirtiers and waking up the flushers.  This increases
> the probability of reclaim running into dirty pages.
> 
> Would you remove it from global_dirty_limits() as well?
> 
> On that note, I don't really understand why global_dirty_limits()
> raises the *background* limit for less-throttle/rt tasks, shouldn't it
> only raise the dirty limit?  Sure, the throttle point is somewhere
> between the two limits, but we don't really want to defer waking up
> the flushers for them.

All of which is fair enough and is something that should be examined on
a rainy day (shouldn't take too long in Ireland). I'm not going to touch
it within this series though. It's outside the scope of what I'm trying
to do here -- restore performance of tiobench and bonnie++ to as close to
3.0 levels as possible. The series is tripping up enough on the fair zone
and CFQ aspects as it is without increasing the scope :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-26 14:56           ` Mel Gorman
@ 2014-06-26 15:11             ` Johannes Weiner
  -1 siblings, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2014-06-26 15:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel, Jens Axboe,
	Jeff Moyer, Dave Chinner

On Thu, Jun 26, 2014 at 03:56:32PM +0100, Mel Gorman wrote:
> On Thu, Jun 26, 2014 at 10:37:38AM -0400, Johannes Weiner wrote:
> > On Thu, Jun 26, 2014 at 09:43:14AM +0100, Mel Gorman wrote:
> > > On Wed, Jun 25, 2014 at 04:35:28PM -0700, Andrew Morton wrote:
> > > > On Wed, 25 Jun 2014 08:58:48 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > > > 
> > > > > @@ -325,7 +321,14 @@ static unsigned long zone_dirty_limit(struct zone *zone)
> > > > >   */
> > > > >  bool zone_dirty_ok(struct zone *zone)
> > > > >  {
> > > > > -	unsigned long limit = zone_dirty_limit(zone);
> > > > > +	unsigned long limit = zone->dirty_limit_cached;
> > > > > +	struct task_struct *tsk = current;
> > > > > +
> > > > > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > > > > +		limit = zone_dirty_limit(zone);
> > > > > +		zone->dirty_limit_cached = limit;
> > > > > +		limit += limit / 4;
> > > > > +	}
> > > > 
> > > > Could we get a comment in here explaining what we're doing and why
> > > > PF_LESS_THROTTLE and rt_task control whether we do it?
> > > > 
> > > 
> > >         /*
> > >          * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
> > >          * and real-time tasks to prioritise their allocations.
> > >          * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
> > >          * blocking tasks that can clean pages.
> > >          */
> > > 
> > > That's fairly weak though. It would also seem reasonable to just delete
> > > this check and allow PF_LESS_THROTTLE and rt_tasks to fall into the slow
> > > path if dirty pages are already fairly distributed between zones.
> > > Johannes, any objection to that limit raising logic being deleted?
> > 
> > I copied that over from global_dirty_limits() such that the big
> > picture and the per-zone picture have the same view - otherwise these
> > tasks fall back to first fit zone allocations before global limits
> > start throttling dirtiers and waking up the flushers.  This increases
> > the probability of reclaim running into dirty pages.
> > 
> > Would you remove it from global_dirty_limits() as well?
> > 
> > On that note, I don't really understand why global_dirty_limits()
> > raises the *background* limit for less-throttle/rt tasks, shouldn't it
> > only raise the dirty limit?  Sure, the throttle point is somewhere
> > between the two limits, but we don't really want to defer waking up
> > the flushers for them.
> 
> All of which is fair enough and is something that should be examined on
> a rainy day (shouldn't take too long in Ireland). I'm not going to touch
> it within this series though. It's outside the scope of what I'm trying
> to do here -- restore performance of tiobench and bonnie++ to as close to
> 3.0 levels as possible. The series is tripping up enough on the fair zone
> and CFQ aspects as it is without increasing the scope :(

You asked to remove it; I'm just asking follow-up questions ;-)

I agree that this is out of scope for your patches and it probably
should be left alone for now.  However, I do like your comment and
wouldn't mind including it in this change.

Would everybody be okay with that?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-25  7:58   ` Mel Gorman
@ 2014-06-26 15:36     ` Jeff Moyer
  -1 siblings, 0 replies; 41+ messages in thread
From: Jeff Moyer @ 2014-06-26 15:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Dave Chinner

Mel Gorman <mgorman@suse.de> writes:

> The existing CFQ default target_latency results in very poor performance
> for larger numbers of threads doing sequential reads. While this can be
> easily described as a tuning problem for users, it is one that is tricky
> to detect. This patch updates the default to benefit smaller machines.
> Dave Chinner points out that it is dangerous to assume that people know
> how to tune their IO scheduler. Jeff Moyer asked what workloads even
> care about threaded readers but it's reasonable to assume file,
> media, database and multi-user servers all experience large sequential
> readers from multiple sources at the same time.

Right, and I guess I hadn't considered that case as I thought folks used
more than one spinning disk for such workloads.

My main reservation about this change is that you've only provided
numbers for one benchmark.  To bump the default target_latency, ideally
we'd know how it affects other workloads.  However, I'm having a hard
time justifying putting any time into this for a couple of reasons:
1) blk-mq pretty much does away with the i/o scheduler, and that is the
   future
2) there is work in progress to convert cfq into bfq, and that will
   essentially make any effort put into this irrelevant (so it might be
   interesting to test your workload with bfq)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-26 15:36     ` Jeff Moyer
@ 2014-06-26 16:19       ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-26 16:19 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Dave Chinner

On Thu, Jun 26, 2014 at 11:36:50AM -0400, Jeff Moyer wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > The existing CFQ default target_latency results in very poor performance
> > for larger numbers of threads doing sequential reads. While this can be
> > easily described as a tuning problem for users, it is one that is tricky
> > to detect. This patch updates the default to benefit smaller machines.
> > Dave Chinner points out that it is dangerous to assume that people know
> > how to tune their IO scheduler. Jeff Moyer asked what workloads even
> > care about threaded readers but it's reasonable to assume file,
> > media, database and multi-user servers all experience large sequential
> > readers from multiple sources at the same time.
> 
> Right, and I guess I hadn't considered that case as I thought folks used
> more than one spinning disk for such workloads.
> 

They probably are but by and large my IO testing is based on simple
storage. The reasoning is that if we get the simple case wrong then we
probably are getting the complex case wrong too or at least not performing
as well as we should. I also don't use SSD on my own machines for the
same reason.

> My main reservation about this change is that you've only provided
> numbers for one benchmark. 

The other obvious one to run would be pgbench workloads but it's a rathole of
arguing whether the configuration is valid and whether it's inappropriate
to test on simple storage. The tiobench tests alone take a long time to
complete -- 1.5 hours on a simple machine, 7 hours on a low-end NUMA machine.

> To bump the default target_latency, ideally
> we'd know how it affects other workloads.  However, I'm having a hard
> time justifying putting any time into this for a couple of reasons:
> 1) blk-mq pretty much does away with the i/o scheduler, and that is the
>    future
> 2) there is work in progress to convert cfq into bfq, and that will
>    essentially make any effort put into this irrelevant (so it might be
>    interesting to test your workload with bfq)
> 

Ok, you've convinced me and I'll drop this patch. Anyone based on kernels
from around this time can tune CFQ or buy a better disk. Hopefully they
will find this via Google.

	There are multiple process sequential read regressions that are
	getting progressively worse since 3.0 that are partially explained
	by changes to CFQ. You may have experienced this if you are using
	a kernel somewhere between 3.3 and 3.15. It's not really a bug in
	CFQ but a difference in objectives. CFQ in later kernels is more
	fair to threads and this partially sacrifices overall throughput
	for lower latency experienced by each of the processes.  If you
	see that the problem goes away using a different IO scheduler but
	need to use CFQ for whatever reason then tune target_latency to
	higher values. 600 appears to work reasonably well for a single
	disk but you may need higher. Keep an eye on IO fairness if that
	is something your workload is sensitive to. Your other option is
	to disable low_latency in CFQ. As always, check what the most
	recent kernel is, particularly if there have been interesting
	things happening in either blk-mq or the bfq IO scheduler.
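
A trivial helper for applying that tuning is sketched below. The device
name is an assumption (adjust sda to suit), the value is in milliseconds
and the iosched directory only exists while CFQ is the active scheduler
for that queue:

	/* Sketch: bump CFQ's target_latency for one disk. */
	#include <stdio.h>

	int main(void)
	{
		const char *knob =
			"/sys/block/sda/queue/iosched/target_latency";
		FILE *f = fopen(knob, "w");

		if (!f) {
			perror(knob);
			return 1;
		}
		fprintf(f, "600\n");	/* default is 300 (milliseconds) */
		return fclose(f) ? 1 : 0;
	}

The same approach applies to low_latency (write 0 to the low_latency
attribute in the same iosched directory) if disabling it is preferred.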

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-26 16:19       ` Mel Gorman
@ 2014-06-26 16:50         ` Jeff Moyer
  -1 siblings, 0 replies; 41+ messages in thread
From: Jeff Moyer @ 2014-06-26 16:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Dave Chinner

Mel Gorman <mgorman@suse.de> writes:

> On Thu, Jun 26, 2014 at 11:36:50AM -0400, Jeff Moyer wrote:
>> Right, and I guess I hadn't considered that case as I thought folks used
>> more than one spinning disk for such workloads.
>> 
>
> They probably are but by and large my IO testing is based on simple
> storage. The reasoning is that if we get the simple case wrong then we
> probably are getting the complex case wrong too or at least not performing
> as well as we should. I also don't use SSD on my own machines for the
> same reason.

A single disk is actually the hard case in this instance, but I
understand what you're saying.  ;-)

>> My main reservation about this change is that you've only provided
>> numbers for one benchmark. 
>
> The other obvious one to run would be pgbench workloads but it's a rathole of
> arguing whether the configuration is valid and whether it's inappropriate
> to test on simple storage. The tiobench tests alone take a long time to
> complete -- 1.5 hours on a simple machine, 7 hours on a low-end NUMA machine.

And we should probably run our standard set of I/O exercisers at the
very least.  But, like I said, it seems like wasted effort.

>> To bump the default target_latency, ideally
>> we'd know how it affects other workloads.  However, I'm having a hard
>> time justifying putting any time into this for a couple of reasons:
>> 1) blk-mq pretty much does away with the i/o scheduler, and that is the
>>    future
>> 2) there is work in progress to convert cfq into bfq, and that will
>>    essentially make any effort put into this irrelevant (so it might be
>>    interesting to test your workload with bfq)
>> 
>
> Ok, you've convinced me and I'll drop this patch. For anyone based on
> kernels from around this time they can tune CFQ or buy a better disk.
> Hopefully they will find this via Google.

Funny, I wasn't weighing in against your patch.  I was merely indicating
that I personally wasn't going to invest the time to validate it.  But,
if you're ok with dropping it, that's obviously fine with me.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-26 16:50         ` Jeff Moyer
@ 2014-06-26 17:45           ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-06-26 17:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Dave Chinner

On Thu, Jun 26, 2014 at 12:50:32PM -0400, Jeff Moyer wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Thu, Jun 26, 2014 at 11:36:50AM -0400, Jeff Moyer wrote:
> >> Right, and I guess I hadn't considered that case as I thought folks used
> >> more than one spinning disk for such workloads.
> >> 
> >
> > They probably are but by and large my IO testing is based on simple
> > storage. The reasoning is that if we get the simple case wrong then we
> > probably are getting the complex case wrong too or at least not performing
> > as well as we should. I also don't use SSD on my own machines for the
> > same reason.
> 
> A single disk is actually the hard case in this instance, but I
> understand what you're saying.  ;-)
> 
> >> My main reservation about this change is that you've only provided
> >> numbers for one benchmark. 
> >
> > The other obvious one to run would be pgbench workloads but it's a rathole of
> > arguing whether the configuration is valid and whether it's inappropriate
> > to test on simple storage. The tiobench tests alone take a long time to
> > complete -- 1.5 hours on a simple machine, 7 hours on a low-end NUMA machine.
> 
> And we should probably run our standard set of I/O exercisers at the
> very least.  But, like I said, it seems like wasted effort.
> 

Out of curiosity, what do you consider to be the standard set of I/O
exercisers? I have a whole battery of them that are run against major
releases to track performance over time -- tiobench (it's stupid, but too
many people use it), fsmark used in various configurations (single/multi
threaded, zero-sized and large files), postmark (file sizes fairly
small, working set 2xRAM), bonnie++ (2xRAM), ffsb used in a mail server
configuration (taken from btrfs tests), dbench3 (checking in-memory updates,
not a realistic IO benchmark), dbench4 (a bit more realistic although at high
thread counts it gets silly and overall it's not a stable predictor of
performance), sysbench in various configurations, pgbench used in limited
configurations, stutter which tends to hit the worst-case interactivity
issues experienced on desktops, and kernel builds are the main ones. It
takes days to churn through the full set of tests which is why I don't do
it for a patch series.  I selected tiobench this time because it was the
most reliable test to cover both single and multiple-sources-of-IO cases.
If I merge a major change I'll usually then watch the next major release
and double check that nothing else broke.

> >> To bump the default target_latency, ideally
> >> we'd know how it affects other workloads.  However, I'm having a hard
> >> time justifying putting any time into this for a couple of reasons:
> >> 1) blk-mq pretty much does away with the i/o scheduler, and that is the
> >>    future
> >> 2) there is work in progress to convert cfq into bfq, and that will
> >>    essentially make any effort put into this irrelevant (so it might be
> >>    interesting to test your workload with bfq)
> >> 
> >
> > Ok, you've convinced me and I'll drop this patch. For anyone based on
> > kernels from around this time they can tune CFQ or buy a better disk.
> > Hopefully they will find this via Google.
> 
> Funny, I wasn't weighing in against your patch.  I was merely indicating
> that I personally wasn't going to invest the time to validate it.  But,
> if you're ok with dropping it, that's obviously fine with me.
> 

I fear the writing is on the wall that it'll never pass the "have you
tested every workload" test and, no matter what, a counter-example will be
found where it's the wrong setting. If CFQ is going to be irrelevant soon,
it's just not worth wasting the electricity against a mainline kernel.

I'm still interested in what you consider your standard set of IO exercisers
though because I can slot any missing parts into the tests that run for
every mainline release.

The main one I'm missing is the postgres folks' fsync benchmark. I wrote
the automation months ago but never activated it because there are enough
known problems already.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] cfq: Increase default value of target_latency
  2014-06-26 17:45           ` Mel Gorman
@ 2014-06-26 18:04             ` Jeff Moyer
  -1 siblings, 0 replies; 41+ messages in thread
From: Jeff Moyer @ 2014-06-26 18:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner,
	Jens Axboe, Dave Chinner

Mel Gorman <mgorman@suse.de> writes:

>> And we should probably run our standard set of I/O exercisers at the
>> very least.  But, like I said, it seems like wasted effort.
>> 
>
> Out of curiousity, what do you consider to be the standard set of I/O
> exercisers?

Yes, that was vague, sorry.  I was referring to any I/O generator that
will perform sequential and random I/O (writes, re-writes, reads, random
writes, random reads, strided reads, backwards reads, etc).  We use
iozone internally, testing both buffered and direct I/O, varying file
and record sizes, across multiple file systems.  Data sets that fall
inside of the page cache tend to have a high standard deviation, so, as
an I/O guy, I ignore those.  ;-)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-07-09  8:13   ` Mel Gorman
@ 2014-07-10 12:01     ` Johannes Weiner
  -1 siblings, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:03AM +0100, Mel Gorman wrote:
> The LRU insertion and activate tracepoints take PFN as a parameter forcing
> the overhead to the caller.  Move the overhead to the tracepoint fast-assign
> method to ensure the cost is only incurred when the tracepoint is active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
@ 2014-07-09  8:13   ` Mel Gorman
  0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The LRU insertion and activate tracepoints take PFN as a parameter forcing
the overhead to the caller.  Move the overhead to the tracepoint fast-assign
method to ensure the cost is only incurred when the tracepoint is active.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5
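
As an aside on why this is free when tracing is off: the generated trace_*
wrapper for each tracepoint is guarded by a static key, so TP_fast_assign,
and therefore the page_to_pfn() call, only runs when the event is enabled.
Conceptually it behaves like the sketch below; the __do_trace helper name
is made up and this is not the real macro expansion:

	/* Conceptual sketch only, not the real tracepoint expansion. */
	static inline void trace_mm_lru_activate(struct page *page)
	{
		/* Compiled to a patched-out branch while the event is
		 * disabled, so the TP_fast_assign body (and its
		 * page_to_pfn(page)) is never evaluated. */
		if (static_key_false(&__tracepoint_mm_lru_activate.key))
			__do_trace_mm_lru_activate(page);	/* hypothetical helper */
	}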


^ permalink raw reply related	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2014-07-10 12:01 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-25  7:58 [PATCH 0/6] Improve sequential read throughput v2 Mel Gorman
2014-06-25  7:58 ` Mel Gorman
2014-06-25  7:58 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-25  7:58 ` [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-25  7:58 ` [PATCH 3/6] mm: vmscan: Do not reclaim from lower zones if they are balanced Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-25 23:32   ` Andrew Morton
2014-06-25 23:32     ` Andrew Morton
2014-06-26 10:17     ` Mel Gorman
2014-06-26 10:17       ` Mel Gorman
2014-06-25  7:58 ` [PATCH 4/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-25  7:58 ` [PATCH 5/6] mm: page_alloc: Reduce cost of dirty zone balancing Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-25 23:35   ` Andrew Morton
2014-06-25 23:35     ` Andrew Morton
2014-06-26  8:43     ` Mel Gorman
2014-06-26  8:43       ` Mel Gorman
2014-06-26 14:37       ` Johannes Weiner
2014-06-26 14:56         ` Mel Gorman
2014-06-26 14:56           ` Mel Gorman
2014-06-26 15:11           ` Johannes Weiner
2014-06-26 15:11             ` Johannes Weiner
2014-06-25  7:58 ` [PATCH 6/6] cfq: Increase default value of target_latency Mel Gorman
2014-06-25  7:58   ` Mel Gorman
2014-06-26 15:36   ` Jeff Moyer
2014-06-26 15:36     ` Jeff Moyer
2014-06-26 16:19     ` Mel Gorman
2014-06-26 16:19       ` Mel Gorman
2014-06-26 16:50       ` Jeff Moyer
2014-06-26 16:50         ` Jeff Moyer
2014-06-26 17:45         ` Mel Gorman
2014-06-26 17:45           ` Mel Gorman
2014-06-26 18:04           ` Jeff Moyer
2014-06-26 18:04             ` Jeff Moyer
2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
2014-07-09  8:13   ` Mel Gorman
2014-07-10 12:01   ` Johannes Weiner
2014-07-10 12:01     ` Johannes Weiner
