* [PATCH 0/5] Improve sequential read throughput v3
From: Mel Gorman @ 2014-06-27  8:14 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

Changelog since v2
o Simplify fair zone policy cost reduction
o Drop CFQ patch

Changelog since v1
o Rebase to v3.16-rc2
o Move CFQ patch to the end of the series where it can be rejected more
  easily if necessary
o Introduce a page-reclaim patch related to kswapd/fairzone interactions
o Rework fast zone policy patch

IO performance since 3.0 has been a mixed bag. In many respects we are
better and in some we are worse, and one of those places is sequential
read throughput. This is visible in a number of benchmarks but I looked
at tiobench the closest. The figures are from ext3 on a mid-range desktop,
with and without the series applied.

                                      3.16.0-rc2            3.16.0-rc2
                                         vanilla             lessdirty
Min    SeqRead-MB/sec-1         120.92 (  0.00%)      140.73 ( 16.38%)
Min    SeqRead-MB/sec-2         100.25 (  0.00%)      117.43 ( 17.14%)
Min    SeqRead-MB/sec-4          96.27 (  0.00%)      109.01 ( 13.23%)
Min    SeqRead-MB/sec-8          83.55 (  0.00%)       90.86 (  8.75%)
Min    SeqRead-MB/sec-16         66.77 (  0.00%)       74.12 ( 11.01%)

Overall system CPU usage is reduced

          3.16.0-rc2  3.16.0-rc2
             vanilla lessdirty-v3
User          390.13      390.20
System        404.41      379.08
Elapsed      5412.45     5123.74

This series does not fully restore throughput performance to 3.0 levels
but it brings it close for lower thread counts. Higher thread counts are
known to be worse than 3.0 due to CFQ changes but there is no appetite
for changing the defaults there.

 include/linux/mmzone.h         | 210 ++++++++++++++++++++++-------------------
 include/linux/writeback.h      |   1 +
 include/trace/events/pagemap.h |  16 ++--
 mm/internal.h                  |   1 +
 mm/mm_init.c                   |   4 +-
 mm/page-writeback.c            |  23 +++--
 mm/page_alloc.c                | 173 ++++++++++++++++++++-------------
 mm/swap.c                      |   4 +-
 mm/vmscan.c                    |  16 ++--
 mm/vmstat.c                    |   4 +-
 10 files changed, 258 insertions(+), 194 deletions(-)

-- 
1.8.4.5



* [PATCH 1/5] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
From: Mel Gorman @ 2014-06-27  8:14 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The LRU insertion and activate tracepoints take a PFN as a parameter,
forcing the overhead onto the caller. Move the overhead to the tracepoint
fast-assign method to ensure the cost is only incurred when the tracepoint
is active.
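
For context on why this is free when tracing is off: the trace_*() wrapper
generated by TRACE_EVENT() is guarded by a static key, so the work done in
TP_fast_assign() -- now including the page_to_pfn() call -- only runs when
the tracepoint is enabled. A simplified sketch of the generated wrapper is
below; it is illustrative only, and do_trace_mm_lru_activate() is a
placeholder name rather than the real macro expansion.

	/* Simplified sketch, not the literal expansion from tracepoint.h */
	static inline void trace_mm_lru_activate(struct page *page)
	{
		/* static key: patched-out branch while the event is disabled */
		if (static_key_false(&__tracepoint_mm_lru_activate.key))
			/* probe runs TP_fast_assign(), i.e. page_to_pfn(page) */
			do_trace_mm_lru_activate(page);
	}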

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5



* [PATCH 2/5] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
From: Mel Gorman @ 2014-06-27  8:14 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The arrangement of struct zone has changed over time and it has now reached
the point where there is some inappropriate sharing going on. On x86-64, for
example:

o The zone->node field shares a cache line with the zone lock, and zone->node
  is accessed frequently from the page allocator due to the fair zone
  allocation policy.
o span_seqlock is almost never used but shares a cache line with free_area.
o Some zone statistics share a cache line with the LRU lock, so
  reclaim-intensive and allocator-intensive workloads can bounce the cache
  line on a stat update.

This patch rearranges struct zone to put read-only and read-mostly fields
together and then splits the page allocator intensive fields, the zone
statistics and the page reclaim intensive fields into their own cache
lines. Note that arguably the biggest change is reducing the size of the
lowmem_reserve type. It should still be large enough, but by shrinking it
the fields used by the page allocator fast path all fit in one cache line.

On the test configuration I used, the overall size of struct zone shrank
by one cache line.
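
As a side note, the intended grouping can be inspected after a build with
"pahole -C zone vmlinux", or with a compile-time assertion along the lines
of the sketch below. This is illustrative only and not part of the patch;
check_zone_layout() is a hypothetical helper, and it assumes 64-byte cache
lines and a small enough MAX_NR_ZONES.

	/* Hypothetical sanity check, not included in this series */
	static void __init check_zone_layout(void)
	{
		/*
		 * The per-cpu pageset pointer is on the allocator hot path
		 * and should sit in the first (read-mostly) cache line.
		 */
		BUILD_BUG_ON(offsetof(struct zone, pageset) >= SMP_CACHE_BYTES);
	}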

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h | 201 +++++++++++++++++++++++++------------------------
 mm/page_alloc.c        |  13 ++--
 mm/vmstat.c            |   4 +-
 3 files changed, 113 insertions(+), 105 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cbd1b6..a2f6443 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -324,19 +324,12 @@ enum zone_type {
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
-	/* Fields commonly accessed by the page allocator */
+	/* Read-mostly fields */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
 	/*
-	 * When free pages are below this point, additional steps are taken
-	 * when reading the number of free pages to avoid per-cpu counter
-	 * drift allowing watermarks to be breached
-	 */
-	unsigned long percpu_drift_mark;
-
-	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -344,41 +337,17 @@ struct zone {
 	 * on the higher zones). This array is recalculated at runtime if the
 	 * sysctl_lowmem_reserve_ratio sysctl changes.
 	 */
-	unsigned long		lowmem_reserve[MAX_NR_ZONES];
-
-	/*
-	 * This is a per-zone reserve of pages that should not be
-	 * considered dirtyable memory.
-	 */
-	unsigned long		dirty_balance_reserve;
+	unsigned int lowmem_reserve[MAX_NR_ZONES];
 
+	struct per_cpu_pageset __percpu *pageset;
 #ifdef CONFIG_NUMA
 	int node;
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
 #endif
-	struct per_cpu_pageset __percpu *pageset;
 	/*
-	 * free areas of different sizes
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this zone's LRU.  Maintained by the pageout code.
 	 */
-	spinlock_t		lock;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
-	/* pfn where compaction free scanner should start */
-	unsigned long		compact_cached_free_pfn;
-	/* pfn where async and sync compaction migration scanner should start */
-	unsigned long		compact_cached_migrate_pfn[2];
-#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* see spanned/present_pages for more description */
-	seqlock_t		span_seqlock;
-#endif
-	struct free_area	free_area[MAX_ORDER];
+	unsigned int inactive_ratio;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
@@ -388,74 +357,37 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_COMPACTION
 	/*
-	 * On compaction failure, 1<<compact_defer_shift compactions
-	 * are skipped before trying again. The number attempted since
-	 * last failure is tracked with compact_considered.
+	 * This is a per-zone reserve of pages that should not be
+	 * considered dirtyable memory.
 	 */
-	unsigned int		compact_considered;
-	unsigned int		compact_defer_shift;
-	int			compact_order_failed;
-#endif
-
-	ZONE_PADDING(_pad1_)
-
-	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
-	struct lruvec		lruvec;
-
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
-	unsigned long		pages_scanned;	   /* since last reclaim */
-	unsigned long		flags;		   /* zone flags, see below */
-
-	/* Zone statistics */
-	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
+	unsigned long		dirty_balance_reserve;
 
 	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
 	 */
-	unsigned int inactive_ratio;
-
-
-	ZONE_PADDING(_pad2_)
-	/* Rarely used or read-mostly fields */
+	unsigned long percpu_drift_mark;
 
+#ifdef CONFIG_NUMA
 	/*
-	 * wait_table		-- the array holding the hash table
-	 * wait_table_hash_nr_entries	-- the size of the hash table array
-	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
-	 *
-	 * The purpose of all these is to keep track of the people
-	 * waiting for a page to become available and make them
-	 * runnable again when possible. The trouble is that this
-	 * consumes a lot of space, especially when so few things
-	 * wait on pages at a given time. So instead of using
-	 * per-page waitqueues, we use a waitqueue hash table.
-	 *
-	 * The bucket discipline is to sleep on the same queue when
-	 * colliding and wake all in that wait queue when removing.
-	 * When something wakes, it must check to be sure its page is
-	 * truly available, a la thundering herd. The cost of a
-	 * collision is great, but given the expected load of the
-	 * table, they should be so rare as to be outweighed by the
-	 * benefits from the saved space.
-	 *
-	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
-	 * primary users of these fields, and in mm/page_alloc.c
-	 * free_area_init_core() performs the initialization of them.
+	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	wait_queue_head_t	* wait_table;
-	unsigned long		wait_table_hash_nr_entries;
-	unsigned long		wait_table_bits;
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
+	const char		*name;
 
 	/*
-	 * Discontig memory support fields.
+	 * Number of MIGRATE_RESEVE page block. To maintain for just
+	 * optimization. Protected by zone->lock.
 	 */
+	int			nr_migrate_reserve_block;
+
 	struct pglist_data	*zone_pgdat;
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -504,16 +436,89 @@ struct zone {
 	unsigned long		present_pages;
 	unsigned long		managed_pages;
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/* see spanned/present_pages for more description */
+	seqlock_t		span_seqlock;
+#endif
+
 	/*
-	 * Number of MIGRATE_RESEVE page block. To maintain for just
-	 * optimization. Protected by zone->lock.
+	 * wait_table		-- the array holding the hash table
+	 * wait_table_hash_nr_entries	-- the size of the hash table array
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
+	 *
+	 * The purpose of all these is to keep track of the people
+	 * waiting for a page to become available and make them
+	 * runnable again when possible. The trouble is that this
+	 * consumes a lot of space, especially when so few things
+	 * wait on pages at a given time. So instead of using
+	 * per-page waitqueues, we use a waitqueue hash table.
+	 *
+	 * The bucket discipline is to sleep on the same queue when
+	 * colliding and wake all in that wait queue when removing.
+	 * When something wakes, it must check to be sure its page is
+	 * truly available, a la thundering herd. The cost of a
+	 * collision is great, but given the expected load of the
+	 * table, they should be so rare as to be outweighed by the
+	 * benefits from the saved space.
+	 *
+	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
+	 * primary users of these fields, and in mm/page_alloc.c
+	 * free_area_init_core() performs the initialization of them.
 	 */
-	int			nr_migrate_reserve_block;
+	wait_queue_head_t	*wait_table;
+	unsigned long		wait_table_hash_nr_entries;
+	unsigned long		wait_table_bits;
+
+	ZONE_PADDING(_pad1_)
+
+	/* Write-intensive fields used from the page allocator */
+	spinlock_t		lock;
+
+	/* free areas of different sizes */
+	struct free_area	free_area[MAX_ORDER];
+
+	/* zone flags, see below */
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
+	/* Write-intensive fields used by page reclaim */
+
+	/* Fields commonly accessed by the page reclaim scanner */
+	spinlock_t		lru_lock;
+	struct lruvec		lruvec;
+
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
+	unsigned long		pages_scanned;	   /* since last reclaim */
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where compaction free scanner should start */
+	unsigned long		compact_cached_free_pfn;
+	/* pfn where async and sync compaction migration scanner should start */
+	unsigned long		compact_cached_migrate_pfn[2];
+#endif
 
+#ifdef CONFIG_COMPACTION
 	/*
-	 * rarely used fields:
+	 * On compaction failure, 1<<compact_defer_shift compactions
+	 * are skipped before trying again. The number attempted since
+	 * last failure is tracked with compact_considered.
 	 */
-	const char		*name;
+	unsigned int		compact_considered;
+	unsigned int		compact_defer_shift;
+	int			compact_order_failed;
+#endif
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
+#endif
+
+	ZONE_PADDING(_pad3_)
+	/* Zone statistics */
+	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..ebbdbcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1699,7 +1699,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long lowmem_reserve = z->lowmem_reserve[classzone_idx];
 	int o;
 	long free_cma = 0;
 
@@ -1714,7 +1713,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + lowmem_reserve)
+	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -3245,7 +3244,7 @@ void show_free_areas(unsigned int filter)
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
-			printk(" %lu", zone->lowmem_reserve[i]);
+			printk(" %u", zone->lowmem_reserve[i]);
 		printk("\n");
 	}
 
@@ -5566,7 +5565,7 @@ static void calculate_totalreserve_pages(void)
 	for_each_online_pgdat(pgdat) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			unsigned long max = 0;
+			unsigned int max = 0;
 
 			/* Find valid and maximum lowmem_reserve in the zone */
 			for (j = i; j < MAX_NR_ZONES; j++) {
@@ -5617,6 +5616,7 @@ static void setup_per_zone_lowmem_reserve(void)
 			idx = j;
 			while (idx) {
 				struct zone *lower_zone;
+				unsigned long reserve;
 
 				idx--;
 
@@ -5624,8 +5624,11 @@ static void setup_per_zone_lowmem_reserve(void)
 					sysctl_lowmem_reserve_ratio[idx] = 1;
 
 				lower_zone = pgdat->node_zones + idx;
-				lower_zone->lowmem_reserve[j] = managed_pages /
+				reserve = managed_pages /
 					sysctl_lowmem_reserve_ratio[idx];
+				if (WARN_ON(reserve > UINT_MAX))
+					reserve = UINT_MAX;
+				lower_zone->lowmem_reserve[j] = reserve;
 				managed_pages += lower_zone->managed_pages;
 			}
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49..c6d6fae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1077,10 +1077,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 				zone_page_state(zone, i));
 
 	seq_printf(m,
-		   "\n        protection: (%lu",
+		   "\n        protection: (%u",
 		   zone->lowmem_reserve[0]);
 	for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
-		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+		seq_printf(m, ", %u", zone->lowmem_reserve[i]);
 	seq_printf(m,
 		   ")"
 		   "\n  pagesets");
-- 
1.8.4.5



* [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced
From: Mel Gorman @ 2014-06-27  8:14 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

Historically kswapd scanned from DMA->Movable, in the opposite direction
to the page allocator, to avoid the allocator allocating behind kswapd's
direction of progress. The fair zone allocation policy altered this in a
non-obvious manner.

Traditionally, the page allocator prefers to use the highest eligible zone
until its watermark is depleted, then wakes kswapd and moves onto the next
zone. kswapd scans zones in the opposite direction so the scanning lists on
64-bit look like this:

Page alloc		Kswapd
----------              ------
Movable			DMA
Normal			DMA32
DMA32			Normal
DMA			Movable

If kswapd scanned in the same direction as the page allocator then it is
possible that kswapd would proportionally reclaim the lower zones that
were never used, as the page allocator was always allocating behind the
reclaim. This would work as follows:

	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32
	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32

The introduction of the fair zone allocation policy fundamentally altered
this problem by interleaving between zones until the low watermark is
reached. There are at least two issues with this:

o The page allocator can allocate behind kswapd's progress (kswapd
  scans/reclaims a lower zone and the fair zone allocation policy then uses
  those pages)
o When the low watermark of the higher zone is reached there may be recently
  allocated pages from the lower zone, but as kswapd scans dma->highmem up to
  the highest zone needing balancing it'll reclaim the lower zone even if it
  was balanced.

Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
some percentage will be allocated from Normal and some from DMA32. The
percentage depends on the ratio of the zone sizes and when their watermarks
were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
is unbalanced only DMA32 will be shrunk. This leads to a difference in page
age between DMA32 and Normal. Relatively young pages are then continually
rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
Some of these pages may be recently read-ahead pages, requiring that they be
re-read from disk and impacting overall performance.

The problem is fundamental to the fact that we have per-zone LRU and
allocation policies; ideally we would have only per-node allocation and LRU
lists. This would avoid the need for the fair zone allocation policy but the
low-memory-starvation issue would have to be addressed again from scratch.

This patch will only scan/reclaim from lower zones if they have not
reached their watermark. This should not break the normal page aging
as the proportional allocations due to the fair zone allocation policy
should compensate.
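
The "have not reached their watermark" test is the zone_balanced() check
added in the hunk below. Conceptually it boils down to a high-watermark test
along these lines -- a minimal sketch with a made-up helper name, not the
kernel's zone_balanced(), which also accounts for lowmem_reserve, per-order
free page availability and compaction suitability:

	/* Conceptual sketch only; is_lower_zone_balanced() is hypothetical */
	static bool is_lower_zone_balanced(struct zone *zone)
	{
		/* skip reclaim once the zone meets its high watermark */
		return zone_page_state(zone, NR_FREE_PAGES) >=
		       high_wmark_pages(zone);
	}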

tiobench was used to evaluate this because it includes a simple sequential
reader, which is where the regression is most obvious. It also has threaded
readers that produce reasonably steady figures.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla        checklow-v2r14               vanilla
Min    SeqRead-MB/sec-1         120.96 (  0.00%)      140.63 ( 16.26%)      134.04 ( 10.81%)
Min    SeqRead-MB/sec-2         100.73 (  0.00%)      117.95 ( 17.10%)      120.76 ( 19.88%)
Min    SeqRead-MB/sec-4          96.05 (  0.00%)      109.54 ( 14.04%)      114.49 ( 19.20%)
Min    SeqRead-MB/sec-8          82.46 (  0.00%)       88.22 (  6.99%)       98.04 ( 18.89%)
Min    SeqRead-MB/sec-16         66.37 (  0.00%)       69.14 (  4.17%)       79.49 ( 19.77%)
Mean   RandWrite-MB/sec-16        1.34 (  0.00%)        1.34 (  0.00%)        1.34 (  0.25%)

It was also tested against xfs and there are similar gains. There are
still regressions for higher thread counts but this is related to
changes in the CFQ IO scheduler.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..40c3af8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3124,12 +3124,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * at the last zone which needs scanning. We do this because
+		 * the page allocators prefers to work in the opposite
+		 * direction and we want to avoid the page allocator reclaiming
+		 * behind kswapd's direction of progress. Due to the fair zone
+		 * allocation policy interleaving allocations between zones
+		 * we no longer proportionally scan the lower zones if the
+		 * watermarks are ok.
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -3152,6 +3153,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							&nr_soft_scanned);
 			sc.nr_reclaimed += nr_soft_reclaimed;
 
+			if (zone_balanced(zone, order, 0, 0))
+				continue;
+
 			/*
 			 * There should be no need to raise the scanning
 			 * priority if enough pages are already being scanned
-- 
1.8.4.5



* [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
From: Mel Gorman @ 2014-06-27  8:14 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt is
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does not
hold when zones reach their low watermark as the batches will be consumed
at uneven rates. Allocation failure due to watermark depletion results in
additional zonelist scans for the reset and another watermark check before
hitting the slowpath.

This patch makes a number of changes that should reduce the overall cost:

o Do not apply the fair zone policy to small zones such as DMA
o Abort the fair zone allocation policy once remote or small zones are
  encountered
o Use a simpler scan when resetting NR_ALLOC_BATCH
o Use a simple flag to identify depleted zones instead of accessing a
  potentially write-intensive cache line for counters (see the condensed
  sketch below)
o Track zones that met the watermark but failed the NR_ALLOC_BATCH check
  to avoid rescanning the zonelist when the counters are reset
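
Condensed from the hunks below, the life cycle of the new ZONE_FAIR_DEPLETED
flag looks like this (fragments copied from the diff for readability, not
additional code):

	/* set in buffered_rmqueue() once a zone's fair batch is consumed */
	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
		zone_set_flag(zone, ZONE_FAIR_DEPLETED);

	/* tested in the ALLOC_FAIR pass: a cheap bit test instead of a
	 * read of the write-intensive NR_ALLOC_BATCH counter */
	if (zone_is_fair_depleted(zone)) {
		nr_fair_skipped++;
		continue;
	}

	/* cleared in reset_alloc_batches() after the batch is topped up,
	 * just before the single zonelist rescan without ALLOC_FAIR */
	zone_clear_flag(zone, ZONE_FAIR_DEPLETED);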

On UMA machines, the effect is marginal. Even judging from system CPU
usage it's small for the tiobench test

          3.16.0-rc2  3.16.0-rc2
            checklow    fairzone
User          396.24      396.23
System        395.23      391.50
Elapsed      5182.65     5165.49

And the number of pages allocated from each zone is comparable

                            3.16.0-rc2  3.16.0-rc2
                              checklow    fairzone
DMA allocs                           0           0
DMA32 allocs                   7374217     7920241
Normal allocs                999277551   996568115

On NUMA machines, the scanning overhead is higher because zones that are
ineligible for use by the zone allocation policy are still scanned. This
patch fixes the zone-order zonelist policy and reduces the number of zones
scanned by the allocator.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   7 +++
 mm/mm_init.c           |   4 +-
 mm/page_alloc.c        | 146 +++++++++++++++++++++++++++++--------------------
 3 files changed, 96 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2f6443..f7f93d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,6 +534,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -571,6 +572,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
@@ -716,6 +722,7 @@ struct zoneref {
 struct zonelist {
 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
 	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
+	bool fair_enabled;	/* eligible for fair zone policy */
 #ifdef CONFIG_NUMA
 	struct zonelist_cache zlcache;			     // optional ...
 #endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4074caf..27ef4fa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -47,9 +47,9 @@ void mminit_verify_zonelist(void)
 				continue;
 
 			/* Print information about the zonelist */
-			printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
+			printk(KERN_DEBUG "mminit::zonelist %s %d:%s(%s) = ",
 				listid > 0 ? "thisnode" : "general", nid,
-				zone->name);
+				zone->name, zonelist->fair_enabled ? "F" : "");
 
 			/* Iterate the zonelist */
 			for_each_zone_zonelist(zone, z, zonelist, zoneid) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebbdbcd..d2ed2e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1544,7 +1544,7 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype, bool acct_fair)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1596,7 +1596,11 @@ again:
 					  get_freepage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (acct_fair) {
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+		if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
+			zone_set_flag(zone, ZONE_FAIR_DEPLETED);
+	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1908,6 +1912,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		if (!zone_is_fair_depleted(zone))
+			continue;
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1925,8 +1943,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1950,10 +1971,14 @@ zonelist_scan:
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
+				break;
+
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
+			}
 		}
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
@@ -2050,21 +2075,16 @@ zonelist_scan:
 
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+				gfp_mask, migratetype, zonelist->fair_enabled);
 		if (page)
 			break;
+
 this_zone_full:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2073,8 +2093,25 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	if ((alloc_flags & ALLOC_FAIR) && nr_fair_skipped) {
+		alloc_flags &= ~ALLOC_FAIR;
+		zonelist_rescan = true;
+		reset_alloc_batches(preferred_zone);
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2395,28 +2432,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2748,33 +2763,18 @@ retry_cpuset:
 		goto out;
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
+	if (zonelist->fair_enabled)
+		alloc_flags |= ALLOC_FAIR;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
@@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
-static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
+			struct zoneref *zoneref, struct zone *preferred_zone)
 {
+	int zone_type = zone_idx(zone);
+	bool fair_enabled = zone_local(zone, preferred_zone);
+	if (zone_type == 0 &&
+			zone->managed_pages < (pgdat->node_present_pages >> 4))
+		fair_enabled = false;
+
 	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
+	zoneref->zone_idx = zone_type;
+	return fair_enabled;
 }
 
 /*
@@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	struct zone *preferred_zone = NULL;
+	int nr_fair = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			if (!preferred_zone)
+				preferred_zone = zone;
+
+			nr_fair += zoneref_set_zone(pgdat, zone,
+				&zonelist->_zonerefs[nr_zones++],
+				preferred_zone);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
 
+	if (nr_fair <= 1)
+		zonelist->fair_enabled = false;
+
 	return nr_zones;
 }
 
@@ -3538,8 +3555,9 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
-	struct zone *z;
+	struct zone *z, *preferred_zone = NULL;
 	struct zonelist *zonelist;
+	int nr_fair = 0;
 
 	zonelist = &pgdat->node_zonelists[0];
 	pos = 0;
@@ -3547,15 +3565,25 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (!preferred_zone)
+				preferred_zone = z;
 			if (populated_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
+				nr_fair += zoneref_set_zone(pgdat, z,
+					&zonelist->_zonerefs[pos++],
+					preferred_zone);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
 	zonelist->_zonerefs[pos].zone = NULL;
 	zonelist->_zonerefs[pos].zone_idx = 0;
+
+	/*
+	 * For this policy, the fair zone allocation policy is disabled as the
+	 * stated priority is to preserve lower zones, not balance them fairly.
+	 */
+	if (nr_fair == 1 || nr_online_nodes > 1)
+		zonelist->fair_enabled = false;
 }
 
 static int default_zonelist_order(void)
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
@ 2014-06-27  8:14   ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt is
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does not
hold when zones reach their low watermark as the batches will be consumed
at uneven rates.  Allocation failures due to watermark depletion result in
additional zonelist scans for the reset and another watermark check before
hitting the slowpath.

This patch makes a number of changes that should reduce the overall cost:

o Do not apply the fair zone policy to small zones such as DMA
o Abort the fair zone allocation policy once remote or small zones are
  encountered
o Use a simpler scan when resetting NR_ALLOC_BATCH
o Use a simple flag to identify depleted zones instead of accessing a
  potentially write-intensive cache line for counters
o Track zones which met the watermark but failed the NR_ALLOC_BATCH check
  to avoid rescanning the zonelist when the counters are reset

On UMA machines, the effect is marginal. Even judging from system CPU
usage, it's small for the tiobench test

          3.16.0-rc2  3.16.0-rc2
            checklow    fairzone
User          396.24      396.23
System        395.23      391.50
Elapsed      5182.65     5165.49

And the number of pages allocated from each zone is comparable

                            3.16.0-rc2  3.16.0-rc2
                              checklow    fairzone
DMA allocs                           0           0
DMA32 allocs                   7374217     7920241
Normal allocs                999277551   996568115

On NUMA machines, the scanning overhead is higher because zones that are
ineligible for the fair zone allocation policy are still scanned. This patch
fixes the zone-order zonelist policy and reduces the number of zones scanned
by the allocator.
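
For illustration only (this is not part of the patch and not kernel code), the
following standalone program models the bookkeeping described above: a per-zone
batch counter, a cheap "depleted" flag set when the batch runs out, a reset
that only touches depleted zones, and a single unfair retry that is attempted
only when fairness skipping was the reason the first pass failed. The zone
names, batch sizes and struct layout are made up, and the watermark checks and
small-zone/zonelist handling are omitted.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for struct zone: just a fair batch and a depleted flag */
struct toy_zone {
        const char *name;
        long alloc_batch;       /* models NR_ALLOC_BATCH */
        long batch_size;        /* models high_wmark - low_wmark */
        bool fair_depleted;     /* models the ZONE_FAIR_DEPLETED flag */
};

/* Charge an allocation against the batch, as the allocation fast path would */
static void account_fair_alloc(struct toy_zone *z)
{
        if (--z->alloc_batch <= 0)
                z->fair_depleted = true;  /* later passes only test this flag */
}

/* Refill only the depleted zones, mirroring the simplified reset */
static void reset_batches(struct toy_zone *zones, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {
                if (!zones[i].fair_depleted)
                        continue;
                zones[i].alloc_batch = zones[i].batch_size;
                zones[i].fair_depleted = false;
        }
}

/* Two passes: a fair pass that skips depleted zones, then an unfair retry */
static struct toy_zone *alloc_one(struct toy_zone *zones, size_t nr)
{
        bool fair = true;
        int nr_fair_skipped;

retry:
        nr_fair_skipped = 0;
        for (size_t i = 0; i < nr; i++) {
                if (fair && zones[i].fair_depleted) {
                        /* zone was usable, only skipped for fairness */
                        nr_fair_skipped++;
                        continue;
                }
                account_fair_alloc(&zones[i]);
                return &zones[i];
        }
        if (fair && nr_fair_skipped) {  /* rescan only if fairness caused the miss */
                fair = false;
                reset_batches(zones, nr);
                goto retry;
        }
        return NULL;
}

int main(void)
{
        struct toy_zone zones[] = {
                { "Normal", 4, 4, false },
                { "DMA32",  2, 2, false },
        };

        for (int i = 0; i < 12; i++) {
                struct toy_zone *z = alloc_one(zones, 2);
                printf("allocation %2d served by %s\n", i, z ? z->name : "nobody");
        }
        return 0;
}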

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   7 +++
 mm/mm_init.c           |   4 +-
 mm/page_alloc.c        | 146 +++++++++++++++++++++++++++++--------------------
 3 files changed, 96 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2f6443..f7f93d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,6 +534,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -571,6 +572,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
@@ -716,6 +722,7 @@ struct zoneref {
 struct zonelist {
 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
 	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
+	bool fair_enabled;	/* eligible for fair zone policy */
 #ifdef CONFIG_NUMA
 	struct zonelist_cache zlcache;			     // optional ...
 #endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4074caf..27ef4fa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -47,9 +47,9 @@ void mminit_verify_zonelist(void)
 				continue;
 
 			/* Print information about the zonelist */
-			printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
+			printk(KERN_DEBUG "mminit::zonelist %s %d:%s(%s) = ",
 				listid > 0 ? "thisnode" : "general", nid,
-				zone->name);
+				zone->name, zonelist->fair_enabled ? "F" : "");
 
 			/* Iterate the zonelist */
 			for_each_zone_zonelist(zone, z, zonelist, zoneid) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebbdbcd..d2ed2e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1544,7 +1544,7 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype, bool acct_fair)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1596,7 +1596,11 @@ again:
 					  get_freepage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (acct_fair) {
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+		if (zone_page_state(zone, NR_ALLOC_BATCH) == 0)
+			zone_set_flag(zone, ZONE_FAIR_DEPLETED);
+	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1908,6 +1912,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		if (!zone_is_fair_depleted(zone))
+			continue;
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1925,8 +1943,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1950,10 +1971,14 @@ zonelist_scan:
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
+				break;
+
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
+			}
 		}
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
@@ -2050,21 +2075,16 @@ zonelist_scan:
 
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+				gfp_mask, migratetype, zonelist->fair_enabled);
 		if (page)
 			break;
+
 this_zone_full:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2073,8 +2093,25 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	if ((alloc_flags & ALLOC_FAIR) && nr_fair_skipped) {
+		alloc_flags &= ~ALLOC_FAIR;
+		zonelist_rescan = true;
+		reset_alloc_batches(preferred_zone);
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2395,28 +2432,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2748,33 +2763,18 @@ retry_cpuset:
 		goto out;
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
+	if (zonelist->fair_enabled)
+		alloc_flags |= ALLOC_FAIR;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
@@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
-static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
+			struct zoneref *zoneref, struct zone *preferred_zone)
 {
+	int zone_type = zone_idx(zone);
+	bool fair_enabled = zone_local(zone, preferred_zone);
+	if (zone_type == 0 &&
+			zone->managed_pages < (pgdat->node_present_pages >> 4))
+		fair_enabled = false;
+
 	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
+	zoneref->zone_idx = zone_type;
+	return fair_enabled;
 }
 
 /*
@@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	struct zone *preferred_zone = NULL;
+	int nr_fair = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			if (!preferred_zone)
+				preferred_zone = zone;
+
+			nr_fair += zoneref_set_zone(pgdat, zone,
+				&zonelist->_zonerefs[nr_zones++],
+				preferred_zone);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
 
+	if (nr_fair <= 1)
+		zonelist->fair_enabled = false;
+
 	return nr_zones;
 }
 
@@ -3538,8 +3555,9 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
-	struct zone *z;
+	struct zone *z, *preferred_zone = NULL;
 	struct zonelist *zonelist;
+	int nr_fair = 0;
 
 	zonelist = &pgdat->node_zonelists[0];
 	pos = 0;
@@ -3547,15 +3565,25 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (!preferred_zone)
+				preferred_zone = z;
 			if (populated_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
+				nr_fair += zoneref_set_zone(pgdat, z,
+					&zonelist->_zonerefs[pos++],
+					preferred_zone);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
 	zonelist->_zonerefs[pos].zone = NULL;
 	zonelist->_zonerefs[pos].zone_idx = 0;
+
+	/*
+	 * For this policy, the fair zone allocation policy is disabled as the
+	 * stated priority is to preserve lower zones, not balance them fairly.
+	 */
+	if (nr_fair == 1 || nr_online_nodes > 1)
+		zonelist->fair_enabled = false;
 }
 
 static int default_zonelist_order(void)
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] mm: page_alloc: Reduce cost of dirty zone balancing
  2014-06-27  8:14 ` Mel Gorman
@ 2014-06-27  8:14   ` Mel Gorman
  -1 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

When allocating a page cache page for writing, the allocator makes an attempt
to distribute dirty pages proportionally between populated zones. The call
to zone_dirty_ok is more expensive than expected because of the number of
vmstats it examines. This patch caches some of that information to reduce
the cost. It means the proportional allocation is based on stale data but
the heuristic should not need perfectly accurate information. As before,
the benefit is marginal but cumulative and depends on the size of the
machine. For a very small machine, the effect is visible in system CPU time
for a tiobench test, but it'll vary between workloads.

          3.16.0-rc2  3.16.0-rc2
            fairzone   lessdirty
User          393.76      389.03
System        391.50      388.64
Elapsed      5182.86     5186.28

Even though the benefit is small, it's generally worthwhile to reduce overhead
in the page allocator as it has a tendency to creep up over time.
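
As a rough standalone model of the caching idea (illustration only; the types,
numbers and helper names below are simplified stand-ins for zone_dirty_limit()
and zone_dirty_ok(), not the kernel implementation): the common path compares
against a cached limit, privileged callers recompute it and take the 25% boost,
and the allocator refreshes the cache whenever the check fails.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct zone's dirty accounting */
struct toy_zone {
        unsigned long dirtyable;           /* models zone_dirtyable_memory() */
        unsigned long nr_dirty;            /* models NR_FILE_DIRTY and friends */
        unsigned long dirty_limit_cached;  /* models zone->dirty_limit_cached */
};

static int vm_dirty_ratio = 20;            /* percent, the usual sysctl default */

/* The "expensive" computation; in the kernel this examines several vmstats */
static unsigned long dirty_limit(const struct toy_zone *z)
{
        return z->dirtyable * vm_dirty_ratio / 100;
}

/*
 * Fast-path check against the cached limit.  Privileged callers (the model's
 * stand-in for PF_LESS_THROTTLE/rt tasks) recompute it and get the 25% boost.
 */
static bool dirty_ok(struct toy_zone *z, bool privileged)
{
        unsigned long limit = z->dirty_limit_cached;

        if (privileged) {
                limit = dirty_limit(z);
                z->dirty_limit_cached = limit;
                limit += limit / 4;
        }
        return z->nr_dirty <= limit;
}

/* On failure the allocator refreshes the cache, as the patch does */
static void note_dirty_fail(struct toy_zone *z)
{
        z->dirty_limit_cached = dirty_limit(z);
}

int main(void)
{
        struct toy_zone zone = { .dirtyable = 1000, .nr_dirty = 150,
                                 .dirty_limit_cached = 0 };

        /* First check fails against the stale (zero) cache ... */
        if (!dirty_ok(&zone, false))
                note_dirty_fail(&zone);
        /* ... the refreshed cache then lets ordinary writers through */
        printf("ordinary writer ok: %d\n", dirty_ok(&zone, false));
        printf("privileged writer ok: %d\n", dirty_ok(&zone, true));
        return 0;
}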

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h    |  2 ++
 include/linux/writeback.h |  1 +
 mm/internal.h             |  1 +
 mm/page-writeback.c       | 23 +++++++++++++++--------
 mm/page_alloc.c           | 16 ++++++++++++----
 5 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7f93d4..cd7a3d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -480,6 +480,8 @@ struct zone {
 	/* zone flags, see below */
 	unsigned long		flags;
 
+	unsigned long		dirty_limit_cached;
+
 	ZONE_PADDING(_pad2_)
 
 	/* Write-intensive fields used by page reclaim */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..90190d4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -121,6 +121,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirty_limit(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11f..f31e3b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -370,5 +370,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
+#define ALLOC_DIRTY		0x200 /* spread GFP_WRITE allocations */
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 518e2c3..a7d11b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -256,8 +256,6 @@ static unsigned long global_dirtyable_memory(void)
  * Calculate the dirty thresholds based on sysctl parameters
  * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
  * - vm.dirty_ratio             or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
@@ -298,10 +296,9 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  * Returns the maximum number of dirty pages allowed in a zone, based
  * on the zone's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+unsigned long zone_dirty_limit(struct zone *zone)
 {
 	unsigned long zone_memory = zone_dirtyable_memory(zone);
-	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
@@ -310,9 +307,6 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 	else
 		dirty = vm_dirty_ratio * zone_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
-		dirty += dirty / 4;
-
 	return dirty;
 }
 
@@ -325,7 +319,20 @@ static unsigned long zone_dirty_limit(struct zone *zone)
  */
 bool zone_dirty_ok(struct zone *zone)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	unsigned long limit = zone->dirty_limit_cached;
+	struct task_struct *tsk = current;
+
+	/*
+	 * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
+	 * and real-time tasks to prioritise their allocations.
+	 * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
+	 * blocking tasks that can clean pages.
+	 */
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		limit = zone_dirty_limit(zone);
+		zone->dirty_limit_cached = limit;
+		limit += limit / 4;
+	}
 
 	return zone_page_state(zone, NR_FILE_DIRTY) +
 	       zone_page_state(zone, NR_UNSTABLE_NFS) +
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2ed2e0..cf8e858 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1941,9 +1941,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
-	int nr_fair_skipped = 0;
+	int nr_fair_skipped = 0, nr_fail_dirty = 0;
 	bool zonelist_rescan;
 
 zonelist_scan:
@@ -2005,8 +2003,11 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if ((alloc_flags & ALLOC_DIRTY) && !zone_dirty_ok(zone)) {
+			nr_fail_dirty++;
+			zone->dirty_limit_cached = zone_dirty_limit(zone);
 			continue;
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
@@ -2108,6 +2109,11 @@ this_zone_full:
 		zonelist_rescan = true;
 	}
 
+	if ((alloc_flags & ALLOC_DIRTY) && nr_fail_dirty) {
+		alloc_flags &= ~ALLOC_DIRTY;
+		zonelist_rescan = true;
+	}
+
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2765,6 +2771,8 @@ retry_cpuset:
 
 	if (zonelist->fair_enabled)
 		alloc_flags |= ALLOC_FAIR;
+	if (gfp_mask & __GFP_WRITE)
+		alloc_flags |= ALLOC_DIRTY;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] mm: page_alloc: Reduce cost of dirty zone balancing
@ 2014-06-27  8:14   ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

When allocating a page cache page for writing, the allocator makes an attempt
to distribute dirty pages proportionally between populated zones. The call
to zone_dirty_ok is more expensive than expected because of the number of
vmstats it examines. This patch caches some of that information to reduce
the cost. It means the proportional allocation is based on stale data but
the heuristic should not need perfectly accurate information. As before,
the benefit is marginal but cumulative and depends on the size of the
machine. For a very small machine, the effect is visible in system CPU time
for a tiobench test, but it'll vary between workloads.

          3.16.0-rc2  3.16.0-rc2
            fairzone   lessdirty
User          393.76      389.03
System        391.50      388.64
Elapsed      5182.86     5186.28

Even though the benefit is small, it's generally worthwhile to reduce overhead
in the page allocator as it has a tendency to creep up over time.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h    |  2 ++
 include/linux/writeback.h |  1 +
 mm/internal.h             |  1 +
 mm/page-writeback.c       | 23 +++++++++++++++--------
 mm/page_alloc.c           | 16 ++++++++++++----
 5 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7f93d4..cd7a3d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -480,6 +480,8 @@ struct zone {
 	/* zone flags, see below */
 	unsigned long		flags;
 
+	unsigned long		dirty_limit_cached;
+
 	ZONE_PADDING(_pad2_)
 
 	/* Write-intensive fields used by page reclaim */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..90190d4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -121,6 +121,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirty_limit(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11f..f31e3b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -370,5 +370,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
+#define ALLOC_DIRTY		0x200 /* spread GFP_WRITE allocations */
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 518e2c3..a7d11b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -256,8 +256,6 @@ static unsigned long global_dirtyable_memory(void)
  * Calculate the dirty thresholds based on sysctl parameters
  * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
  * - vm.dirty_ratio             or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
@@ -298,10 +296,9 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  * Returns the maximum number of dirty pages allowed in a zone, based
  * on the zone's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+unsigned long zone_dirty_limit(struct zone *zone)
 {
 	unsigned long zone_memory = zone_dirtyable_memory(zone);
-	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
@@ -310,9 +307,6 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 	else
 		dirty = vm_dirty_ratio * zone_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
-		dirty += dirty / 4;
-
 	return dirty;
 }
 
@@ -325,7 +319,20 @@ static unsigned long zone_dirty_limit(struct zone *zone)
  */
 bool zone_dirty_ok(struct zone *zone)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	unsigned long limit = zone->dirty_limit_cached;
+	struct task_struct *tsk = current;
+
+	/*
+	 * The dirty limits are lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd)
+	 * and real-time tasks to prioritise their allocations.
+	 * PF_LESS_THROTTLE tasks may be cleaning memory and rt tasks may be
+	 * blocking tasks that can clean pages.
+	 */
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		limit = zone_dirty_limit(zone);
+		zone->dirty_limit_cached = limit;
+		limit += limit / 4;
+	}
 
 	return zone_page_state(zone, NR_FILE_DIRTY) +
 	       zone_page_state(zone, NR_UNSTABLE_NFS) +
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2ed2e0..cf8e858 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1941,9 +1941,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
-	int nr_fair_skipped = 0;
+	int nr_fair_skipped = 0, nr_fail_dirty = 0;
 	bool zonelist_rescan;
 
 zonelist_scan:
@@ -2005,8 +2003,11 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if ((alloc_flags & ALLOC_DIRTY) && !zone_dirty_ok(zone)) {
+			nr_fail_dirty++;
+			zone->dirty_limit_cached = zone_dirty_limit(zone);
 			continue;
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
@@ -2108,6 +2109,11 @@ this_zone_full:
 		zonelist_rescan = true;
 	}
 
+	if ((alloc_flags & ALLOC_DIRTY) && nr_fail_dirty) {
+		alloc_flags &= ~ALLOC_DIRTY;
+		zonelist_rescan = true;
+	}
+
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2765,6 +2771,8 @@ retry_cpuset:
 
 	if (zonelist->fair_enabled)
 		alloc_flags |= ALLOC_FAIR;
+	if (gfp_mask & __GFP_WRITE)
+		alloc_flags |= ALLOC_DIRTY;
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced
  2014-06-27  8:14   ` Mel Gorman
@ 2014-06-27 17:26     ` Johannes Weiner
  -1 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2014-06-27 17:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 09:14:38AM +0100, Mel Gorman wrote:
> Historically kswapd scanned from DMA->Movable in the opposite direction
> to the page allocator to avoid allocating behind kswapd direction of
> progress. The fair zone allocation policy altered this in a non-obvious
> manner.
> 
> Traditionally, the page allocator prefers to use the highest eligible zone
> until the watermark is depleted, woke kswapd and moved onto the next zone.

That's not quite right, the page allocator tries all zones in the
zonelist, then wakes up kswapd, then tries again from the beginning.

> kswapd scans zones in the opposite direction so the scanning lists on
> 64-bit look like this;
> 
> Page alloc		Kswapd
> ----------              ------
> Movable			DMA
> Normal			DMA32
> DMA32			Normal
> DMA			Movable
> 
> If kswapd scanned in the same direction as the page allocator then it is
> possible that kswapd would proportionally reclaim the lower zones that
> were never used as the page allocator was always allocating behind the
> reclaim. This would work as follows
> 
> 	pgalloc hits Normal low wmark
> 					kswapd reclaims Normal
> 					kswapd reclaims DMA32
> 	pgalloc hits Normal low wmark
> 					kswapd reclaims Normal
> 					kswapd reclaims DMA32
> 
> The introduction of the fair zone allocation policy fundamentally altered
> this problem by interleaving between zones until the low watermark is
> reached. There are at least two issues with this
> 
> o The page allocator can allocate behind kswapds progress (scans/reclaims
>   lower zone and fair zone allocation policy then uses those pages)
> o When the low watermark of the high zone is reached there may recently
>   allocated pages allocated from the lower zone but as kswapd scans
>   dma->highmem to the highest zone needing balancing it'll reclaim the
>   lower zone even if it was balanced.
> 
> Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
> some percentage will be allocated from Normal and some from DMA32. The
> percentage depends on the ratio of the zone sizes and when their watermarks
> were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
> is unbalanced only DMA32 will be shrunk. This leads to a difference of
> ages between DMA32 and Normal. Relatively young pages are then continually
> rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
> Some of these pages may be recently read-ahead pages requiring that the page
> be re-read from disk and impacting overall performance.
> 
> The problem is fundamental to the fact we have per-zone LRU and allocation
> policies and ideally we would only have per-node allocation and LRU lists.
> This would avoid the need for the fair zone allocation policy but the
> low-memory-starvation issue would have to be addressed again from scratch.
> 
> This patch will only scan/reclaim from lower zones if they have not
> reached their watermark. This should not break the normal page aging
> as the proportional allocations due to the fair zone allocation policy
> should compensate.

That's already the case, kswapd_shrink_zone() checks whether the zone
is balanced before scanning it, so something in this analysis is off -
but I have to admit that I have trouble following it.

The only difference between the two checks is that the outer one you add
does not enforce the balance gap, which means that we stop reclaiming
zones a little earlier than before.  I guess this is where the
throughput improvements come from, but there is a chance it will
regress latency for bursty allocations.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced
@ 2014-06-27 17:26     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2014-06-27 17:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 09:14:38AM +0100, Mel Gorman wrote:
> Historically kswapd scanned from DMA->Movable in the opposite direction
> to the page allocator to avoid allocating behind kswapd direction of
> progress. The fair zone allocation policy altered this in a non-obvious
> manner.
> 
> Traditionally, the page allocator prefers to use the highest eligible zone
> until the watermark is depleted, woke kswapd and moved onto the next zone.

That's not quite right, the page allocator tries all zones in the
zonelist, then wakes up kswapd, then tries again from the beginning.

> kswapd scans zones in the opposite direction so the scanning lists on
> 64-bit look like this;
> 
> Page alloc		Kswapd
> ----------              ------
> Movable			DMA
> Normal			DMA32
> DMA32			Normal
> DMA			Movable
> 
> If kswapd scanned in the same direction as the page allocator then it is
> possible that kswapd would proportionally reclaim the lower zones that
> were never used as the page allocator was always allocating behind the
> reclaim. This would work as follows
> 
> 	pgalloc hits Normal low wmark
> 					kswapd reclaims Normal
> 					kswapd reclaims DMA32
> 	pgalloc hits Normal low wmark
> 					kswapd reclaims Normal
> 					kswapd reclaims DMA32
> 
> The introduction of the fair zone allocation policy fundamentally altered
> this problem by interleaving between zones until the low watermark is
> reached. There are at least two issues with this
> 
> o The page allocator can allocate behind kswapds progress (scans/reclaims
>   lower zone and fair zone allocation policy then uses those pages)
> o When the low watermark of the high zone is reached there may recently
>   allocated pages allocated from the lower zone but as kswapd scans
>   dma->highmem to the highest zone needing balancing it'll reclaim the
>   lower zone even if it was balanced.
> 
> Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
> some percentage will be allocated from Normal and some from DMA32. The
> percentage depends on the ratio of the zone sizes and when their watermarks
> were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
> is unbalanced only DMA32 will be shrunk. This leads to a difference of
> ages between DMA32 and Normal. Relatively young pages are then continually
> rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
> Some of these pages may be recently read-ahead pages requiring that the page
> be re-read from disk and impacting overall performance.
> 
> The problem is fundamental to the fact we have per-zone LRU and allocation
> policies and ideally we would only have per-node allocation and LRU lists.
> This would avoid the need for the fair zone allocation policy but the
> low-memory-starvation issue would have to be addressed again from scratch.
> 
> This patch will only scan/reclaim from lower zones if they have not
> reached their watermark. This should not break the normal page aging
> as the proportional allocations due to the fair zone allocation policy
> should compensate.

That's already the case, kswapd_shrink_zone() checks whether the zone
is balanced before scanning it, so something in this analysis is off -
but I have to admit that I have trouble following it.

The only difference between the two checks is that the outer one you add
does not enforce the balance gap, which means that we stop reclaiming
zones a little earlier than before.  I guess this is where the
throughput improvements come from, but there is a chance it will
regress latency for bursty allocations.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced
  2014-06-27 17:26     ` Johannes Weiner
@ 2014-06-27 18:42       ` Mel Gorman
  -1 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27 18:42 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 01:26:57PM -0400, Johannes Weiner wrote:
> On Fri, Jun 27, 2014 at 09:14:38AM +0100, Mel Gorman wrote:
> > Historically kswapd scanned from DMA->Movable in the opposite direction
> > to the page allocator to avoid allocating behind kswapd direction of
> > progress. The fair zone allocation policy altered this in a non-obvious
> > manner.
> > 
> > Traditionally, the page allocator prefers to use the highest eligible zone
> > until the watermark is depleted, woke kswapd and moved onto the next zone.
> 
> That's not quite right, the page allocator tries all zones in the
> zonelist, then wakes up kswapd, then tries again from the beginning.
> 

You're right of course, sorry about that. It is still the case that once
kswapd is awake it can reclaim from the lower zones for longer than
the higher zones, and this does not play well with the fair zone policy
interleaving between them.

> > kswapd scans zones in the opposite direction so the scanning lists on
> > 64-bit look like this;
> > 
> > Page alloc		Kswapd
> > ----------              ------
> > Movable			DMA
> > Normal			DMA32
> > DMA32			Normal
> > DMA			Movable
> > 
> > If kswapd scanned in the same direction as the page allocator then it is
> > possible that kswapd would proportionally reclaim the lower zones that
> > were never used as the page allocator was always allocating behind the
> > reclaim. This would work as follows
> > 
> > 	pgalloc hits Normal low wmark
> > 					kswapd reclaims Normal
> > 					kswapd reclaims DMA32
> > 	pgalloc hits Normal low wmark
> > 					kswapd reclaims Normal
> > 					kswapd reclaims DMA32
> > 
> > The introduction of the fair zone allocation policy fundamentally altered
> > this problem by interleaving between zones until the low watermark is
> > reached. There are at least two issues with this
> > 
> > o The page allocator can allocate behind kswapds progress (scans/reclaims
> >   lower zone and fair zone allocation policy then uses those pages)
> > o When the low watermark of the high zone is reached there may recently
> >   allocated pages allocated from the lower zone but as kswapd scans
> >   dma->highmem to the highest zone needing balancing it'll reclaim the
> >   lower zone even if it was balanced.
> > 
> > Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
> > some percentage will be allocated from Normal and some from DMA32. The
> > percentage depends on the ratio of the zone sizes and when their watermarks
> > were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
> > is unbalanced only DMA32 will be shrunk. This leads to a difference of
> > ages between DMA32 and Normal. Relatively young pages are then continually
> > rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
> > Some of these pages may be recently read-ahead pages requiring that the page
> > be re-read from disk and impacting overall performance.
> > 
> > The problem is fundamental to the fact we have per-zone LRU and allocation
> > policies and ideally we would only have per-node allocation and LRU lists.
> > This would avoid the need for the fair zone allocation policy but the
> > low-memory-starvation issue would have to be addressed again from scratch.
> > 
> > This patch will only scan/reclaim from lower zones if they have not
> > reached their watermark. This should not break the normal page aging
> > as the proportional allocations due to the fair zone allocation policy
> > should compensate.
> 
> That's already the case, kswapd_shrink_zone() checks whether the zone
> is balanced before scanning in, so something in this analysis is off -
> but I have to admit that I have trouble following it.
> 
> The only difference in the two checks is that the outer one you add
> does not enforce the balance gap, which means that we stop reclaiming
> zones a little earlier than before. 

Not only is the balance gap a factor (which can be large), but the classzone
reserves are also applied, which will keep kswapd reclaiming longer.  Still,
you're right. It's far more appropriate to push the checks down to where the
buffer head reserves are accounted for.
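
To put a rough number on that, here is a trivial standalone calculation
(illustration only, all page counts are made up and the reserve handling is
simplified) of the target the old zone_balanced() check effectively required
compared with stopping at the high watermark:

#include <stdio.h>

#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100       /* 1% of the zone, as before */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

int main(void)
{
        /* Illustrative numbers for a large Normal zone, in pages */
        unsigned long managed_pages = 2000000;
        unsigned long low_wmark = 12000;
        unsigned long high_wmark = 14400;
        unsigned long lowmem_reserve = 8000;    /* classzone reserve, made up */

        /* balance_gap = min(low_wmark, DIV_ROUND_UP(managed, 100)) */
        unsigned long balance_gap = min_ul(low_wmark,
                        (managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                        KSWAPD_ZONE_BALANCE_GAP_RATIO);

        printf("old target: %lu pages (high_wmark + gap + reserve)\n",
                        high_wmark + balance_gap + lowmem_reserve);
        printf("new target: %lu pages (high_wmark only)\n", high_wmark);
        return 0;
}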

I still think that ultimately we want to move the LRUs to a per-node
basis. It should be possible to do this piecemeal and keep the free page
lists on a per-zone basis and have direct reclaim skip through the LRU
finding lowmem pages if that is required. It would be a fairly far-reaching
change but overall the aging would make more sense and it would remove
the fair zone allocation policy entirely. It'd hurt while reclaiming for
lowmem on 32-bit with large highmem zones but I think that is far less a
concern than it would have been 10 years ago.

> I guess this is where the
> throughput improvements come from, but there is a chance it will
> regress latency for bursty allocations.

Any suggestions on how to check that? I would expect that bursty
allocations are falling into the slow path and potentially doing direct
reclaim. That would make the effect difficult to measure.

This is the version that is currently being tested

---8<---
mm: vmscan: Do not reclaim from lower zones if they are balanced

Historically kswapd scanned from DMA->Movable in the opposite direction
to the page allocator to avoid allocating behind kswapd direction of
progress. The fair zone allocation policy altered this in a non-obvious
manner.

Traditionally, the page allocator prefers to use the highest eligible
zones in order until the low watermarks are reached and then wakes kswapd.
Once kswapd is awake, it scans zones in the opposite direction so the
scanning lists on 64-bit look like this;

Page alloc		Kswapd
----------              ------
Movable			DMA
Normal			DMA32
DMA32			Normal
DMA			Movable

If kswapd scanned in the same direction as the page allocator then it is
possible that kswapd would proportionally reclaim the lower zones that
were never used as the page allocator was always allocating behind the
reclaim. This would work as follows

	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32
	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32

The introduction of the fair zone allocation policy fundamentally altered
this problem by interleaving between zones until the low watermark is
reached. There are at least two issues with this

o The page allocator can allocate behind kswapd's progress (it scans/reclaims
  a lower zone and the fair zone allocation policy then uses those pages)
o When the low watermark of the high zone is reached there may be recently
  allocated pages from the lower zone, but as kswapd scans dma->highmem to
  the highest zone needing balancing it'll reclaim the lower zone even if it
  was balanced.

Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
some percentage will be allocated from Normal and some from DMA32. The
percentage depends on the ratio of the zone sizes and when their watermarks
were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
is unbalanced only DMA32 will be shrunk. This leads to a difference of
ages between DMA32 and Normal. Relatively young pages are then continually
rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
Some of these pages may be recently read-ahead pages requiring that the page
be re-read from disk and impacting overall performance.

The problem is fundamental to the fact we have per-zone LRU and allocation
policies and ideally we would only have per-node allocation and LRU lists.
This would avoid the need for the fair zone allocation policy but the
low-memory-starvation issue would have to be addressed again from scratch.

kswapd shrinks equally from all zones up to the high watermark plus a
balance gap and the lowmem reserves. This patch removes the additional
reclaim from lower zones on the grounds that the fair zone allocation
policy will typically be interleaving between the zones.  This should not
break the normal page aging as the proportional allocations due to the
fair zone allocation policy should compensate.

tiobench was used to evaluate this because it includes a simple sequential
reader, which is where the regression is most obvious. It also has threaded
readers that produce reasonably steady figures.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla           checklow-v4               vanilla
TO-BE-UPDATED

There are still regressions for higher numbers of threads but these are
related to changes in the CFQ IO scheduler.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/swap.h |  9 ---------
 mm/vmscan.c          | 46 ++++++++++++++++------------------------------
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bdbee8..1680307 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -165,15 +165,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..3e315c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2294,7 +2294,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 /* Returns true if compaction should go ahead for a high-order request */
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/* Do not consider compaction for orders reclaim is meant to satisfy */
@@ -2307,9 +2307,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+	watermark = high_wmark_pages(zone) + (2UL << sc->order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
 
 	/*
@@ -2816,11 +2814,9 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			  unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx, 0))
+	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0))
 		return false;
 
 	if (IS_ENABLED(CONFIG_COMPACTION) && order &&
@@ -2877,7 +2873,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -2934,7 +2930,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       unsigned long *nr_attempted)
 {
 	int testorder = sc->order;
-	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
@@ -2956,21 +2951,11 @@ static bool kswapd_shrink_zone(struct zone *zone,
 		testorder = 0;
 
 	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-
-	/*
 	 * If there is no low memory pressure or the zone is balanced then no
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
-						balance_gap, classzone_idx))
+	if (!lowmem_pressure && zone_balanced(zone, testorder))
 		return true;
 
 	shrink_zone(zone, sc);
@@ -2993,7 +2978,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, testorder)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
 	}
@@ -3079,7 +3064,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3124,12 +3109,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * at the last zone which needs scanning. We do this because
+		 * the page allocators prefers to work in the opposite
+		 * direction and we want to avoid the page allocator reclaiming
+		 * behind kswapd's direction of progress. Due to the fair zone
+		 * allocation policy interleaving allocations between zones
+		 * we no longer proportionally scan the lower zones if the
+		 * watermarks are ok.
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -3397,7 +3383,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced
@ 2014-06-27 18:42       ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27 18:42 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 01:26:57PM -0400, Johannes Weiner wrote:
> On Fri, Jun 27, 2014 at 09:14:38AM +0100, Mel Gorman wrote:
> > Historically kswapd scanned from DMA->Movable in the opposite direction
> > to the page allocator to avoid allocating behind kswapd direction of
> > progress. The fair zone allocation policy altered this in a non-obvious
> > manner.
> > 
> > Traditionally, the page allocator prefers to use the highest eligible zone
> > until the watermark is depleted, woke kswapd and moved onto the next zone.
> 
> That's not quite right, the page allocator tries all zones in the
> zonelist, then wakes up kswapd, then tries again from the beginning.
> 

You're right of course, sorry about that. It is still the case that once
kswapd is awake it can reclaim from the lower zones for longer than
the higher zones, and this does not play well with the fair zone policy
interleaving between them.

> > kswapd scans zones in the opposite direction so the scanning lists on
> > 64-bit look like this;
> > 
> > Page alloc		Kswapd
> > ----------              ------
> > Movable			DMA
> > Normal			DMA32
> > DMA32			Normal
> > DMA			Movable
> > 
> > If kswapd scanned in the same direction as the page allocator then it is
> > possible that kswapd would proportionally reclaim the lower zones that
> > were never used as the page allocator was always allocating behind the
> > reclaim. This would work as follows
> > 
> > 	pgalloc hits Normal low wmark
> > 					kswapd reclaims Normal
> > 					kswapd reclaims DMA32
> > 	pgalloc hits Normal low wmark
> > 					kswapd reclaims Normal
> > 					kswapd reclaims DMA32
> > 
> > The introduction of the fair zone allocation policy fundamentally altered
> > this problem by interleaving between zones until the low watermark is
> > reached. There are at least two issues with this
> > 
> > o The page allocator can allocate behind kswapds progress (scans/reclaims
> >   lower zone and fair zone allocation policy then uses those pages)
> > o When the low watermark of the high zone is reached there may recently
> >   allocated pages allocated from the lower zone but as kswapd scans
> >   dma->highmem to the highest zone needing balancing it'll reclaim the
> >   lower zone even if it was balanced.
> > 
> > Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
> > some percentage will be allocated from Normal and some from DMA32. The
> > percentage depends on the ratio of the zone sizes and when their watermarks
> > were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
> > is unbalanced only DMA32 will be shrunk. This leads to a difference of
> > ages between DMA32 and Normal. Relatively young pages are then continually
> > rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
> > Some of these pages may be recently read-ahead pages requiring that the page
> > be re-read from disk and impacting overall performance.
> > 
> > The problem is fundamental to the fact we have per-zone LRU and allocation
> > policies and ideally we would only have per-node allocation and LRU lists.
> > This would avoid the need for the fair zone allocation policy but the
> > low-memory-starvation issue would have to be addressed again from scratch.
> > 
> > This patch will only scan/reclaim from lower zones if they have not
> > reached their watermark. This should not break the normal page aging
> > as the proportional allocations due to the fair zone allocation policy
> > should compensate.
> 
> That's already the case, kswapd_shrink_zone() checks whether the zone
> is balanced before scanning in, so something in this analysis is off -
> but I have to admit that I have trouble following it.
> 
> The only difference in the two checks is that the outer one you add
> does not enforce the balance gap, which means that we stop reclaiming
> zones a little earlier than before. 

Not only is the balance gap a factor (which can be large) but the classzone
reserves are also applied which will keep kswapd reclaiming longer.  Still,
you're right. It's far more appropriate to push the checks down where the
buffer head reserves are accounted for.

I still think that ultimately we want to move the LRUs to a per-node
basis. It should be possible to do this piecemeal and keep the free page
lists on a per-zone basis and have direct reclaim skip through the LRU
finding lowmem pages if that is required. It would be a fairly far-reaching
change but overall the aging would make more sense and it would remove
the fair zone allocation policy entirely. It'd hurt while reclaiming for
lowmem on 32-bit with large highmem zones but I think that is far less a
concern than it would have been 10 years ago.

> I guess this is where the
> throughput improvements come from, but there is a chance it will
> regress latency for bursty allocations.

Any suggestions on how to check that? I would expect that bursty
allocations are falling into the slow path and potentially doing direct
reclaim. That would make the effect difficult to measure.

This is the version that is currently being tested

---8<---
mm: vmscan: Do not reclaim from lower zones if they are balanced

Historically kswapd scanned from DMA->Movable in the opposite direction
to the page allocator to avoid allocating behind kswapd direction of
progress. The fair zone allocation policy altered this in a non-obvious
manner.

Traditionally, the page allocator prefers to use the highest eligible
zones in order until the low watermarks are reached and then wakes kswapd.
Once kswapd is awake, it scans zones in the opposite direction so the
scanning lists on 64-bit look like this;

Page alloc		Kswapd
----------              ------
Movable			DMA
Normal			DMA32
DMA32			Normal
DMA			Movable

If kswapd scanned in the same direction as the page allocator then it is
possible that kswapd would proportionally reclaim the lower zones that
were never used as the page allocator was always allocating behind the
reclaim. This would work as follows

	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32
	pgalloc hits Normal low wmark
					kswapd reclaims Normal
					kswapd reclaims DMA32

The introduction of the fair zone allocation policy fundamentally altered
this problem by interleaving between zones until the low watermark is
reached. There are at least two issues with this

o The page allocator can allocate behind kswapds progress (scans/reclaims
  lower zone and fair zone allocation policy then uses those pages)
o When the low watermark of the high zone is reached there may recently
  allocated pages allocated from the lower zone but as kswapd scans
  dma->highmem to the highest zone needing balancing it'll reclaim the
  lower zone even if it was balanced.

Let N = high_wmark(Normal) + high_wmark(DMA32). Of the last N allocations,
some percentage will be allocated from Normal and some from DMA32. The
percentage depends on the ratio of the zone sizes and when their watermarks
were hit. If Normal is unbalanced, DMA32 will be shrunk by kswapd. If DMA32
is unbalanced only DMA32 will be shrunk. This leads to a difference of
ages between DMA32 and Normal. Relatively young pages are then continually
rotated and reclaimed from DMA32 due to the higher zone being unbalanced.
Some of these pages may be recently read-ahead pages requiring that the page
be re-read from disk and impacting overall performance.

The problem is fundamental to the fact we have per-zone LRU and allocation
policies and ideally we would only have per-node allocation and LRU lists.
This would avoid the need for the fair zone allocation policy but the
low-memory-starvation issue would have to be addressed again from scratch.

kswapd shrinks equally from all zones up to the high watermark plus a
balance gap and the lowmem reserves. This patch removes the additional
reclaim from lower zones on the grounds that the fair zone allocation
policy will typically be interleaving between the zones.  This should not
break the normal page aging as the proportional allocations due to the
fair zone allocation policy should compensate.

tiobench was used to evaluate this because it includes a simple
sequential reader which is the most obvious regression. It also has threaded
readers that produce reasonably steady figures.

                                      3.16.0-rc2            3.16.0-rc2                 3.0.0
                                         vanilla           checklow-v4               vanilla
TO-BE-UPDATED

There are still regressions for higher number of threads but this is
related to changes in the CFQ IO scheduler.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/swap.h |  9 ---------
 mm/vmscan.c          | 46 ++++++++++++++++------------------------------
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bdbee8..1680307 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -165,15 +165,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..3e315c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2294,7 +2294,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 /* Returns true if compaction should go ahead for a high-order request */
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/* Do not consider compaction for orders reclaim is meant to satisfy */
@@ -2307,9 +2307,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+	watermark = high_wmark_pages(zone) + (2UL << sc->order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
 
 	/*
@@ -2816,11 +2814,9 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			  unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx, 0))
+	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0))
 		return false;
 
 	if (IS_ENABLED(CONFIG_COMPACTION) && order &&
@@ -2877,7 +2873,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -2934,7 +2930,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       unsigned long *nr_attempted)
 {
 	int testorder = sc->order;
-	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
@@ -2956,21 +2951,11 @@ static bool kswapd_shrink_zone(struct zone *zone,
 		testorder = 0;
 
 	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-
-	/*
 	 * If there is no low memory pressure or the zone is balanced then no
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
-						balance_gap, classzone_idx))
+	if (!lowmem_pressure && zone_balanced(zone, testorder))
 		return true;
 
 	shrink_zone(zone, sc);
@@ -2993,7 +2978,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, testorder)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
 	}
@@ -3079,7 +3064,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3124,12 +3109,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * at the last zone which needs scanning. We do this because
+		 * the page allocators prefers to work in the opposite
+		 * direction and we want to avoid the page allocator reclaiming
+		 * behind kswapd's direction of progress. Due to the fair zone
+		 * allocation policy interleaving allocations between zones
+		 * we no longer proportionally scan the lower zones if the
+		 * watermarks are ok.
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -3397,7 +3383,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-27  8:14   ` Mel Gorman
@ 2014-06-27 18:57     ` Johannes Weiner
  -1 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2014-06-27 18:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 09:14:39AM +0100, Mel Gorman wrote:
> The fair zone allocation policy round-robins allocations between zones
> within a node to avoid age inversion problems during reclaim. If the
> first allocation fails, the batch counts are reset and a second attempt
> is made before entering the slow path.
> 
> One assumption made with this scheme is that batches expire at roughly the
> same time and the resets each time are justified. This assumption does not
> hold when zones reach their low watermark as the batches will be consumed
> at uneven rates.  Allocation failure due to watermark depletion results in
> additional zonelist scans for the reset and another watermark check before
> hitting the slowpath.
> 
> This patch makes a number of changes that should reduce the overall cost
> 
> o Do not apply the fair zone policy to small zones such as DMA
> o Abort the fair zone allocation policy once remote or small zones are
>   encountered
> o Use a simpler scan when resetting NR_ALLOC_BATCH
> o Use a simple flag to identify depleted zones instead of accessing a
>   potentially write-intensive cache line for counters
> o Track zones that met the watermark but failed the NR_ALLOC_BATCH check
>   to avoid doing a rescan of the zonelist when the counters are reset
> 
> On UMA machines, the effect is marginal. Even judging from system CPU
> usage it's small for the tiobench test
> 
>           3.16.0-rc2  3.16.0-rc2
>             checklow    fairzone
> User          396.24      396.23
> System        395.23      391.50
> Elapsed      5182.65     5165.49

The next patch reports fairzone at 5182.86 again, so I'm guessing this
patch is not actually reliably reducing the runtime to 5165.49, that's
just runtime variation.

> And the number of pages allocated from each zone is comparable
> 
>                             3.16.0-rc2  3.16.0-rc2
>                               checklow    fairzone
> DMA allocs                           0           0
> DMA32 allocs                   7374217     7920241
> Normal allocs                999277551   996568115

Wow, the DMA32 zone gets less than 1% of the allocations.  What are
the zone sizes in this machine?

> @@ -1908,6 +1912,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  
>  #endif	/* CONFIG_NUMA */
>  
> +static void reset_alloc_batches(struct zone *preferred_zone)
> +{
> +	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> +
> +	do {
> +		if (!zone_is_fair_depleted(zone))
> +			continue;
> +		mod_zone_page_state(zone, NR_ALLOC_BATCH,
> +			high_wmark_pages(zone) - low_wmark_pages(zone) -
> +			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> +		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> +	} while (zone++ != preferred_zone);

get_page_from_freelist() looks at the batches in zonelist order, why
reset them in node_zones order?  Sure they are the same for all the
cases we care about now, but it's a non-obvious cross-dependency...

Does this even make a measurable difference?  It's a slow path after
you fixed the excessive resets below.
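
For what it's worth, a tiny stand-alone sketch of the two orders being
compared here, with three zones on one node and Normal as the preferred
zone (a simplified illustration only, not kernel code):

#include <stdio.h>

#define NR_ZONES 3

static const char *name[NR_ZONES] = { "DMA", "DMA32", "Normal" };

int main(void)
{
	int preferred = NR_ZONES - 1;	/* Normal is the preferred zone */

	/* node_zones order, as the new reset_alloc_batches() walks it */
	printf("node_zones order:");
	for (int i = 0; i <= preferred; i++)
		printf(" %s", name[i]);

	/* zonelist order, as get_page_from_freelist() sees the zones */
	printf("\nzonelist order:  ");
	for (int i = preferred; i >= 0; i--)
		printf(" %s", name[i]);
	printf("\n");
	return 0;
}

On a single node both walks visit the same zones, just in opposite order,
which is why the result is the same for the cases that matter today.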

> @@ -2073,8 +2093,25 @@ this_zone_full:
>  		 * for !PFMEMALLOC purposes.
>  		 */
>  		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> +		return page;
> +	}
>  
> -	return page;
> +	if ((alloc_flags & ALLOC_FAIR) && nr_fair_skipped) {
> +		alloc_flags &= ~ALLOC_FAIR;
> +		zonelist_rescan = true;
> +		reset_alloc_batches(preferred_zone);
> +	}

Yes, it happens quite often that get_page_from_freelist() fails due to
watermarks while all the batches are fine, so resetting the batches
and rescanning the zonelist is kind of a waste of time.  However, in
this situation, we are waiting for kswapd to make progress on the
watermarks, and it doesn't really matter where we are wasting time...

In this micro benchmark that doesn't really do much besides allocating
and reclaiming IO-less cache pages, the performance difference is less
than 1% with this patch applied:

old: 19.835353264 seconds time elapsed                                          ( +-  0.39% )
new: 19.587258161 seconds time elapsed                                          ( +-  0.34% )

But overall I agree with this particular change.

> @@ -2748,33 +2763,18 @@ retry_cpuset:
>  		goto out;
>  	classzone_idx = zonelist_zone_idx(preferred_zoneref);
>  
> +	if (zonelist->fair_enabled)
> +		alloc_flags |= ALLOC_FAIR;
>  #ifdef CONFIG_CMA
>  	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>  		alloc_flags |= ALLOC_CMA;
>  #endif
> -retry:
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, alloc_flags,
>  			preferred_zone, classzone_idx, migratetype);
>  	if (unlikely(!page)) {
>  		/*
> -		 * The first pass makes sure allocations are spread
> -		 * fairly within the local node.  However, the local
> -		 * node might have free pages left after the fairness
> -		 * batches are exhausted, and remote zones haven't
> -		 * even been considered yet.  Try once more without
> -		 * fairness, and include remote zones now, before
> -		 * entering the slowpath and waking kswapd: prefer
> -		 * spilling to a remote zone over swapping locally.
> -		 */

I wrote this comment, so I don't know how helpful it is to others, but
the retry logic in get_page_from_freelist() seems a little naked
without any explanation.

> -		if (alloc_flags & ALLOC_FAIR) {
> -			reset_alloc_batches(zonelist, high_zoneidx,
> -					    preferred_zone);
> -			alloc_flags &= ~ALLOC_FAIR;
> -			goto retry;
> -		}
> -		/*
>  		 * Runtime PM, block IO and its error handling path
>  		 * can deadlock because I/O on the device might not
>  		 * complete.
> @@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>  
> -static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> +static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
> +			struct zoneref *zoneref, struct zone *preferred_zone)
>  {
> +	int zone_type = zone_idx(zone);
> +	bool fair_enabled = zone_local(zone, preferred_zone);
> +	if (zone_type == 0 &&
> +			zone->managed_pages < (pgdat->node_present_pages >> 4))
> +		fair_enabled = false;

This needs a comment.

>  	zoneref->zone = zone;
> -	zoneref->zone_idx = zone_idx(zone);
> +	zoneref->zone_idx = zone_type;
> +	return fair_enabled;
>  }
>  
>  /*
> @@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
>  {
>  	struct zone *zone;
>  	enum zone_type zone_type = MAX_NR_ZONES;
> +	struct zone *preferred_zone = NULL;
> +	int nr_fair = 0;
>  
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
>  		if (populated_zone(zone)) {
> -			zoneref_set_zone(zone,
> -				&zonelist->_zonerefs[nr_zones++]);
> +			if (!preferred_zone)
> +				preferred_zone = zone;
> +
> +			nr_fair += zoneref_set_zone(pgdat, zone,
> +				&zonelist->_zonerefs[nr_zones++],
> +				preferred_zone);

Passing preferred_zone to determine locality seems pointless when you
walk the zones of a single node.

And the return value of zoneref_set_zone() is fairly unexpected.

It's probably better to determine fair_enabled in the callsite, that
would fix both problems, and write a separate helper that tests if a
zone is eligible for fair treatment (type && managed_pages test).
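
For illustration, a minimal userspace sketch of what such a helper could
look like, using a made-up fake_zone type and invented example sizes rather
than the real kernel structures; the threshold mirrors the
node_present_pages >> 4 test from the patch:

#include <stdbool.h>
#include <stdio.h>

struct fake_zone {
	int zone_idx;			/* 0 == lowest (DMA) zone */
	unsigned long managed_pages;
	unsigned long node_present_pages;
};

/* The "type && managed_pages" test: skip a tiny lowest zone */
static bool zone_fair_eligible(const struct fake_zone *z)
{
	if (z->zone_idx == 0 &&
	    z->managed_pages < (z->node_present_pages >> 4))
		return false;
	return true;
}

int main(void)
{
	struct fake_zone dma    = { 0, 4000,    2000000 };
	struct fake_zone normal = { 2, 1200000, 2000000 };

	printf("DMA fair-eligible: %d\n", zone_fair_eligible(&dma));
	printf("Normal fair-eligible: %d\n", zone_fair_eligible(&normal));
	return 0;
}

With sizes like these the lowest zone drops out (prints 0) while the others
stay eligible, which is what the zonelist-building callsite would then
count into nr_fair.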

>  			check_highest_zone(zone_type);
>  		}
>  	} while (zone_type);
>  
> +	if (nr_fair <= 1)
> +		zonelist->fair_enabled = false;
> +
>  	return nr_zones;
>  }
>  
> @@ -3538,8 +3555,9 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
>  {
>  	int pos, j, node;
>  	int zone_type;		/* needs to be signed */
> -	struct zone *z;
> +	struct zone *z, *preferred_zone = NULL;
>  	struct zonelist *zonelist;
> +	int nr_fair = 0;
>  
>  	zonelist = &pgdat->node_zonelists[0];
>  	pos = 0;
> @@ -3547,15 +3565,25 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
>  		for (j = 0; j < nr_nodes; j++) {
>  			node = node_order[j];
>  			z = &NODE_DATA(node)->node_zones[zone_type];
> +			if (!preferred_zone)
> +				preferred_zone = z;
>  			if (populated_zone(z)) {
> -				zoneref_set_zone(z,
> -					&zonelist->_zonerefs[pos++]);
> +				nr_fair += zoneref_set_zone(pgdat, z,
> +					&zonelist->_zonerefs[pos++],
> +					preferred_zone);
>  				check_highest_zone(zone_type);
>  			}
>  		}
>  	}
>  	zonelist->_zonerefs[pos].zone = NULL;
>  	zonelist->_zonerefs[pos].zone_idx = 0;
> +
> +	/*
> +	 * For this policy, the fair zone allocation policy is disabled as the
> +	 * stated priority is to preserve lower zones, not balance them fairly.
> +	 */
> +	if (nr_fair == 1 || nr_online_nodes > 1)
> +		zonelist->fair_enabled = false;
>  }
>  
>  static int default_zonelist_order(void)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-27 18:57     ` Johannes Weiner
@ 2014-06-27 19:25       ` Mel Gorman
  -1 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2014-06-27 19:25 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 02:57:00PM -0400, Johannes Weiner wrote:
> On Fri, Jun 27, 2014 at 09:14:39AM +0100, Mel Gorman wrote:
> > The fair zone allocation policy round-robins allocations between zones
> > within a node to avoid age inversion problems during reclaim. If the
> > first allocation fails, the batch counts are reset and a second attempt
> > is made before entering the slow path.
> > 
> > One assumption made with this scheme is that batches expire at roughly the
> > same time and the resets each time are justified. This assumption does not
> > hold when zones reach their low watermark as the batches will be consumed
> > at uneven rates.  Allocation failure due to watermark depletion results in
> > additional zonelist scans for the reset and another watermark check before
> > hitting the slowpath.
> > 
> > This patch makes a number of changes that should reduce the overall cost
> > 
> > o Do not apply the fair zone policy to small zones such as DMA
> > o Abort the fair zone allocation policy once remote or small zones are
> >   encountered
> > o Use a simpler scan when resetting NR_ALLOC_BATCH
> > o Use a simple flag to identify depleted zones instead of accessing a
> >   potentially write-intensive cache line for counters
> > o Track zones that met the watermark but failed the NR_ALLOC_BATCH check
> >   to avoid doing a rescan of the zonelist when the counters are reset
> > 
> > On UMA machines, the effect is marginal. Even judging from system CPU
> > usage it's small for the tiobench test
> > 
> >           3.16.0-rc2  3.16.0-rc2
> >             checklow    fairzone
> > User          396.24      396.23
> > System        395.23      391.50
> > Elapsed      5182.65     5165.49
> 
> The next patch reports fairzone at 5182.86 again, so I'm guessing this
> patch is not actually reliably reducing the runtime to 5165.49, that's
> just runtime variation.
> 

The change is going to be marginal enough that it'll depend on a host of
issues, including how much merging/splitting the page allocator does.

> > And the number of pages allocated from each zone is comparable
> > 
> >                             3.16.0-rc2  3.16.0-rc2
> >                               checklow    fairzone
> > DMA allocs                           0           0
> > DMA32 allocs                   7374217     7920241
> > Normal allocs                999277551   996568115
> 
> Wow, the DMA32 zone gets less than 1% of the allocations.  What are
> the zone sizes in this machine?
> 

        managed  3976
        managed  755409
        managed  1281601

> > @@ -1908,6 +1912,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> >  
> >  #endif	/* CONFIG_NUMA */
> >  
> > +static void reset_alloc_batches(struct zone *preferred_zone)
> > +{
> > +	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> > +
> > +	do {
> > +		if (!zone_is_fair_depleted(zone))
> > +			continue;
> > +		mod_zone_page_state(zone, NR_ALLOC_BATCH,
> > +			high_wmark_pages(zone) - low_wmark_pages(zone) -
> > +			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> > +		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> > +	} while (zone++ != preferred_zone);
> 
> get_page_from_freelist() looks at the batches in zonelist order, why
> reset them in node_zones order?  Sure they are the same for all the
> cases we care about now, but it's a non-obvious cross-dependency...
> 

There is no functional difference at the end of the day.

> Does this even make a measurable difference?  It's a slow path after
> you fixed the excessive resets below.
> 

Again, it's very borderline, but doing it this way avoids function calls to
restart the zonelist walk.

> > @@ -2073,8 +2093,25 @@ this_zone_full:
> >  		 * for !PFMEMALLOC purposes.
> >  		 */
> >  		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> > +		return page;
> > +	}
> >  
> > -	return page;
> > +	if ((alloc_flags & ALLOC_FAIR) && nr_fair_skipped) {
> > +		alloc_flags &= ~ALLOC_FAIR;
> > +		zonelist_rescan = true;
> > +		reset_alloc_batches(preferred_zone);
> > +	}
> 
> Yes, it happens quite often that get_page_from_freelist() fails due to
> watermarks while all the batches are fine, so resetting the batches
> and rescanning the zonelist is kind of a waste of time.

Yep

> However, in
> this situation, we are waiting for kswapd to make progress on the
> watermarks, and it doesn't really matter where we are wasting time...
> 

Or we can enter the slowpath sooner and get something useful done.

> In this micro benchmark that doesn't really do much besides allocating
> and reclaiming IO-less cache pages, the performance difference is less
> than 1% with this patch applied:
> 
> old: 19.835353264 seconds time elapsed                                          ( +-  0.39% )
> new: 19.587258161 seconds time elapsed                                          ( +-  0.34% )
> 
> But overall I agree with this particular change.
> 
> > @@ -2748,33 +2763,18 @@ retry_cpuset:
> >  		goto out;
> >  	classzone_idx = zonelist_zone_idx(preferred_zoneref);
> >  
> > +	if (zonelist->fair_enabled)
> > +		alloc_flags |= ALLOC_FAIR;
> >  #ifdef CONFIG_CMA
> >  	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >  		alloc_flags |= ALLOC_CMA;
> >  #endif
> > -retry:
> >  	/* First allocation attempt */
> >  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> >  			zonelist, high_zoneidx, alloc_flags,
> >  			preferred_zone, classzone_idx, migratetype);
> >  	if (unlikely(!page)) {
> >  		/*
> > -		 * The first pass makes sure allocations are spread
> > -		 * fairly within the local node.  However, the local
> > -		 * node might have free pages left after the fairness
> > -		 * batches are exhausted, and remote zones haven't
> > -		 * even been considered yet.  Try once more without
> > -		 * fairness, and include remote zones now, before
> > -		 * entering the slowpath and waking kswapd: prefer
> > -		 * spilling to a remote zone over swapping locally.
> > -		 */
> 
> I wrote this comment, so I don't know how helpful it is to others, but
> the retry logic in get_page_from_freelist() seems a little naked
> without any explanation.
> 

I'll preserve the comment

> > -		if (alloc_flags & ALLOC_FAIR) {
> > -			reset_alloc_batches(zonelist, high_zoneidx,
> > -					    preferred_zone);
> > -			alloc_flags &= ~ALLOC_FAIR;
> > -			goto retry;
> > -		}
> > -		/*
> >  		 * Runtime PM, block IO and its error handling path
> >  		 * can deadlock because I/O on the device might not
> >  		 * complete.
> > @@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
> >  	show_swap_cache_info();
> >  }
> >  
> > -static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > +static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
> > +			struct zoneref *zoneref, struct zone *preferred_zone)
> >  {
> > +	int zone_type = zone_idx(zone);
> > +	bool fair_enabled = zone_local(zone, preferred_zone);
> > +	if (zone_type == 0 &&
> > +			zone->managed_pages < (pgdat->node_present_pages >> 4))
> > +		fair_enabled = false;
> 
> This needs a comment.
> 

        /*
         * Do not count the lowest zone as relevant to the fair zone
         * allocation policy if it is only a small percentage of the node
         */

However, as I write this I'll look at getting rid of this entirely. It
made some sense when fair_eligible was tracked on a per-zone basis but
it's more complex than necessary.

> >  	zoneref->zone = zone;
> > -	zoneref->zone_idx = zone_idx(zone);
> > +	zoneref->zone_idx = zone_type;
> > +	return fair_enabled;
> >  }
> >  
> >  /*
> > @@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
> >  {
> >  	struct zone *zone;
> >  	enum zone_type zone_type = MAX_NR_ZONES;
> > +	struct zone *preferred_zone = NULL;
> > +	int nr_fair = 0;
> >  
> >  	do {
> >  		zone_type--;
> >  		zone = pgdat->node_zones + zone_type;
> >  		if (populated_zone(zone)) {
> > -			zoneref_set_zone(zone,
> > -				&zonelist->_zonerefs[nr_zones++]);
> > +			if (!preferred_zone)
> > +				preferred_zone = zone;
> > +
> > +			nr_fair += zoneref_set_zone(pgdat, zone,
> > +				&zonelist->_zonerefs[nr_zones++],
> > +				preferred_zone);
> 
> Passing preferred_zone to determine locality seems pointless when you
> walk the zones of a single node.
> 

True.

> And the return value of zoneref_set_zone() is fairly unexpected.
> 

How so?

> It's probably better to determine fair_enabled in the callsite, that
> would fix both problems, and write a separate helper that tests if a
> zone is eligible for fair treatment (type && managed_pages test).
> 

Are you thinking of putting that into the page allocator fast path? I'm
trying to take stuff out of there :/.

An alternative would be to look one zone ahead in the zonelist and check
zone_local there. That would get most of the intent and be cheaper than
a check on the zone or pgdat size.
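
A minimal sketch of one possible reading of that alternative, with made-up
stand-in types (fake_zoneref, fair_eligible) rather than the kernel's: a
zoneref stays fair-eligible only while the next entry in the zonelist is
still on the same node, so the last (lowest) local zone drops out without
reading zone->managed_pages or the pgdat size.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct fake_zoneref {
	int node;		/* node this zone belongs to */
	int zone_idx;		/* 2 = Normal, 1 = DMA32, 0 = DMA */
};

static bool zone_local(const struct fake_zoneref *a,
		       const struct fake_zoneref *b)
{
	return a->node == b->node;
}

/* Fair-eligible if the zoneref one step ahead is still local */
static bool fair_eligible(const struct fake_zoneref *zl, size_t nr, size_t i)
{
	return i + 1 < nr && zone_local(&zl[i + 1], &zl[i]);
}

int main(void)
{
	/* Node 0's zones in zonelist order, then node 1's */
	struct fake_zoneref zl[] = {
		{ 0, 2 }, { 0, 1 }, { 0, 0 }, { 1, 2 }, { 1, 1 },
	};
	size_t nr = sizeof(zl) / sizeof(zl[0]);

	for (size_t i = 0; i < nr; i++)
		printf("node %d zone %d fair-eligible: %d\n",
		       zl[i].node, zl[i].zone_idx,
		       (int)fair_eligible(zl, nr, i));
	return 0;
}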

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-06-27 19:25       ` Mel Gorman
@ 2014-06-30 14:41         ` Johannes Weiner
  -1 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2014-06-30 14:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 08:25:37PM +0100, Mel Gorman wrote:
> On Fri, Jun 27, 2014 at 02:57:00PM -0400, Johannes Weiner wrote:
> > On Fri, Jun 27, 2014 at 09:14:39AM +0100, Mel Gorman wrote:
> > > And the number of pages allocated from each zone is comparable
> > > 
> > >                             3.16.0-rc2  3.16.0-rc2
> > >                               checklow    fairzone
> > > DMA allocs                           0           0
> > > DMA32 allocs                   7374217     7920241
> > > Normal allocs                999277551   996568115
> > 
> > Wow, the DMA32 zone gets less than 1% of the allocations.  What are
> > the zone sizes in this machine?
> > 
> 
>         managed  3976
>         managed  755409
>         managed  1281601

Something seems way off with this.  On my system here, the DMA32 zone
makes up 20% of managed pages and it gets roughly 20% of the page
allocations, as I would expect.

Your DMA32 zone makes up 37% of the managed pages but receives
merely 0.7% of the page allocations.  Unless a large portion of that
zone is somehow unreclaimable, fairness seems completely obliterated
in both kernels.

Is that checklow's doing?
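
For reference, the percentages above follow from the managed-pages figures
and the checklow allocation counts quoted earlier in the thread; a
throwaway userspace snippet to show the arithmetic, not kernel code:

#include <stdio.h>

int main(void)
{
	/* managed pages reported for DMA, DMA32 and Normal */
	double managed_dma    = 3976;
	double managed_dma32  = 755409;
	double managed_normal = 1281601;
	double total_managed  = managed_dma + managed_dma32 + managed_normal;

	/* allocation counts from the checklow column */
	double allocs_dma32  = 7374217;
	double allocs_normal = 999277551;
	double total_allocs  = allocs_dma32 + allocs_normal;

	printf("DMA32 share of managed pages: %.1f%%\n",
	       100.0 * managed_dma32 / total_managed);	/* ~37.0% */

	printf("DMA32 share of allocations:   %.1f%%\n",
	       100.0 * allocs_dma32 / total_allocs);	/* ~0.7% */
	return 0;
}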

> > > @@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
> > >  	show_swap_cache_info();
> > >  }
> > >  
> > > -static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > > +static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
> > > +			struct zoneref *zoneref, struct zone *preferred_zone)
> > >  {
> > > +	int zone_type = zone_idx(zone);
> > > +	bool fair_enabled = zone_local(zone, preferred_zone);
> > > +	if (zone_type == 0 &&
> > > +			zone->managed_pages < (pgdat->node_present_pages >> 4))
> > > +		fair_enabled = false;
> > 
> > This needs a comment.
> > 
> 
>         /*
>          * Do not count the lowest zone as relevant to the fair zone
>          * allocation policy if it is only a small percentage of the node
>          */
> 
> However, as I write this I'll look at getting rid of this entirely. It
> made some sense when fair_eligible was tracked on a per-zone basis but
> it's more complex than necessary.
>
> > >  	zoneref->zone = zone;
> > > -	zoneref->zone_idx = zone_idx(zone);
> > > +	zoneref->zone_idx = zone_type;
> > > +	return fair_enabled;
> > >  }
> > >  
> > >  /*
> > > @@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
> > >  {
> > >  	struct zone *zone;
> > >  	enum zone_type zone_type = MAX_NR_ZONES;
> > > +	struct zone *preferred_zone = NULL;
> > > +	int nr_fair = 0;
> > >  
> > >  	do {
> > >  		zone_type--;
> > >  		zone = pgdat->node_zones + zone_type;
> > >  		if (populated_zone(zone)) {
> > > -			zoneref_set_zone(zone,
> > > -				&zonelist->_zonerefs[nr_zones++]);
> > > +			if (!preferred_zone)
> > > +				preferred_zone = zone;
> > > +
> > > +			nr_fair += zoneref_set_zone(pgdat, zone,
> > > +				&zonelist->_zonerefs[nr_zones++],
> > > +				preferred_zone);
> > 
> > Passing preferred_zone to determine locality seems pointless when you
> > walk the zones of a single node.
> > 
> 
> True.
> 
> > And the return value of zoneref_set_zone() is fairly unexpected.
> > 
> 
> How so?

Given the name zoneref_set_zone(), I wouldn't expect any return value,
or a success/failure type return value at best - certainly not whether
the passed zone is eligible for the fairness policy.

> > It's probably better to determine fair_enabled in the callsite, that
> > would fix both problems, and write a separate helper that tests if a
> > zone is eligible for fair treatment (type && managed_pages test).
> > 
> 
> Are you thinking of putting that into the page allocator fast path? I'm
> trying to take stuff out of there :/.

Not at all, I was just suggesting to restructure the code for building
the zonelists, and move the fairness stuff out of zoneref_set_zone().

If you remove the small-zone exclusion as per above, this only leaves
the locality check when building the zonelist in zone order and that
can easily be checked inline in build_zonelists_in_zone_order().

build_zonelists_node() can just count every populated zone in nr_fair.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy
@ 2014-06-30 14:41         ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2014-06-30 14:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Fri, Jun 27, 2014 at 08:25:37PM +0100, Mel Gorman wrote:
> On Fri, Jun 27, 2014 at 02:57:00PM -0400, Johannes Weiner wrote:
> > On Fri, Jun 27, 2014 at 09:14:39AM +0100, Mel Gorman wrote:
> > > And the number of pages allocated from each zone is comparable
> > > 
> > >                             3.16.0-rc2  3.16.0-rc2
> > >                               checklow    fairzone
> > > DMA allocs                           0           0
> > > DMA32 allocs                   7374217     7920241
> > > Normal allocs                999277551   996568115
> > 
> > Wow, the DMA32 zone gets less than 1% of the allocations.  What are
> > the zone sizes in this machine?
> > 
> 
>         managed  3976
>         managed  755409
>         managed  1281601

Something seems way off with this.  On my system here, the DMA32 zone
makes up for 20% of managed pages and it gets roughly 20% of the page
allocations, as I would expect.

Your DMA32 zone makes up for 37% of the managed pages and receives
merely 0.7% of the page allocations.  Unless a large portion of that
zone is somehow unreclaimable, fairness seems completely obliberated
in both kernels.

Is that checklow's doing?

> > > @@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
> > >  	show_swap_cache_info();
> > >  }
> > >  
> > > -static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > > +static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
> > > +			struct zoneref *zoneref, struct zone *preferred_zone)
> > >  {
> > > +	int zone_type = zone_idx(zone);
> > > +	bool fair_enabled = zone_local(zone, preferred_zone);
> > > +	if (zone_type == 0 &&
> > > +			zone->managed_pages < (pgdat->node_present_pages >> 4))
> > > +		fair_enabled = false;
> > 
> > This needs a comment.
> > 
> 
>         /*
>          * Do not count the lowest zone as relevant to the fair zone
>          * allocation policy if it is only a small percentage of the node
>          */
> 
> However, as I write this, I'll look at getting rid of this check entirely.
> It made some sense when fair_eligible was tracked on a per-zone basis,
> but it's more complex than necessary.
>
> > >  	zoneref->zone = zone;
> > > -	zoneref->zone_idx = zone_idx(zone);
> > > +	zoneref->zone_idx = zone_type;
> > > +	return fair_enabled;
> > >  }
> > >  
> > >  /*
> > > @@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
> > >  {
> > >  	struct zone *zone;
> > >  	enum zone_type zone_type = MAX_NR_ZONES;
> > > +	struct zone *preferred_zone = NULL;
> > > +	int nr_fair = 0;
> > >  
> > >  	do {
> > >  		zone_type--;
> > >  		zone = pgdat->node_zones + zone_type;
> > >  		if (populated_zone(zone)) {
> > > -			zoneref_set_zone(zone,
> > > -				&zonelist->_zonerefs[nr_zones++]);
> > > +			if (!preferred_zone)
> > > +				preferred_zone = zone;
> > > +
> > > +			nr_fair += zoneref_set_zone(pgdat, zone,
> > > +				&zonelist->_zonerefs[nr_zones++],
> > > +				preferred_zone);
> > 
> > Passing preferred_zone to determine locality seems pointless when you
> > walk the zones of a single node.
> > 
> 
> True.
> 
> > And the return value of zoneref_set_zone() is fairly unexpected.
> > 
> 
> How so?

Given the name zoneref_set_zone(), I wouldn't expect any return value,
or a success/failure type return value at best - certainly not whether
the passed zone is eligible for the fairness policy.

> > It's probably better to determine fair_enabled at the callsite (that
> > would fix both problems) and write a separate helper that tests whether
> > a zone is eligible for fair treatment (type && managed_pages test).
> > 
> 
> Are you thinking of putting that into the page allocator fast path? I'm
> trying to take stuff out of there :/.

Not at all, I was just suggesting restructuring the code that builds
the zonelists and moving the fairness handling out of zoneref_set_zone().

If you remove the small-zone exclusion as discussed above, that only
leaves the locality check when building the zonelist in zone order,
and that can easily be done inline in build_zonelists_in_zone_order().

build_zonelists_node() can just count every populated zone in nr_fair.
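
For illustration only, the counting variant being suggested might look
roughly like this (the extra nr_fair parameter is a guess, since the quoted
hunks do not show where the count ends up; check_highest_zone() and other
details are omitted):

        static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
                                        int nr_zones, int *nr_fair)
        {
                struct zone *zone;
                enum zone_type zone_type = MAX_NR_ZONES;

                do {
                        zone_type--;
                        zone = pgdat->node_zones + zone_type;
                        if (populated_zone(zone)) {
                                /* zoneref_set_zone() is back to a void setter */
                                zoneref_set_zone(zone,
                                        &zonelist->_zonerefs[nr_zones++]);
                                /* every populated zone of the local node counts */
                                (*nr_fair)++;
                        }
                } while (zone_type);

                return nr_zones;
        }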


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-06-30 14:41 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-27  8:14 [PATCH 0/5] Improve sequential read throughput v3 Mel Gorman
2014-06-27  8:14 ` Mel Gorman
2014-06-27  8:14 ` [PATCH 1/5] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
2014-06-27  8:14   ` Mel Gorman
2014-06-27  8:14 ` [PATCH 2/5] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
2014-06-27  8:14   ` Mel Gorman
2014-06-27  8:14 ` [PATCH 3/5] mm: vmscan: Do not reclaim from lower zones if they are balanced Mel Gorman
2014-06-27  8:14   ` Mel Gorman
2014-06-27 17:26   ` Johannes Weiner
2014-06-27 17:26     ` Johannes Weiner
2014-06-27 18:42     ` Mel Gorman
2014-06-27 18:42       ` Mel Gorman
2014-06-27  8:14 ` [PATCH 4/5] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
2014-06-27  8:14   ` Mel Gorman
2014-06-27 18:57   ` Johannes Weiner
2014-06-27 18:57     ` Johannes Weiner
2014-06-27 19:25     ` Mel Gorman
2014-06-27 19:25       ` Mel Gorman
2014-06-30 14:41       ` Johannes Weiner
2014-06-30 14:41         ` Johannes Weiner
2014-06-27  8:14 ` [PATCH 5/5] mm: page_alloc: Reduce cost of dirty zone balancing Mel Gorman
2014-06-27  8:14   ` Mel Gorman
