linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Reduce sequential read overhead
@ 2014-07-09  8:13 Mel Gorman
  2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

This was formerly the series "Improve sequential read throughput" which
noted some major differences in performance of tiobench since 3.0. While
there are a number of factors, two that dominated were the introduction
of the fair zone allocation policy and changes to CFQ.

The behaviour of the fair zone allocation policy makes more sense than
tiobench does as a benchmark, and the CFQ defaults were not changed due to
insufficient benchmarking.

This series is what's left. It's one functional fix to the fair zone
allocation policy when used on NUMA machines and a reduction of overhead
in general. tiobench was used for the comparison despite its flaws as an
IO benchmark because, in this case, we are primarily interested in the
overhead of page allocator and page reclaim activity.

On UMA, it makes little difference to overhead

          3.16.0-rc3   3.16.0-rc3
             vanilla lowercost-v5
User          383.61      386.77
System        403.83      401.74
Elapsed      5411.50     5413.11

On a 4-socket NUMA machine it's a bit more noticeable

          3.16.0-rc3   3.16.0-rc3
             vanilla lowercost-v5
User          746.94      802.00
System      65336.22    40852.33
Elapsed     27553.52    27368.46

 include/linux/mmzone.h         | 217 ++++++++++++++++++++++-------------------
 include/trace/events/pagemap.h |  16 ++-
 mm/page_alloc.c                | 122 ++++++++++++-----------
 mm/swap.c                      |   4 +-
 mm/vmscan.c                    |   7 +-
 mm/vmstat.c                    |   9 +-
 6 files changed, 198 insertions(+), 177 deletions(-)

-- 
1.8.4.5



* [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:01   ` Johannes Weiner
  2014-07-09  8:13 ` [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The LRU insertion and activate tracepoints take a PFN as a parameter, forcing
the lookup overhead onto every caller.  Move the overhead into the tracepoint
fast-assign method to ensure the cost is only incurred when the tracepoint is
active.
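
For readers less familiar with the tracepoint machinery, the saving comes from
the fact that call-site arguments are evaluated by the caller, while work done
inside TP_fast_assign() only runs when the event is enabled. A standalone
sketch of the same idea in plain C (illustration only, not kernel code; every
name below is invented):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool trace_enabled;  /* stands in for the tracepoint static key */

/* stands in for page_to_pfn() / trace_pagemap_flags() */
static unsigned long expensive_lookup(const void *obj)
{
        return (unsigned long)(uintptr_t)obj >> 12;
}

/* Old style: the caller computes the value, so the lookup runs even
 * when tracing is disabled. */
static void trace_event_old(unsigned long key)
{
        if (!trace_enabled)
                return;
        printf("event key=%lu\n", key);
}

/* New style: the lookup happens after the enabled check, mirroring a
 * TP_fast_assign() that calls page_to_pfn() itself. */
static void trace_event_new(const void *obj)
{
        if (!trace_enabled)
                return;
        printf("event key=%lu\n", expensive_lookup(obj));
}

int main(void)
{
        int object;

        trace_event_old(expensive_lookup(&object)); /* lookup always paid */
        trace_event_new(&object);                   /* lookup skipped while disabled */
        return 0;
}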

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/pagemap.h | 16 +++++++---------
 mm/swap.c                      |  4 ++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 1c9fabd..ce0803b 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -28,12 +28,10 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_PROTO(
 		struct page *page,
-		unsigned long pfn,
-		int lru,
-		unsigned long flags
+		int lru
 	),
 
-	TP_ARGS(page, pfn, lru, flags),
+	TP_ARGS(page, lru),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -44,9 +42,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 		__entry->lru	= lru;
-		__entry->flags	= flags;
+		__entry->flags	= trace_pagemap_flags(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
@@ -64,9 +62,9 @@ TRACE_EVENT(mm_lru_insertion,
 
 TRACE_EVENT(mm_lru_activate,
 
-	TP_PROTO(struct page *page, unsigned long pfn),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, pfn),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(struct page *,	page	)
@@ -75,7 +73,7 @@ TRACE_EVENT(mm_lru_activate,
 
 	TP_fast_assign(
 		__entry->page	= page;
-		__entry->pfn	= pfn;
+		__entry->pfn	= page_to_pfn(page);
 	),
 
 	/* Flag format is based on page-types.c formatting for pagemap */
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..d10be45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
-		trace_mm_lru_activate(page, page_to_pfn(page));
+		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
@@ -996,7 +996,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 	update_page_reclaim_stat(lruvec, file, active);
-	trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
+	trace_mm_lru_insertion(page, lru);
 }
 
 /*
-- 
1.8.4.5



* [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
  2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:06   ` Johannes Weiner
  2014-07-09  8:13 ` [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The arrangement of struct zone has changed over time and now it has reached the
point where there is some inappropriate sharing going on. On x86-64 for example

o The zone->node field is shared with the zone lock and zone->node is accessed
  frequently from the page allocator due to the fair zone allocation policy.
o span_seqlock is almost never used but shares a cache line with free_area
o Some zone statistics share a cache line with the LRU lock so reclaim-intensive
  and allocator-intensive workloads can bounce the cache line on a stat update

This patch rearranges struct zone to put read-only and read-mostly fields
together and then splits the page allocator intensive fields, the zone
statistics and the page reclaim intensive fields into their own cache
lines. Note that the type of lowmem_reserve changes due to the watermark
calculations being signed and avoiding a signed/unsigned conversion there.
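
For anyone who wants the layout trick in isolation: keep read-mostly data and
write-hot data on separate cache lines so that a lock or counter update on one
CPU does not invalidate the line that readers on other CPUs need. A minimal
userspace sketch of the pattern (illustration only; struct zone does this with
field ordering and ZONE_PADDING(), and the 64-byte line size is an assumption):

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE_BYTES 64      /* assumption: x86-64 style 64-byte lines */

struct zone_like {
        /* Read-mostly: checked on every allocation, almost never written */
        unsigned long watermark[3];
        long lowmem_reserve[4];

        /* Write-intensive allocator fields start on their own line, the
         * same effect ZONE_PADDING(_pad1_) has in struct zone */
        alignas(CACHELINE_BYTES) long free_pages;

        /* Write-intensive reclaim fields get yet another line */
        alignas(CACHELINE_BYTES) long pages_scanned;
};

int main(void)
{
        printf("free_pages at %zu, pages_scanned at %zu, size %zu\n",
               offsetof(struct zone_like, free_pages),
               offsetof(struct zone_like, pages_scanned),
               sizeof(struct zone_like));
        return 0;
}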

On the test configuration I used the overall size of struct zone shrunk
by one cache line. On smaller machines, this is not likely to be noticeable.
However, on a 4-node NUMA machine running tiobench the system CPU overhead
is reduced by this patch.

          3.16.0-rc3  3.16.0-rc3
             vanillarearrange-v5r9
User          746.94      759.78
System      65336.22    58350.98
Elapsed     27553.52    27282.02

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h | 211 +++++++++++++++++++++++++------------------------
 mm/page_alloc.c        |   7 +-
 mm/vmstat.c            |   4 +-
 3 files changed, 113 insertions(+), 109 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cbd1b6..fbadc45 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -324,19 +324,12 @@ enum zone_type {
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
-	/* Fields commonly accessed by the page allocator */
+	/* Read-mostly fields */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
 	/*
-	 * When free pages are below this point, additional steps are taken
-	 * when reading the number of free pages to avoid per-cpu counter
-	 * drift allowing watermarks to be breached
-	 */
-	unsigned long percpu_drift_mark;
-
-	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -344,41 +337,26 @@ struct zone {
 	 * on the higher zones). This array is recalculated at runtime if the
 	 * sysctl_lowmem_reserve_ratio sysctl changes.
 	 */
-	unsigned long		lowmem_reserve[MAX_NR_ZONES];
-
-	/*
-	 * This is a per-zone reserve of pages that should not be
-	 * considered dirtyable memory.
-	 */
-	unsigned long		dirty_balance_reserve;
+	long lowmem_reserve[MAX_NR_ZONES];
 
 #ifdef CONFIG_NUMA
 	int node;
+#endif
+
 	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this zone's LRU.  Maintained by the pageout code.
 	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
-#endif
+	unsigned int inactive_ratio;
+
+	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
+
 	/*
-	 * free areas of different sizes
+	 * This is a per-zone reserve of pages that should not be
+	 * considered dirtyable memory.
 	 */
-	spinlock_t		lock;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
-	/* pfn where compaction free scanner should start */
-	unsigned long		compact_cached_free_pfn;
-	/* pfn where async and sync compaction migration scanner should start */
-	unsigned long		compact_cached_migrate_pfn[2];
-#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* see spanned/present_pages for more description */
-	seqlock_t		span_seqlock;
-#endif
-	struct free_area	free_area[MAX_ORDER];
+	unsigned long		dirty_balance_reserve;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
@@ -388,74 +366,14 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_COMPACTION
-	/*
-	 * On compaction failure, 1<<compact_defer_shift compactions
-	 * are skipped before trying again. The number attempted since
-	 * last failure is tracked with compact_considered.
-	 */
-	unsigned int		compact_considered;
-	unsigned int		compact_defer_shift;
-	int			compact_order_failed;
-#endif
-
-	ZONE_PADDING(_pad1_)
-
-	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
-	struct lruvec		lruvec;
-
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
-	unsigned long		pages_scanned;	   /* since last reclaim */
-	unsigned long		flags;		   /* zone flags, see below */
-
-	/* Zone statistics */
-	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
-
-	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
-	 */
-	unsigned int inactive_ratio;
-
-
-	ZONE_PADDING(_pad2_)
-	/* Rarely used or read-mostly fields */
-
+#ifdef CONFIG_NUMA
 	/*
-	 * wait_table		-- the array holding the hash table
-	 * wait_table_hash_nr_entries	-- the size of the hash table array
-	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
-	 *
-	 * The purpose of all these is to keep track of the people
-	 * waiting for a page to become available and make them
-	 * runnable again when possible. The trouble is that this
-	 * consumes a lot of space, especially when so few things
-	 * wait on pages at a given time. So instead of using
-	 * per-page waitqueues, we use a waitqueue hash table.
-	 *
-	 * The bucket discipline is to sleep on the same queue when
-	 * colliding and wake all in that wait queue when removing.
-	 * When something wakes, it must check to be sure its page is
-	 * truly available, a la thundering herd. The cost of a
-	 * collision is great, but given the expected load of the
-	 * table, they should be so rare as to be outweighed by the
-	 * benefits from the saved space.
-	 *
-	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
-	 * primary users of these fields, and in mm/page_alloc.c
-	 * free_area_init_core() performs the initialization of them.
+	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	wait_queue_head_t	* wait_table;
-	unsigned long		wait_table_hash_nr_entries;
-	unsigned long		wait_table_bits;
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
 
-	/*
-	 * Discontig memory support fields.
-	 */
-	struct pglist_data	*zone_pgdat;
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -500,9 +418,11 @@ struct zone {
 	 * adjust_managed_page_count() should be used instead of directly
 	 * touching zone->managed_pages and totalram_pages.
 	 */
+	unsigned long		managed_pages;
 	unsigned long		spanned_pages;
 	unsigned long		present_pages;
-	unsigned long		managed_pages;
+
+	const char		*name;
 
 	/*
 	 * Number of MIGRATE_RESEVE page block. To maintain for just
@@ -510,10 +430,95 @@ struct zone {
 	 */
 	int			nr_migrate_reserve_block;
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/* see spanned/present_pages for more description */
+	seqlock_t		span_seqlock;
+#endif
+
 	/*
-	 * rarely used fields:
+	 * wait_table		-- the array holding the hash table
+	 * wait_table_hash_nr_entries	-- the size of the hash table array
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
+	 *
+	 * The purpose of all these is to keep track of the people
+	 * waiting for a page to become available and make them
+	 * runnable again when possible. The trouble is that this
+	 * consumes a lot of space, especially when so few things
+	 * wait on pages at a given time. So instead of using
+	 * per-page waitqueues, we use a waitqueue hash table.
+	 *
+	 * The bucket discipline is to sleep on the same queue when
+	 * colliding and wake all in that wait queue when removing.
+	 * When something wakes, it must check to be sure its page is
+	 * truly available, a la thundering herd. The cost of a
+	 * collision is great, but given the expected load of the
+	 * table, they should be so rare as to be outweighed by the
+	 * benefits from the saved space.
+	 *
+	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
+	 * primary users of these fields, and in mm/page_alloc.c
+	 * free_area_init_core() performs the initialization of them.
 	 */
-	const char		*name;
+	wait_queue_head_t	*wait_table;
+	unsigned long		wait_table_hash_nr_entries;
+	unsigned long		wait_table_bits;
+
+	ZONE_PADDING(_pad1_)
+
+	/* Write-intensive fields used from the page allocator */
+	spinlock_t		lock;
+
+	/* free areas of different sizes */
+	struct free_area	free_area[MAX_ORDER];
+
+	/* zone flags, see below */
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
+	/* Write-intensive fields used by page reclaim */
+
+	/* Fields commonly accessed by the page reclaim scanner */
+	spinlock_t		lru_lock;
+	unsigned long		pages_scanned;	   /* since last reclaim */
+	struct lruvec		lruvec;
+
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
+	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where compaction free scanner should start */
+	unsigned long		compact_cached_free_pfn;
+	/* pfn where async and sync compaction migration scanner should start */
+	unsigned long		compact_cached_migrate_pfn[2];
+#endif
+
+#ifdef CONFIG_COMPACTION
+	/*
+	 * On compaction failure, 1<<compact_defer_shift compactions
+	 * are skipped before trying again. The number attempted since
+	 * last failure is tracked with compact_considered.
+	 */
+	unsigned int		compact_considered;
+	unsigned int		compact_defer_shift;
+	int			compact_order_failed;
+#endif
+
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
+#endif
+
+	ZONE_PADDING(_pad3_)
+	/* Zone statistics */
+	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20d17f8..c607acd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1700,7 +1700,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long lowmem_reserve = z->lowmem_reserve[classzone_idx];
 	int o;
 	long free_cma = 0;
 
@@ -1715,7 +1714,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + lowmem_reserve)
+	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -3246,7 +3245,7 @@ void show_free_areas(unsigned int filter)
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
-			printk(" %lu", zone->lowmem_reserve[i]);
+			printk(" %ld", zone->lowmem_reserve[i]);
 		printk("\n");
 	}
 
@@ -5567,7 +5566,7 @@ static void calculate_totalreserve_pages(void)
 	for_each_online_pgdat(pgdat) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			unsigned long max = 0;
+			long max = 0;
 
 			/* Find valid and maximum lowmem_reserve in the zone */
 			for (j = i; j < MAX_NR_ZONES; j++) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49..8267f77 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1077,10 +1077,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 				zone_page_state(zone, i));
 
 	seq_printf(m,
-		   "\n        protection: (%lu",
+		   "\n        protection: (%ld",
 		   zone->lowmem_reserve[0]);
 	for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
-		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+		seq_printf(m, ", %ld", zone->lowmem_reserve[i]);
 	seq_printf(m,
 		   ")"
 		   "\n  pagesets");
-- 
1.8.4.5



* [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
  2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
  2014-07-09  8:13 ` [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:08   ` Johannes Weiner
  2014-07-09  8:13 ` [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
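
For readers unfamiliar with the vmstat machinery, the win comes from batching:
each CPU accumulates a small signed delta privately and only folds it into the
shared zone counter when the delta crosses a threshold, so the shared cache
line is touched far less often than a plain shared counter would be. A
simplified single-zone userspace sketch of that pattern (illustration only; the
threshold and names are made up, the real code lives in mm/vmstat.c):

#include <stdatomic.h>
#include <stdio.h>

#define STAT_THRESHOLD 32       /* assumption: allowed per-CPU drift */

static atomic_long zone_pages_scanned;          /* shared zone counter */
static _Thread_local long cpu_delta;            /* private per-CPU delta */

/* In spirit: __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr) */
static void mod_pages_scanned(long nr)
{
        cpu_delta += nr;
        if (cpu_delta > STAT_THRESHOLD || cpu_delta < -STAT_THRESHOLD) {
                atomic_fetch_add(&zone_pages_scanned, cpu_delta);
                cpu_delta = 0;
        }
}

/* In spirit: zone_page_state(); readers may lag by the un-folded deltas */
static long read_pages_scanned(void)
{
        long x = atomic_load(&zone_pages_scanned);

        return x < 0 ? 0 : x;   /* SMP builds clamp drift-induced negatives */
}

int main(void)
{
        for (int i = 0; i < 1000; i++)
                mod_pages_scanned(1);   /* shared line touched far fewer than 1000 times */
        printf("pages scanned (approx): %ld\n", read_pages_scanned());
        return 0;
}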

On a small UMA machine running tiobench the difference is marginal. On a
4-node machine the overhead is more noticeable. Note that automatic NUMA
balancing was disabled for this test as otherwise the system CPU overhead
is unpredictable.

          3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
             vanillarearrange-v5   vmstat-v5
User          746.94      759.78      774.56
System      65336.22    58350.98    32847.27
Elapsed     27553.52    27282.02    27415.04

Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 12 +++++++++---
 mm/vmscan.c            |  7 ++++---
 mm/vmstat.c            |  3 ++-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fbadc45..c0ee2ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,6 +143,7 @@ enum zone_stat_item {
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
@@ -480,7 +481,6 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
-	unsigned long		pages_scanned;	   /* since last reclaim */
 	struct lruvec		lruvec;
 
 	/* Evictions & activations on the inactive file list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c607acd..aa46f00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -680,9 +680,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
+	unsigned long nr_scanned;
 
 	spin_lock(&zone->lock);
-	zone->pages_scanned = 0;
+	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	if (nr_scanned)
+		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
 
 	while (to_free) {
 		struct page *page;
@@ -731,8 +734,11 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
+	unsigned long nr_scanned;
 	spin_lock(&zone->lock);
-	zone->pages_scanned = 0;
+	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	if (nr_scanned)
+		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
 
 	__free_one_page(page, pfn, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
@@ -3240,7 +3246,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
-			zone->pages_scanned,
+			K(zone_page_state(zone, NR_PAGES_SCANNED)),
 			(!zone_reclaimable(zone) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..761c628 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,7 +169,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
 
 bool zone_reclaimable(struct zone *zone)
 {
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+	return zone_page_state(zone, NR_PAGES_SCANNED) <
+		zone_reclaimable_pages(zone) * 6;
 }
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1503,7 +1504,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
 
 	if (global_reclaim(sc)) {
-		zone->pages_scanned += nr_scanned;
+		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
 		else
@@ -1693,7 +1694,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 	if (global_reclaim(sc))
-		zone->pages_scanned += nr_scanned;
+		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8267f77..e574e883 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -763,6 +763,7 @@ const char * const vmstat_text[] = {
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
+	"nr_pages_scanned",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",
@@ -1067,7 +1068,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   zone->pages_scanned,
+		   zone_page_state(zone, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
-- 
1.8.4.5



* [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
                   ` (2 preceding siblings ...)
  2014-07-09  8:13 ` [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:09   ` Johannes Weiner
  2014-07-09  8:13 ` [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered Mel Gorman
  2014-07-09  8:13 ` [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
  5 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

When kswapd is awake reclaiming, the per-cpu stat thresholds are
lowered to get more accurate counts to avoid breaching watermarks.  This
threshold update iterates over all possible CPUs which is unnecessary.
Only online CPUs need to be updated. If a new CPU is onlined,
refresh_zone_stat_thresholds() will set the thresholds correctly.
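
To make the cost concrete, consider a machine whose firmware advertises far
more possible CPUs than are ever brought online. A toy sketch of the
difference (illustration only; the CPU counts are invented and online CPUs are
assumed to be numbered contiguously for simplicity):

#include <stdio.h>

#define NR_POSSIBLE_CPUS 256    /* assumption: firmware-advertised maximum */
#define NR_ONLINE_CPUS     8    /* assumption: CPUs actually online */

static int stat_threshold[NR_POSSIBLE_CPUS];

/* Old behaviour: every possible CPU's threshold is rewritten, including
 * CPUs that will never run and so can never hold a stale value. */
static void set_thresholds_possible(int threshold)
{
        for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
                stat_threshold[cpu] = threshold;
}

/* New behaviour: only online CPUs are touched; a CPU onlined later gets
 * a correct threshold from refresh_zone_stat_thresholds(). */
static void set_thresholds_online(int threshold)
{
        for (int cpu = 0; cpu < NR_ONLINE_CPUS; cpu++)
                stat_threshold[cpu] = threshold;
}

int main(void)
{
        set_thresholds_possible(125);   /* 256 stores */
        set_thresholds_online(125);     /* 8 stores */
        printf("threshold[0] = %d\n", stat_threshold[0]);
        return 0;
}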

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index e574e883..e9ab104 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -200,7 +200,7 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_possible_cpu(cpu)
+		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
 	}
-- 
1.8.4.5



* [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
                   ` (3 preceding siblings ...)
  2014-07-09  8:13 ` [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:14   ` Johannes Weiner
  2014-07-09  8:13 ` [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
  5 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The purpose of numa_zonelist_order=zone is to preserve lower zones
for use with 32-bit devices. If locality is preferred then the
numa_zonelist_order=node policy should be used. Unfortunately, the fair
zone allocation policy overrides this by skipping zones on remote nodes
until the lower one is found. While this makes sense from a page aging
and performance perspective, it breaks the expected zonelist policy. This
patch restores the expected behaviour for zonelist ordering.
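
The behavioural difference comes down to how the fair pass treats the first
remote zone it meets in a zone-ordered zonelist: 'continue' skips past it and
can land on a lower local zone that the ordering was meant to preserve, while
'break' ends the fair pass so that the follow-up scan in plain zonelist order
decides instead. A heavily simplified sketch of that control flow (illustration
only, not the real get_page_from_freelist()):

#include <stddef.h>
#include <stdio.h>

struct zone_ref { int nid; long alloc_batch; };

/* crude stand-in for zone_local(): same node as the preferred zone */
static int zone_local(const struct zone_ref *preferred, const struct zone_ref *z)
{
        return z->nid == preferred->nid;
}

/* Fair pass over the zonelist; returns the chosen index or -1 so the
 * caller can retry in plain zonelist order without ALLOC_FAIR. */
static int fair_pass(const struct zone_ref *zl, size_t n,
                     const struct zone_ref *preferred)
{
        for (size_t i = 0; i < n; i++) {
                if (!zone_local(preferred, &zl[i]))
                        break;          /* was: continue */
                if (zl[i].alloc_batch <= 0)
                        continue;       /* batch depleted, try next local zone */
                return (int)i;
        }
        return -1;
}

int main(void)
{
        const struct zone_ref zl[] = {
                { .nid = 0, .alloc_batch = 0 }, /* local Normal, batch spent */
                { .nid = 1, .alloc_batch = 8 }, /* remote Normal */
                { .nid = 0, .alloc_batch = 4 }, /* local DMA32 (lower zone) */
        };

        /* With 'break' this prints -1 and the zonelist-ordered retry will
         * pick the remote Normal zone, leaving local DMA32 untouched; with
         * 'continue' the fair pass would return 2 and eat into lowmem. */
        printf("fair pass chose index %d\n", fair_pass(zl, 3, &zl[0]));
        return 0;
}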

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa46f00..0bf384a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1957,7 +1957,7 @@ zonelist_scan:
 		 */
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
-				continue;
+				break;
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 		}
-- 
1.8.4.5



* [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
                   ` (4 preceding siblings ...)
  2014-07-09  8:13 ` [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered Mel Gorman
@ 2014-07-09  8:13 ` Mel Gorman
  2014-07-10 12:18   ` Johannes Weiner
  2014-08-08 15:27   ` Vlastimil Babka
  5 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2014-07-09  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner, Mel Gorman

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does not
hold when zones reach their low watermark as the batches will be consumed
at uneven rates.  Allocation failures due to watermark depletion result in
additional zonelist scans for the reset and another watermark check before
hitting the slowpath.
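
For context, a toy model of the batch mechanism itself (illustration only; the
zone names and batch sizes are made up): each local zone gets a budget derived
from its watermarks, allocations drain the budgets from the top of the zonelist
down, and only when every local budget is spent are they refilled and the list
rescanned once, which is roughly the shape this patch gives the fast path:

#include <stdio.h>

struct toy_zone {
        const char *name;
        long alloc_batch;       /* NR_ALLOC_BATCH stand-in */
        long batch_size;        /* high_wmark - low_wmark stand-in */
};

static void reset_alloc_batches(struct toy_zone *zones, int n)
{
        for (int i = 0; i < n; i++)
                zones[i].alloc_batch = zones[i].batch_size;
}

/* First pass honours the batches; only if every zone was skipped because
 * its batch was depleted are the batches refilled and the list rescanned. */
static struct toy_zone *alloc_fair(struct toy_zone *zones, int n)
{
        for (int pass = 0; pass < 2; pass++) {
                int nr_fair_skipped = 0;

                for (int i = 0; i < n; i++) {
                        if (zones[i].alloc_batch <= 0) {
                                nr_fair_skipped++;
                                continue;
                        }
                        zones[i].alloc_batch--;
                        return &zones[i];
                }
                if (!nr_fair_skipped)
                        break;
                reset_alloc_batches(zones, n);  /* one reset, one rescan */
        }
        return NULL;            /* slowpath territory in the real code */
}

int main(void)
{
        struct toy_zone zones[] = { { "Normal", 4, 4 }, { "DMA32", 2, 2 } };

        for (int i = 0; i < 8; i++) {
                struct toy_zone *z = alloc_fair(zones, 2);

                printf("allocation %d from %s\n", i, z ? z->name : "slowpath");
        }
        return 0;
}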

On UMA, the benefit is negligible -- around 0.25%. On a 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes. The system CPU overhead comparison looks like

          3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
             vanilla   vmstat-v5 lowercost-v5
User          746.94      774.56      802.00
System      65336.22    32847.27    40852.33
Elapsed     27553.52    27415.04    27368.46

However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   6 +++
 mm/page_alloc.c        | 101 ++++++++++++++++++++++++++-----------------------
 2 files changed, 59 insertions(+), 48 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c0ee2ec..08a223b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,6 +534,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -571,6 +572,11 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0bf384a..c81087d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1604,6 +1604,9 @@ again:
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
+	    !zone_is_fair_depleted(zone))
+		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1915,6 +1918,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1932,8 +1947,12 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
+
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1958,8 +1977,10 @@ zonelist_scan:
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
 				break;
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
+			}
 		}
 		/*
 		 * When allocating a page cache page for writing, we
@@ -2065,13 +2086,7 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2080,8 +2095,37 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	/*
+	 * The first pass makes sure allocations are spread fairly within the
+	 * local node.  However, the local node might have free pages left
+	 * after the fairness batches are exhausted, and remote zones haven't
+	 * even been considered yet.  Try once more without fairness, and
+	 * include remote zones now, before entering the slowpath and waking
+	 * kswapd: prefer spilling to a remote zone over swapping locally.
+	 */
+	if (alloc_flags & ALLOC_FAIR) {
+		alloc_flags &= ~ALLOC_FAIR;
+		if (nr_fair_skipped) {
+			zonelist_rescan = true;
+			reset_alloc_batches(preferred_zone);
+		}
+		if (nr_online_nodes > 1)
+			zonelist_rescan = true;
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2402,28 +2446,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2759,29 +2781,12 @@ retry_cpuset:
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.
-- 
1.8.4.5



* Re: [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated
  2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
@ 2014-07-10 12:01   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:03AM +0100, Mel Gorman wrote:
> The LRU insertion and activate tracepoints take PFN as a parameter forcing
> the overhead to the caller.  Move the overhead to the tracepoint fast-assign
> method to ensure the cost is only incurred when the tracepoint is active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
  2014-07-09  8:13 ` [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
@ 2014-07-10 12:06   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:04AM +0100, Mel Gorman wrote:
> The arrangement of struct zone has changed over time and now it has reached the
> point where there is some inappropriate sharing going on. On x86-64 for example
> 
> o The zone->node field is shared with the zone lock and zone->node is accessed
>   frequently from the page allocator due to the fair zone allocation policy.
> o span_seqlock is almost never used by shares a line with free_area
> o Some zone statistics share a cache line with the LRU lock so reclaim-intensive
>   and allocator-intensive workloads can bounce the cache line on a stat update
> 
> This patch rearranges struct zone to put read-only and read-mostly fields
> together and then splits the page allocator intensive fields, the zone
> statistics and the page reclaim intensive fields into their own cache
> lines. Note that the type of lowmem_reserve changes due to the watermark
> calculations being signed and avoiding a signed/unsigned conversion there.
> 
> On the test configuration I used the overall size of struct zone shrunk
> by one cache line. On smaller machines, this is not likely to be noticable.
> However, on a 4-node NUMA machine running tiobench the system CPU overhead
> is reduced by this patch.
> 
>           3.16.0-rc3  3.16.0-rc3
>              vanillarearrange-v5r9
> User          746.94      759.78
> System      65336.22    58350.98
> Elapsed     27553.52    27282.02
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter
  2014-07-09  8:13 ` [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter Mel Gorman
@ 2014-07-10 12:08   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:05AM +0100, Mel Gorman wrote:
> zone->pages_scanned is a write-intensive cache line during page reclaim
> and it's also updated during page free. Move the counter into vmstat to
> take advantage of the per-cpu updates and do not update it in the free
> paths unless necessary.
> 
> On a small UMA machine running tiobench the difference is marginal. On a
> 4-node machine the overhead is more noticable. Note that automatic NUMA
> balancing was disabled for this test as otherwise the system CPU overhead
> is unpredictable.
> 
>           3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
>              vanillarearrange-v5   vmstat-v5
> User          746.94      759.78      774.56
> System      65336.22    58350.98    32847.27
> Elapsed     27553.52    27282.02    27415.04
> 
> Note that the overhead reduction will vary depending on where exactly
> pages are allocated and freed.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU
  2014-07-09  8:13 ` [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU Mel Gorman
@ 2014-07-10 12:09   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:06AM +0100, Mel Gorman wrote:
> When kswapd is awake reclaiming, the per-cpu stat thresholds are
> lowered to get more accurate counts to avoid breaching watermarks.  This
> threshold update iterates over all possible CPUs which is unnecessary.
> Only online CPUs need to be updated. If a new CPU is onlined,
> refresh_zone_stat_thresholds() will set the thresholds correctly.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered
  2014-07-09  8:13 ` [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered Mel Gorman
@ 2014-07-10 12:14   ` Johannes Weiner
  2014-07-10 12:44     ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:07AM +0100, Mel Gorman wrote:
> The purpose of numa_zonelist_order=zone is to preserve lower zones
> for use with 32-bit devices. If locality is preferred then the
> numa_zonelist_order=node policy should be used. Unfortunately, the fair
> zone allocation policy overrides this by skipping zones on remote nodes
> until the lower one is found. While this makes sense from a page aging
> and performance perspective, it breaks the expected zonelist policy. This
> patch restores the expected behaviour for zone-list ordering.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

32-bit NUMA? :-) Anyway, this change also cuts down the fair pass
overhead on bigger NUMA machines, so I'm all for it.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-07-09  8:13 ` [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
@ 2014-07-10 12:18   ` Johannes Weiner
  2014-08-08 15:27   ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-07-10 12:18 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, Jul 09, 2014 at 09:13:08AM +0100, Mel Gorman wrote:
> The fair zone allocation policy round-robins allocations between zones
> within a node to avoid age inversion problems during reclaim. If the
> first allocation fails, the batch counts is reset and a second attempt
> made before entering the slow path.
> 
> One assumption made with this scheme is that batches expire at roughly the
> same time and the resets each time are justified. This assumption does not
> hold when zones reach their low watermark as the batches will be consumed
> at uneven rates.  Allocation failure due to watermark depletion result in
> additional zonelist scans for the reset and another watermark check before
> hitting the slowpath.
> 
> On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA
> machine it's variable due to the variability of measuring overhead with
> the vmstat changes. The system CPU overhead comparison looks like
> 
>           3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
>              vanilla   vmstat-v5 lowercost-v5
> User          746.94      774.56      802.00
> System      65336.22    32847.27    40852.33
> Elapsed     27553.52    27415.04    27368.46
> 
> However it is worth noting that the overall benchmark still completed
> faster and intuitively it makes sense to take as few passes as possible
> through the zonelists.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remote nodes are encountered
  2014-07-10 12:14   ` Johannes Weiner
@ 2014-07-10 12:44     ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2014-07-10 12:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Thu, Jul 10, 2014 at 08:14:19AM -0400, Johannes Weiner wrote:
> On Wed, Jul 09, 2014 at 09:13:07AM +0100, Mel Gorman wrote:
> > The purpose of numa_zonelist_order=zone is to preserve lower zones
> > for use with 32-bit devices. If locality is preferred then the
> > numa_zonelist_order=node policy should be used. Unfortunately, the fair
> > zone allocation policy overrides this by skipping zones on remote nodes
> > until the lower one is found. While this makes sense from a page aging
> > and performance perspective, it breaks the expected zonelist policy. This
> > patch restores the expected behaviour for zone-list ordering.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> 32-bit NUMA? :-)

I'm tempted to just say "it can go on fire" but realistically speaking
they should be configured to use node ordering. I was very tempted to
always force node ordering but I didn't have good data on how often lowmem
allocations are required on NUMA machines.

> Anyway, this change also cuts down the fair pass
> overhead on bigger NUMA machines, so I'm all for it.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for the reviews!

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-07-09  8:13 ` [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
  2014-07-10 12:18   ` Johannes Weiner
@ 2014-08-08 15:27   ` Vlastimil Babka
  2014-08-11 12:12     ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2014-08-08 15:27 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On 07/09/2014 10:13 AM, Mel Gorman wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1604,6 +1604,9 @@ again:
>  	}
>  
>  	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));

This can underflow zero, right?

> +	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&

AFAICS, zone_page_state will correct negative values to zero only for
CONFIG_SMP. Won't this check be broken on !CONFIG_SMP?

I just stumbled upon this when trying to optimize the function. I didn't check
how the rest of the design copes with negative NR_ALLOC_BATCH values.

> +	    !zone_is_fair_depleted(zone))
> +		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
>  
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone, gfp_flags);
> @@ -1915,6 +1918,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  
>  #endif	/* CONFIG_NUMA */
>  



* Re: [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-08-08 15:27   ` Vlastimil Babka
@ 2014-08-11 12:12     ` Mel Gorman
  2014-08-11 12:34       ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-08-11 12:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On Fri, Aug 08, 2014 at 05:27:15PM +0200, Vlastimil Babka wrote:
> On 07/09/2014 10:13 AM, Mel Gorman wrote:
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1604,6 +1604,9 @@ again:
> >  	}
> >  
> >  	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> 
> This can underflow zero, right?
> 

Yes, because of per-cpu accounting drift.

> > +	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
> 
> AFAICS, zone_page_state will correct negative values to zero only for
> CONFIG_SMP. Won't this check be broken on !CONFIG_SMP?
> 

On !CONFIG_SMP how can there be per-cpu accounting drift that would make
that counter negative?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-08-11 12:12     ` Mel Gorman
@ 2014-08-11 12:34       ` Vlastimil Babka
  2014-09-02 14:01         ` Johannes Weiner
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2014-08-11 12:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel, Johannes Weiner

On 08/11/2014 02:12 PM, Mel Gorman wrote:
> On Fri, Aug 08, 2014 at 05:27:15PM +0200, Vlastimil Babka wrote:
>> On 07/09/2014 10:13 AM, Mel Gorman wrote:
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1604,6 +1604,9 @@ again:
>>>   	}
>>>
>>>   	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
>>
>> This can underflow zero, right?
>>
>
> Yes, because of per-cpu accounting drift.

I meant mainly because of order > 0.

>>> +	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
>>
>> AFAICS, zone_page_state will correct negative values to zero only for
>> CONFIG_SMP. Won't this check be broken on !CONFIG_SMP?
>>
>
> On !CONFIG_SMP how can there be per-cpu accounting drift that would make
> that counter negative?

Well original code used "if (zone_page_state(zone, NR_ALLOC_BATCH) <= 
0)" elsewhere, that you are replacing with zone_is_fair_depleted check. 
I assumed it's because it can get negative due to order > 0. I might 
have not looked thoroughly enough but it seems to me there's nothing 
that would prevent it, such as skipping a zone because its remaining 
batch is lower than 1 << order.
So I think the check should be "<= 0" to be safe.

Vlastimil




* Re: [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy
  2014-08-11 12:34       ` Vlastimil Babka
@ 2014-09-02 14:01         ` Johannes Weiner
  2014-09-05 10:14           ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2014-09-02 14:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Linux Kernel, Linux-MM, Linux-FSDevel

On Mon, Aug 11, 2014 at 02:34:05PM +0200, Vlastimil Babka wrote:
> On 08/11/2014 02:12 PM, Mel Gorman wrote:
> >On Fri, Aug 08, 2014 at 05:27:15PM +0200, Vlastimil Babka wrote:
> >>On 07/09/2014 10:13 AM, Mel Gorman wrote:
> >>>--- a/mm/page_alloc.c
> >>>+++ b/mm/page_alloc.c
> >>>@@ -1604,6 +1604,9 @@ again:
> >>>  	}
> >>>
> >>>  	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> >>
> >>This can underflow zero, right?
> >>
> >
> >Yes, because of per-cpu accounting drift.
> 
> I meant mainly because of order > 0.
> 
> >>>+	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
> >>
> >>AFAICS, zone_page_state will correct negative values to zero only for
> >>CONFIG_SMP. Won't this check be broken on !CONFIG_SMP?
> >>
> >
> >On !CONFIG_SMP how can there be per-cpu accounting drift that would make
> >that counter negative?
> 
> Well original code used "if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)"
> elsewhere, that you are replacing with zone_is_fair_depleted check. I
> assumed it's because it can get negative due to order > 0. I might have not
> looked thoroughly enough but it seems to me there's nothing that would
> prevent it, such as skipping a zone because its remaining batch is lower
> than 1 << order.
> So I think the check should be "<= 0" to be safe.

Any updates on this?

The counter can definitely underflow on !CONFIG_SMP, and then the flag
gets out of sync with the actual batch state.  I'd still prefer just
removing this flag again; it's extra complexity and error prone (case
in point) while the upsides are not even measurable in real life.

---

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 318df7051850..0bd77f730b38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -534,7 +534,6 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -572,11 +571,6 @@ static inline int zone_is_reclaim_locked(const struct zone *zone)
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
-static inline int zone_is_fair_depleted(const struct zone *zone)
-{
-	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-}
-
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18cee0d4c8a2..d913809a328f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1612,9 +1612,6 @@ again:
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
-	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
-	    !zone_is_fair_depleted(zone))
-		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1934,7 +1931,6 @@ static void reset_alloc_batches(struct zone *preferred_zone)
 		mod_zone_page_state(zone, NR_ALLOC_BATCH,
 			high_wmark_pages(zone) - low_wmark_pages(zone) -
 			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
 	} while (zone++ != preferred_zone);
 }
 
@@ -1985,7 +1981,7 @@ zonelist_scan:
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
 				break;
-			if (zone_is_fair_depleted(zone)) {
+			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) {
 				nr_fair_skipped++;
 				continue;
 			}


* [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP
  2014-09-02 14:01         ` Johannes Weiner
@ 2014-09-05 10:14           ` Mel Gorman
  2014-09-07  6:32             ` Leon Romanovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-09-05 10:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Linux Kernel, Linux-MM, Linux-FSDevel

Commit 4ffeaf35 (mm: page_alloc: reduce cost of the fair zone allocation policy)
broke the fair zone allocation policy on UP with these hunks.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e5e8f7..fb99081 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1612,6 +1612,9 @@ again:
        }

        __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+       if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
+           !zone_is_fair_depleted(zone))
+               zone_set_flag(zone, ZONE_FAIR_DEPLETED);

        __count_zone_vm_events(PGALLOC, zone, 1 << order);
        zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1966,8 +1985,10 @@ zonelist_scan:
                if (alloc_flags & ALLOC_FAIR) {
                        if (!zone_local(preferred_zone, zone))
                                break;
-                       if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+                       if (zone_is_fair_depleted(zone)) {
+                               nr_fair_skipped++;
                                continue;
+                       }
                }

The problem is that a <= check was replaced with a ==. On SMP it doesn't
matter because zone_page_state() clamps negative values, which arise from
per-CPU drift, to zero, but no such clamping happens in the UP case. Vlastimil
Babka correctly pointed out that the counter can go negative due to high-order
allocations, so the == 0 test can be missed entirely. This patch fixes the
problem.
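
A concrete illustration of the failure mode (the numbers are made up): with 3
pages left in the batch, an order-2 allocation takes the UP counter straight
from 3 to -1, so the == 0 test never fires and the zone is never marked
depleted:

#include <stdio.h>

int main(void)
{
        long nr_alloc_batch = 3;        /* assumed remaining batch */
        int order = 2;                  /* order-2 allocation = 4 pages */

        /* __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)) */
        nr_alloc_batch -= 1L << order;

        /* On UP, zone_page_state() returns the raw, possibly negative, value */
        printf("batch=%ld, '== 0' fires: %d, '<= 0' fires: %d\n",
               nr_alloc_batch, nr_alloc_batch == 0, nr_alloc_batch <= 0);
        return 0;
}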

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18cee0d..cd4c05c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1612,7 +1612,7 @@ again:
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
-	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0 &&
 	    !zone_is_fair_depleted(zone))
 		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP
  2014-09-05 10:14           ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP Mel Gorman
@ 2014-09-07  6:32             ` Leon Romanovsky
  2014-09-08 11:57               ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2 Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Romanovsky @ 2014-09-07  6:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Johannes Weiner, Linux Kernel,
	Linux-MM, Linux-FSDevel

Hi Mel,
>         __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> -       if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
> +       if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0 &&
zone_page_state is declared to return an unsigned long value [1], so it
should never be below 0.
So the interesting question is: what will zone_page_state return for a
negative atomic_long_read(&zone->vm_stat[item])?

130 static inline unsigned long zone_page_state(struct zone *zone,
131                                         enum zone_stat_item item)
132 {
133         long x = atomic_long_read(&zone->vm_stat[item]);
134 #ifdef CONFIG_SMP
135         if (x < 0)
136                 x = 0;
137 #endif
138         return x;
139 }

[1] https://git.kernel.org/cgit/linux/kernel/git/mhocko/mm.git/tree/include/linux/vmstat.h#n130
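
(For reference, the conversion itself is well defined in C: the negative long
wraps modulo ULONG_MAX + 1 when it is returned through the unsigned long
prototype. A tiny userspace illustration, assuming 64-bit unsigned long:)

#include <limits.h>
#include <stdio.h>

static unsigned long ret_unsigned(long x)
{
	return x;	/* same implicit conversion as the UP zone_page_state() return */
}

int main(void)
{
	printf("%d\n", ret_unsigned(-1) == ULONG_MAX);	/* 1 */
	printf("%lu\n", ret_unsigned(-4));		/* 18446744073709551612 */
	return 0;
}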

-- 
Leon Romanovsky | Independent Linux Consultant
        www.leon.nu | leon@leon.nu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2
  2014-09-07  6:32             ` Leon Romanovsky
@ 2014-09-08 11:57               ` Mel Gorman
  2014-09-09 19:53                 ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-09-08 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Leon Romanovsky, Vlastimil Babka, Johannes Weiner, Linux Kernel,
	Linux-MM, Linux-FSDevel

Commit 4ffeaf35 (mm: page_alloc: reduce cost of the fair zone allocation
policy) arguably broke the fair zone allocation policy on UP with these
hunks.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1612,6 +1612,9 @@ again:
       	}

       	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+       if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
+           !zone_is_fair_depleted(zone))
+               zone_set_flag(zone, ZONE_FAIR_DEPLETED);

       	__count_zone_vm_events(PGALLOC, zone, 1 << order);
       	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1966,8 +1985,10 @@ zonelist_scan:
               	if (alloc_flags & ALLOC_FAIR) {
                       	if (!zone_local(preferred_zone, zone))
                               	break;
-                       if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+                       if (zone_is_fair_depleted(zone)) {
+                               nr_fair_skipped++;
                               	continue;
+                       }
               	}

A <= check was replaced with a == check. On SMP it does not matter because
zone_page_state() returns negative values as zero (a clamp that exists to
hide per-CPU drift); per-CPU drift is not possible in the UP case, so the
clamp is compiled out there. Vlastimil Babka correctly pointed out that the
counter can go negative due to high-order allocations.

However, Leon Romanovsky pointed out that a <= check on zone_page_state was
never correct because zone_page_state returns unsigned long, so the root
cause of the breakage was the <= check in the first place.
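
As a userspace illustration of that point (not part of the patch): with an
unsigned value, <= 0 can only ever match the exact-zero case, so the old
check behaved like == 0 all along.

#include <stdio.h>

int main(void)
{
	unsigned long batch = (unsigned long)-4L;	/* "negative" batch as seen by the caller */

	printf("%d %d\n", batch <= 0, batch == 0);	/* 0 0 */
	batch = 0;
	printf("%d %d\n", batch <= 0, batch == 0);	/* 1 1 */
	return 0;
}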

zone_page_state is an API hazard because the difference in behaviour
between SMP and UP is very surprising. There is a good reason to allow
NR_ALLOC_BATCH to go negative -- when the counter is reset, the negative
value takes recent activity into account. This patch makes zone_page_state
behave the same on SMP and UP, as saving one branch on UP is not likely to
make a measurable performance difference.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/vmstat.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7..cece0f0 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -131,10 +131,8 @@ static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
 	long x = atomic_long_read(&zone->vm_stat[item]);
-#ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
-#endif
 	return x;
 }
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2
  2014-09-08 11:57               ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2 Mel Gorman
@ 2014-09-09 19:53                 ` Andrew Morton
  2014-09-10  9:16                   ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2014-09-09 19:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Leon Romanovsky, Vlastimil Babka, Johannes Weiner, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Mon, 8 Sep 2014 12:57:18 +0100 Mel Gorman <mgorman@suse.de> wrote:

> zone_page_state is an API hazard because the difference in behaviour
> between SMP and UP is very surprising. There is a good reason to allow
> NR_ALLOC_BATCH to go negative -- when the counter is reset, the negative
> value takes recent activity into account. This patch makes zone_page_state
> behave the same on SMP and UP, as saving one branch on UP is not likely to
> make a measurable performance difference.
> 
> ...
>
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -131,10 +131,8 @@ static inline unsigned long zone_page_state(struct zone *zone,
>  					enum zone_stat_item item)
>  {
>  	long x = atomic_long_read(&zone->vm_stat[item]);
> -#ifdef CONFIG_SMP
>  	if (x < 0)
>  		x = 0;
> -#endif
>  	return x;
>  }

We now have three fixes for the same thing.  I'm presently holding on
to hannes's mm-page_alloc-fix-zone-allocation-fairness-on-up.patch.

Regularizing zone_page_state() in this fashion seems a good idea and is
presumably safe because callers have been tested with SMP.  So unless
shouted at I think I'll queue this one for 3.18?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2
  2014-09-09 19:53                 ` Andrew Morton
@ 2014-09-10  9:16                   ` Mel Gorman
  2014-09-10 20:32                     ` Johannes Weiner
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-09-10  9:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Leon Romanovsky, Vlastimil Babka, Johannes Weiner, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, Sep 09, 2014 at 12:53:18PM -0700, Andrew Morton wrote:
> On Mon, 8 Sep 2014 12:57:18 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > zone_page_state is an API hazard because the difference in behaviour
> > between SMP and UP is very surprising. There is a good reason to allow
> > NR_ALLOC_BATCH to go negative -- when the counter is reset, the negative
> > value takes recent activity into account. This patch makes zone_page_state
> > behave the same on SMP and UP, as saving one branch on UP is not likely to
> > make a measurable performance difference.
> > 
> > ...
> >
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -131,10 +131,8 @@ static inline unsigned long zone_page_state(struct zone *zone,
> >  					enum zone_stat_item item)
> >  {
> >  	long x = atomic_long_read(&zone->vm_stat[item]);
> > -#ifdef CONFIG_SMP
> >  	if (x < 0)
> >  		x = 0;
> > -#endif
> >  	return x;
> >  }
> 
> We now have three fixes for the same thing. 

This might be holding a record for most patches for what should have
been a trivial issue :P

> I'm presently holding on
> to hannes's mm-page_alloc-fix-zone-allocation-fairness-on-up.patch.
> 

This is my preferred fix because it clearly points to the source of the
original problem. Furthermore, the second hunk really should be reading
the unsigned counter value. It's an inconsequential corner case, but it's
still more correct, although it's a pity that it's also a layering violation.
However, adding a new API to return the raw value on UP and SMP is likely
to be interpreted as unwelcome indirection.

> Regularizing zone_page_state() in this fashion seems a good idea and is
> presumably safe because callers have been tested with SMP.  So unless
> shouted at I think I'll queue this one for 3.18?

Both are OK, but if we really want to regularise the API then all readers
should be brought in line and it should be declared an API cleanup. That
looks like the following:

---8<---
From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] mm: vmstat: regularize UP and SMP behavior

zone_page_state and friends are an API hazard because the difference in
behaviour between SMP and UP is very surprising.  There is a good reason
to allow NR_ALLOC_BATCH to go negative -- when the counter is reset, the
negative value takes recent activity into account. The NR_ALLOC_BATCH
callers that matter access the raw counter, but the API hazard is a lesson.

This patch makes zone_page_state, global_page_state and
zone_page_state_snapshot return the same values on SMP and UP, as saving
the branches on UP is unlikely to make a measurable performance difference.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leon Romanovsky <leon@leon.nu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vmstat.h | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7..873104e 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -120,10 +120,8 @@ static inline void zone_page_state_add(long x, struct zone *zone,
 static inline unsigned long global_page_state(enum zone_stat_item item)
 {
 	long x = atomic_long_read(&vm_stat[item]);
-#ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
-#endif
 	return x;
 }
 
@@ -131,10 +129,8 @@ static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
 	long x = atomic_long_read(&zone->vm_stat[item]);
-#ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
-#endif
 	return x;
 }
 
@@ -153,10 +149,10 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 	int cpu;
 	for_each_online_cpu(cpu)
 		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
-
+#endif
 	if (x < 0)
 		x = 0;
-#endif
+
 	return x;
 }
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2
  2014-09-10  9:16                   ` Mel Gorman
@ 2014-09-10 20:32                     ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2014-09-10 20:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Leon Romanovsky, Vlastimil Babka, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Wed, Sep 10, 2014 at 10:16:03AM +0100, Mel Gorman wrote:
> On Tue, Sep 09, 2014 at 12:53:18PM -0700, Andrew Morton wrote:
> > On Mon, 8 Sep 2014 12:57:18 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > zone_page_state is an API hazard because the difference in behaviour
> > > between SMP and UP is very surprising. There is a good reason to allow
> > > NR_ALLOC_BATCH to go negative -- when the counter is reset, the negative
> > > value takes recent activity into account. This patch makes zone_page_state
> > > behave the same on SMP and UP, as saving one branch on UP is not likely to
> > > make a measurable performance difference.
> > > 
> > > ...
> > >
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -131,10 +131,8 @@ static inline unsigned long zone_page_state(struct zone *zone,
> > >  					enum zone_stat_item item)
> > >  {
> > >  	long x = atomic_long_read(&zone->vm_stat[item]);
> > > -#ifdef CONFIG_SMP
> > >  	if (x < 0)
> > >  		x = 0;
> > > -#endif
> > >  	return x;
> > >  }
> > 
> > We now have three fixes for the same thing. 
> 
> This might be holding a record for most patches for what should have
> been a trivial issue :P
> 
> > I'm presently holding on
> > to hannes's mm-page_alloc-fix-zone-allocation-fairness-on-up.patch.
> > 
> 
> This is my preferred fix because it clearly points to the source of the
> original problem. Furthermore, the second hunk really should be reading
> the unsigned counter value. It's an inconsequential corner case, but it's
> still more correct, although it's a pity that it's also a layering violation.
> However, adding a new API to return the raw value on UP and SMP is likely
> to be interpreted as unwelcome indirection.
> 
> > Regularizing zone_page_state() in this fashion seems a good idea and is
> > presumably safe because callers have been tested with SMP.  So unless
> > shouted at I think I'll queue this one for 3.18?
> 
> Both are OK, but if we really want to regularise the API then all readers
> should be brought in line and it should be declared an API cleanup. That
> looks like the following:
> 
> ---8<---
> From: Mel Gorman <mgorman@suse.de>
> Subject: [PATCH] mm: vmstat: regularize UP and SMP behavior
> 
> zone_page_state and friends are an API hazard because the difference in
> behaviour between SMP and UP is very surprising.  There is a good reason
> to allow NR_ALLOC_BATCH to go negative -- when the counter is reset, the
> negative value takes recent activity into account. The NR_ALLOC_BATCH
> callers that matter access the raw counter, but the API hazard is a lesson.
> 
> This patch makes zone_page_state, global_page_state and
> zone_page_state_snapshot return the same values on SMP and UP, as saving
> the branches on UP is unlikely to make a measurable performance difference.

The API still returns an unsigned long and so cannot really support
counters that can go logically negative -- as opposed to negative reads
that are a side-effect of concurrency and can be interpreted as zero.

The problem is that the fairness batches abuse the internals of what
is publicly an unsigned counter to implement an unlocked allocator
that needs to be able to go negative.  We could make that counter API
truly signed to support that, but I think this patch makes the code
actually more confusing and inconsistent.
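
For what it's worth, a truly signed reader could be as small as the sketch
below; the helper name is hypothetical and this is not proposed code, only an
illustration of that direction:

/*
 * Hypothetical signed accessor: NR_ALLOC_BATCH-style counters that may
 * legitimately go negative would be read through this, while
 * zone_page_state() keeps clamping for counters that are logically >= 0.
 */
static inline long zone_page_state_signed(struct zone *zone,
					  enum zone_stat_item item)
{
	return atomic_long_read(&zone->vm_stat[item]);
}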

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-09-10 20:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-09  8:13 [PATCH 0/5] Reduce sequential read overhead Mel Gorman
2014-07-09  8:13 ` [PATCH 1/6] mm: pagemap: Avoid unnecessary overhead when tracepoints are deactivated Mel Gorman
2014-07-10 12:01   ` Johannes Weiner
2014-07-09  8:13 ` [PATCH 2/6] mm: Rearrange zone fields into read-only, page alloc, statistics and page reclaim lines Mel Gorman
2014-07-10 12:06   ` Johannes Weiner
2014-07-09  8:13 ` [PATCH 3/6] mm: Move zone->pages_scanned into a vmstat counter Mel Gorman
2014-07-10 12:08   ` Johannes Weiner
2014-07-09  8:13 ` [PATCH 4/6] mm: vmscan: Only update per-cpu thresholds for online CPU Mel Gorman
2014-07-10 12:09   ` Johannes Weiner
2014-07-09  8:13 ` [PATCH 5/6] mm: page_alloc: Abort fair zone allocation policy when remotes nodes are encountered Mel Gorman
2014-07-10 12:14   ` Johannes Weiner
2014-07-10 12:44     ` Mel Gorman
2014-07-09  8:13 ` [PATCH 6/6] mm: page_alloc: Reduce cost of the fair zone allocation policy Mel Gorman
2014-07-10 12:18   ` Johannes Weiner
2014-08-08 15:27   ` Vlastimil Babka
2014-08-11 12:12     ` Mel Gorman
2014-08-11 12:34       ` Vlastimil Babka
2014-09-02 14:01         ` Johannes Weiner
2014-09-05 10:14           ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP Mel Gorman
2014-09-07  6:32             ` Leon Romanovsky
2014-09-08 11:57               ` [PATCH] mm: page_alloc: Fix setting of ZONE_FAIR_DEPLETED on UP v2 Mel Gorman
2014-09-09 19:53                 ` Andrew Morton
2014-09-10  9:16                   ` Mel Gorman
2014-09-10 20:32                     ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).