linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
@ 2017-11-28  6:00 Kemi Wang
  2017-11-28  6:00 ` [PATCH 2/2] mm: Rename zone_statistics() to numa_statistics() Kemi Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Kemi Wang @ 2017-11-28  6:00 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior
  Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel

The existing implementation of NUMA counters keeps per-logical-CPU
differentials along with zone->vm_numa_stat[] broken out by zone, plus a
global NUMA counter array vm_numa_stat[]. However, unlike the other vmstat
counters, NUMA stats do not affect any of the system's decisions and are
only read from /proc and /sys; that is a slow-path operation which can
tolerate higher overhead. Additionally, nodes usually have only a single
zone (node 0 being the exception), and there is no real use case that
needs these hit counts broken out by zone.

Therefore, we can migrate the implementation of NUMA stats from per-zone
to per-node and get rid of the global NUMA counters. It is good enough to
keep everything in a per-cpu pointer of type u64 and sum the values up
when needed, as suggested by Andi Kleen. That helps both code cleanup and
enhancement (e.g. it removes more than 130 lines of code).
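For reference, the whole mechanism then boils down to roughly the
following (a simplified sketch of what the patch below implements; the
helper names numa_stat_inc()/numa_stat_read() are only illustrative, the
patch itself uses __inc_numa_state() and node_numa_state_snapshot()):

	/* one u64 slot per (node, item) pair, replicated per cpu */
	u64 __percpu *vm_numa_stat;

	/* fast path: bump the local cpu's slot, no atomics, no threshold */
	static inline void numa_stat_inc(int node, enum numa_stat_item item)
	{
		u64 __percpu *p = vm_numa_stat +
				node * NR_VM_NUMA_STAT_ITEMS + item;

		__this_cpu_inc(*p);
	}

	/* slow path (/proc, /sys): sum the slot across all cpus */
	static unsigned long numa_stat_read(int node, enum numa_stat_item item)
	{
		unsigned long x = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			x += per_cpu_ptr(vm_numa_stat, cpu)
				[node * NR_VM_NUMA_STAT_ITEMS + item];

		return x;
	}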

With this patch, we see a 1.8% (335->329) drop in CPU cycles for single
page allocation and deallocation running concurrently with 112 threads,
tested on a 2-socket Skylake platform using Jesper's page_bench03
benchmark.

Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Also, it does not cause an obvious latency increase when reading /proc
and /sys on a 2-socket Skylake platform. Latency as shown by the time
command:
                           base             head
/proc/vmstat            sys 0m0.001s     sys 0m0.001s

/sys/devices/system/    sys 0m0.001s     sys 0m0.000s
node/node*/numastat

We need not worry about this much, as it is a slow path and will not be
read frequently.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 drivers/base/node.c    |  14 ++---
 include/linux/mmzone.h |   2 -
 include/linux/vmstat.h |  61 +++++++++---------
 mm/page_alloc.c        |   7 +++
 mm/vmstat.c            | 167 ++++---------------------------------------------
 5 files changed, 56 insertions(+), 195 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab..0be5fbd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,12 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       sum_zone_numa_state(dev->id, NUMA_HIT),
-		       sum_zone_numa_state(dev->id, NUMA_MISS),
-		       sum_zone_numa_state(dev->id, NUMA_FOREIGN),
-		       sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       sum_zone_numa_state(dev->id, NUMA_LOCAL),
-		       sum_zone_numa_state(dev->id, NUMA_OTHER));
+		       node_numa_state_snapshot(dev->id, NUMA_HIT),
+		       node_numa_state_snapshot(dev->id, NUMA_MISS),
+		       node_numa_state_snapshot(dev->id, NUMA_FOREIGN),
+		       node_numa_state_snapshot(dev->id, NUMA_INTERLEAVE_HIT),
+		       node_numa_state_snapshot(dev->id, NUMA_LOCAL),
+		       node_numa_state_snapshot(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -194,7 +194,7 @@ static ssize_t node_read_vmstat(struct device *dev,
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n",
 			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-			     sum_zone_numa_state(nid, i));
+			     node_numa_state_snapshot(nid, i));
 #endif
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..b2d264f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -283,7 +283,6 @@ struct per_cpu_pageset {
 	struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
 	s8 expire;
-	u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
 #endif
 #ifdef CONFIG_SMP
 	s8 stat_threshold;
@@ -504,7 +503,6 @@ struct zone {
 	ZONE_PADDING(_pad3_)
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
-	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1779c98..7383d66 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
-
-#ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
-				 enum numa_stat_item item)
-{
-	atomic_long_add(x, &zone->vm_numa_stat[item]);
-	atomic_long_add(x, &vm_numa_stat[item]);
-}
-
-static inline unsigned long global_numa_state(enum numa_stat_item item)
-{
-	long x = atomic_long_read(&vm_numa_stat[item]);
-
-	return x;
-}
-
-static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
-					enum numa_stat_item item)
-{
-	long x = atomic_long_read(&zone->vm_numa_stat[item]);
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
-
-	return x;
-}
-#endif /* CONFIG_NUMA */
+extern u64 __percpu *vm_numa_stat;
 
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
@@ -234,10 +206,39 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
 
 
 #ifdef CONFIG_NUMA
+static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
+					enum numa_stat_item item)
+{
+	return 0;
+}
+
+static inline unsigned long node_numa_state_snapshot(int node,
+					enum numa_stat_item item)
+{
+	unsigned long x = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		x += per_cpu_ptr(vm_numa_stat, cpu)[(node *
+				NR_VM_NUMA_STAT_ITEMS) + item];
+
+	return x;
+}
+
+static inline unsigned long global_numa_state(enum numa_stat_item item)
+{
+	int node;
+	unsigned long x = 0;
+
+	for_each_online_node(node)
+		x += node_numa_state_snapshot(node, item);
+
+	return x;
+}
+
 extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4096f4..142e1ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5564,6 +5564,7 @@ void __init setup_per_cpu_pageset(void)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
+	size_t size, align;
 
 	for_each_populated_zone(zone)
 		setup_zone_pageset(zone);
@@ -5571,6 +5572,12 @@ void __init setup_per_cpu_pageset(void)
 	for_each_online_pgdat(pgdat)
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
+
+#ifdef CONFIG_NUMA
+	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
+	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
+	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
+#endif
 }
 
 static __meminit void zone_pcp_init(struct zone *zone)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 40b2db6..bbabd96 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,48 +30,20 @@
 
 #include "internal.h"
 
-#define NUMA_STATS_THRESHOLD (U16_MAX - 2)
-
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
-/* zero numa counters within a zone */
-static void zero_zone_numa_counters(struct zone *zone)
+static void invalid_numa_statistics(void)
 {
-	int item, cpu;
+	int i, cpu;
 
-	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
-		atomic_long_set(&zone->vm_numa_stat[item], 0);
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
-						= 0;
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < num_possible_nodes() *
+				NR_VM_NUMA_STAT_ITEMS; i++)
+			per_cpu_ptr(vm_numa_stat, cpu)[i] = 0;
 	}
 }
 
-/* zero numa counters of all the populated zones */
-static void zero_zones_numa_counters(void)
-{
-	struct zone *zone;
-
-	for_each_populated_zone(zone)
-		zero_zone_numa_counters(zone);
-}
-
-/* zero global numa counters */
-static void zero_global_numa_counters(void)
-{
-	int item;
-
-	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
-		atomic_long_set(&vm_numa_stat[item], 0);
-}
-
-static void invalid_numa_statistics(void)
-{
-	zero_zones_numa_counters();
-	zero_global_numa_counters();
-}
-
 static DEFINE_MUTEX(vm_numa_stat_lock);
 
 int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
@@ -160,12 +132,12 @@ void vm_events_fold_cpu(int cpu)
  * vm_stat contains the global counters
  */
 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_zone_stat);
-EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+u64 __percpu *vm_numa_stat;
+EXPORT_SYMBOL(vm_numa_stat);
 #ifdef CONFIG_SMP
 
 int calculate_pressure_threshold(struct zone *zone)
@@ -679,32 +651,6 @@ EXPORT_SYMBOL(dec_node_page_state);
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
-{
-	int i;
-	int changes = 0;
-
-	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (zone_diff[i]) {
-			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-			changes++;
-	}
-
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (numa_diff[i]) {
-			atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
-			changes++;
-	}
-
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		if (node_diff[i]) {
-			atomic_long_add(node_diff[i], &vm_node_stat[i]);
-			changes++;
-	}
-	return changes;
-}
-#else
 static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
@@ -723,7 +669,6 @@ static int fold_diff(int *zone_diff, int *node_diff)
 	}
 	return changes;
 }
-#endif /* CONFIG_NUMA */
 
 /*
  * Update the zone counters for the current cpu.
@@ -747,9 +692,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
@@ -771,18 +713,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			}
 		}
 #ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-			int v;
-
-			v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
-			if (v) {
-
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-				__this_cpu_write(p->expire, 3);
-			}
-		}
-
 		if (do_pagesets) {
 			cond_resched();
 			/*
@@ -829,12 +759,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 	}
 
-#ifdef CONFIG_NUMA
-	changes += fold_diff(global_zone_diff, global_numa_diff,
-			     global_node_diff);
-#else
 	changes += fold_diff(global_zone_diff, global_node_diff);
-#endif
 	return changes;
 }
 
@@ -849,9 +774,6 @@ void cpu_vm_stats_fold(int cpu)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
@@ -868,18 +790,6 @@ void cpu_vm_stats_fold(int cpu)
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
-
-#ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-			if (p->vm_numa_stat_diff[i]) {
-				int v;
-
-				v = p->vm_numa_stat_diff[i];
-				p->vm_numa_stat_diff[i] = 0;
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-			}
-#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -898,11 +808,7 @@ void cpu_vm_stats_fold(int cpu)
 			}
 	}
 
-#ifdef CONFIG_NUMA
-	fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
-#else
 	fold_diff(global_zone_diff, global_node_diff);
-#endif
 }
 
 /*
@@ -920,17 +826,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			atomic_long_add(v, &zone->vm_stat[i]);
 			atomic_long_add(v, &vm_zone_stat[i]);
 		}
-
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (pset->vm_numa_stat_diff[i]) {
-			int v = pset->vm_numa_stat_diff[i];
-
-			pset->vm_numa_stat_diff[i] = 0;
-			atomic_long_add(v, &zone->vm_numa_stat[i]);
-			atomic_long_add(v, &vm_numa_stat[i]);
-		}
-#endif
 }
 #endif
 
@@ -938,16 +833,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 void __inc_numa_state(struct zone *zone,
 				 enum numa_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	u16 __percpu *p = pcp->vm_numa_stat_diff + item;
-	u16 v;
+	int offset = zone->node * NR_VM_NUMA_STAT_ITEMS + item;
+	u64 __percpu *p = vm_numa_stat + offset;
 
-	v = __this_cpu_inc_return(*p);
-
-	if (unlikely(v > NUMA_STATS_THRESHOLD)) {
-		zone_numa_state_add(v, zone, item);
-		__this_cpu_write(*p, 0);
-	}
+	__this_cpu_inc(*p);
 }
 
 /*
@@ -969,23 +858,6 @@ unsigned long sum_zone_node_page_state(int node,
 }
 
 /*
- * Determine the per node value of a numa stat item. To avoid deviation,
- * the per cpu stat number in vm_numa_stat_diff[] is also included.
- */
-unsigned long sum_zone_numa_state(int node,
-				 enum numa_stat_item item)
-{
-	struct zone *zones = NODE_DATA(node)->node_zones;
-	int i;
-	unsigned long count = 0;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		count += zone_numa_state_snapshot(zones + i, item);
-
-	return count;
-}
-
-/*
  * Determine the per node value of a stat item.
  */
 unsigned long node_page_state(struct pglist_data *pgdat,
@@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
 			err = -EINVAL;
 		}
 	}
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
-			pr_warn("%s: %s %ld\n",
-				__func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
-			err = -EINVAL;
-		}
-	}
-#endif
 	if (err)
 		return err;
 	if (write)
@@ -1862,9 +1724,6 @@ static bool need_update(int cpu)
 		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
 
 		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
-#ifdef CONFIG_NUMA
-		BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
-#endif
 
 		/*
 		 * The fast way of checking if there are any vmstat diffs.
@@ -1872,10 +1731,6 @@ static bool need_update(int cpu)
 		 */
 		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
 			return true;
-#ifdef CONFIG_NUMA
-		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS))
-			return true;
-#endif
 	}
 	return false;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] mm: Rename zone_statistics() to numa_statistics()
  2017-11-28  6:00 [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Kemi Wang
@ 2017-11-28  6:00 ` Kemi Wang
  2017-11-28  8:09 ` [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Vlastimil Babka
  2017-11-29 12:17 ` Michal Hocko
  2 siblings, 0 replies; 21+ messages in thread
From: Kemi Wang @ 2017-11-28  6:00 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior
  Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel

NUMA statistics have been separated from the zone statistics framework,
yet zone_statistics() still updates the NUMA counters, so the function
name is confusing. Rename it to numa_statistics() and update its call
sites accordingly.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 142e1ba..61fa717 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2783,7 +2783,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void numa_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
 	enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -2845,7 +2845,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	page = __rmqueue_pcplist(zone,  migratetype, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-		zone_statistics(preferred_zone, zone);
+		numa_statistics(preferred_zone, zone);
 	}
 	local_irq_restore(flags);
 	return page;
@@ -2893,7 +2893,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				  get_pcppage_migratetype(page));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone);
+	numa_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
 
 out:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28  6:00 [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Kemi Wang
  2017-11-28  6:00 ` [PATCH 2/2] mm: Rename zone_statistics() to numa_statistics() Kemi Wang
@ 2017-11-28  8:09 ` Vlastimil Babka
  2017-11-28  8:33   ` kemi
  2017-11-28 18:40   ` Andi Kleen
  2017-11-29 12:17 ` Michal Hocko
  2 siblings, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-11-28  8:09 UTC (permalink / raw)
  To: Kemi Wang, Greg Kroah-Hartman, Andrew Morton, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior
  Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel

On 11/28/2017 07:00 AM, Kemi Wang wrote:
> The existed implementation of NUMA counters is per logical CPU along with
> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
> effect system's decision and are only read from /proc and /sys, it is a
> slow path operation and likely tolerate higher overhead. Additionally,
> usually nodes only have a single zone, except for node 0. And there isn't
> really any use where you need these hits counts separated by zone.
> 
> Therefore, we can migrate the implementation of numa stats from per-zone to
> per-node, and get rid of these global numa counters. It's good enough to
> keep everything in a per cpu ptr of type u64, and sum them up when need, as
> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
> (e.g. save more than 130+ lines code).

OK.

> With this patch, we can see 1.8%(335->329) drop of CPU cycles for single
> page allocation and deallocation concurrently with 112 threads tested on a
> 2-sockets skylake platform using Jesper's page_bench03 benchmark.

To be fair, one can now avoid the overhead completely since 4518085e127d
("mm, sysctl: make NUMA stats configurable"). But if we can still
optimize it, sure.

> Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> Also, it does not cause obvious latency increase when read /proc and /sys
> on a 2-sockets skylake platform. Latency shown by time command:
>                            base             head
> /proc/vmstat            sys 0m0.001s     sys 0m0.001s
> 
> /sys/devices/system/    sys 0m0.001s     sys 0m0.000s
> node/node*/numastat

Well, here I have to point out that the coarse "time" command resolution
means a single read cannot be meaningfully compared. You would have to
e.g. time a loop with enough iterations (which would then be all
cache-hot, but better than nothing I guess).
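For instance, a trivial userspace sketch along these lines (iteration
count and output format are arbitrary, not part of the patch) would
already give more stable numbers:

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		char buf[1 << 16];
		struct timespec t0, t1;
		long i, iters = 100000;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++) {
			/* read the whole file, discard the contents */
			FILE *f = fopen("/proc/vmstat", "r");

			if (!f)
				return 1;
			while (fread(buf, 1, sizeof(buf), f) > 0)
				;
			fclose(f);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%.2f us per read\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / 1000.0 / iters);
		return 0;
	}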

> We would not worry it much as it is a slow path and will not be read
> frequently.
> 
> Suggested-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>

...

> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 1779c98..7383d66 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>   * Zone and node-based page accounting with per cpu differentials.
>   */
>  extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>  extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
> -
> -#ifdef CONFIG_NUMA
> -static inline void zone_numa_state_add(long x, struct zone *zone,
> -				 enum numa_stat_item item)
> -{
> -	atomic_long_add(x, &zone->vm_numa_stat[item]);
> -	atomic_long_add(x, &vm_numa_stat[item]);
> -}
> -
> -static inline unsigned long global_numa_state(enum numa_stat_item item)
> -{
> -	long x = atomic_long_read(&vm_numa_stat[item]);
> -
> -	return x;
> -}
> -
> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
> -					enum numa_stat_item item)
> -{
> -	long x = atomic_long_read(&zone->vm_numa_stat[item]);
> -	int cpu;
> -
> -	for_each_online_cpu(cpu)
> -		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
> -
> -	return x;
> -}
> -#endif /* CONFIG_NUMA */
> +extern u64 __percpu *vm_numa_stat;
>  
>  static inline void zone_page_state_add(long x, struct zone *zone,
>  				 enum zone_stat_item item)
> @@ -234,10 +206,39 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
>  
>  
>  #ifdef CONFIG_NUMA
> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
> +					enum numa_stat_item item)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned long node_numa_state_snapshot(int node,
> +					enum numa_stat_item item)
> +{
> +	unsigned long x = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)

I'm worried about the "for_each_possible..." approach here and elsewhere
in the patch as it can be rather excessive compared to the online number
of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC
the general approach with vmstat is to query just online cpu's / nodes,
and if they go offline, transfer their accumulated stats to some other
"victim"?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28  8:09 ` [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Vlastimil Babka
@ 2017-11-28  8:33   ` kemi
  2017-11-28 18:40   ` Andi Kleen
  1 sibling, 0 replies; 21+ messages in thread
From: kemi @ 2017-11-28  8:33 UTC (permalink / raw)
  To: Vlastimil Babka, Greg Kroah-Hartman, Andrew Morton, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior
  Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel



On 2017-11-28 16:09, Vlastimil Babka wrote:
> On 11/28/2017 07:00 AM, Kemi Wang wrote:
>> The existed implementation of NUMA counters is per logical CPU along with
>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>> effect system's decision and are only read from /proc and /sys, it is a
>> slow path operation and likely tolerate higher overhead. Additionally,
>> usually nodes only have a single zone, except for node 0. And there isn't
>> really any use where you need these hits counts separated by zone.
>>
>> Therefore, we can migrate the implementation of numa stats from per-zone to
>> per-node, and get rid of these global numa counters. It's good enough to
>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>> (e.g. save more than 130+ lines code).
> 
> OK.
> 
>> With this patch, we can see 1.8%(335->329) drop of CPU cycles for single
>> page allocation and deallocation concurrently with 112 threads tested on a
>> 2-sockets skylake platform using Jesper's page_bench03 benchmark.
> 
> To be fair, one can now avoid the overhead completely since 4518085e127d
> ("mm, sysctl: make NUMA stats configurable"). But if we can still
> optimize it, sure.
> 

Yes, I did that several months ago. Both Dave Hansen and I thought that
auto tuning would be better, because people probably do not touch this
interface, but Michal had some concerns about that.

This patch aims to clean up the code for NUMA stats, with a small
performance improvement.

>> Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> Also, it does not cause obvious latency increase when read /proc and /sys
>> on a 2-sockets skylake platform. Latency shown by time command:
>>                            base             head
>> /proc/vmstat            sys 0m0.001s     sys 0m0.001s
>>
>> /sys/devices/system/    sys 0m0.001s     sys 0m0.000s
>> node/node*/numastat
> 
> Well, here I have to point out that the coarse "time" command resolution
> here means the comparison of a single read cannot be compared. You would
> have to e.g. time a loop with enough iterations (which would then be all
> cache-hot, but better than nothing I guess).
> 

It is indeed a coarse comparison, meant only to show that it does not
cause obvious overhead in a slow path.

All right, I will do that to get a more accurate measurement.

>> We would not worry it much as it is a slow path and will not be read
>> frequently.
>>
>> Suggested-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> 
> ...
> 
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1779c98..7383d66 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>>   * Zone and node-based page accounting with per cpu differentials.
>>   */
>>  extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>  extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>> -
>> -#ifdef CONFIG_NUMA
>> -static inline void zone_numa_state_add(long x, struct zone *zone,
>> -				 enum numa_stat_item item)
>> -{
>> -	atomic_long_add(x, &zone->vm_numa_stat[item]);
>> -	atomic_long_add(x, &vm_numa_stat[item]);
>> -}
>> -
>> -static inline unsigned long global_numa_state(enum numa_stat_item item)
>> -{
>> -	long x = atomic_long_read(&vm_numa_stat[item]);
>> -
>> -	return x;
>> -}
>> -
>> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> -					enum numa_stat_item item)
>> -{
>> -	long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> -	int cpu;
>> -
>> -	for_each_online_cpu(cpu)
>> -		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>> -
>> -	return x;
>> -}
>> -#endif /* CONFIG_NUMA */
>> +extern u64 __percpu *vm_numa_stat;
>>  
>>  static inline void zone_page_state_add(long x, struct zone *zone,
>>  				 enum zone_stat_item item)
>> @@ -234,10 +206,39 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
>>  
>>  
>>  #ifdef CONFIG_NUMA
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> +					enum numa_stat_item item)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline unsigned long node_numa_state_snapshot(int node,
>> +					enum numa_stat_item item)
>> +{
>> +	unsigned long x = 0;
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
> 
> I'm worried about the "for_each_possible..." approach here and elsewhere
> in the patch as it can be rather excessive compared to the online number
> of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC
> the general approach with vmstat is to query just online cpu's / nodes,
> and if they go offline, transfer their accumulated stats to some other
> "victim"?
> 

It's a trade-off, I think. "for_each_possible_cpu()" avoids having to
fold the local CPU stats into a global counter (actually, into the first
available CPU in this patch) when a CPU goes offline/dead.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28  8:09 ` [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Vlastimil Babka
  2017-11-28  8:33   ` kemi
@ 2017-11-28 18:40   ` Andi Kleen
  2017-11-28 21:56     ` Andrew Morton
  2017-11-28 22:52     ` Vlastimil Babka
  1 sibling, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2017-11-28 18:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kemi Wang, Greg Kroah-Hartman, Andrew Morton, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave,
	Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel

Vlastimil Babka <vbabka@suse.cz> writes:
>
> I'm worried about the "for_each_possible..." approach here and elsewhere
> in the patch as it can be rather excessive compared to the online number
> of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC

Even if they report a few hundred extra CPUs, reading some more shared
cache lines is very cheap. The prefetcher usually quickly figures out such
a pattern and reads it all in parallel.

I doubt it will be noticeable, especially not in a slow path
like reading something from proc/sys.

> the general approach with vmstat is to query just online cpu's / nodes,
> and if they go offline, transfer their accumulated stats to some other
> "victim"?

That's very complicated, and unlikely to be worth it.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28 18:40   ` Andi Kleen
@ 2017-11-28 21:56     ` Andrew Morton
  2017-11-28 22:52     ` Vlastimil Babka
  1 sibling, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2017-11-28 21:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vlastimil Babka, Kemi Wang, Greg Kroah-Hartman, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave,
	Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel

On Tue, 28 Nov 2017 10:40:52 -0800 Andi Kleen <ak@linux.intel.com> wrote:

> Vlastimil Babka <vbabka@suse.cz> writes:
> >
> > I'm worried about the "for_each_possible..." approach here and elsewhere
> > in the patch as it can be rather excessive compared to the online number
> > of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC
> 
> Even if they report a few hundred extra reading some more shared cache lines
> is very cheap. The prefetcher usually quickly figures out such a pattern
> and reads it all in parallel.
> 
> I doubt it will be noticeable, especially not in a slow path
> like reading something from proc/sys.

We say that, then a few years later it comes back and bites us on our
trailing edges.

> > the general approach with vmstat is to query just online cpu's / nodes,
> > and if they go offline, transfer their accumulated stats to some other
> > "victim"?
> 
> That's very complicated, and unlikely to be worth it.

for_each_online_cpu() and a few-line hotplug handler?  I'd like to see
an implementation before deciding that it's too complex...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28 18:40   ` Andi Kleen
  2017-11-28 21:56     ` Andrew Morton
@ 2017-11-28 22:52     ` Vlastimil Babka
  1 sibling, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-11-28 22:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kemi Wang, Greg Kroah-Hartman, Andrew Morton, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
	Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave,
	Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel

On 11/28/2017 07:40 PM, Andi Kleen wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
>>
>> I'm worried about the "for_each_possible..." approach here and elsewhere
>> in the patch as it can be rather excessive compared to the online number
>> of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC
> 
> Even if they report a few hundred extra reading some more shared cache lines
> is very cheap. The prefetcher usually quickly figures out such a pattern
> and reads it all in parallel.

Hmm, AFAIK the prefetcher works within a page boundary, and here IIUC we
are iterating between pcpu areas in the inner loop, which are further
apart than that? And their number may exhaust the simultaneous prefetch
streams. And the outer loop repeats that for each counter. We might be
either evicting quite a bit of cache, or perhaps the distance between
pcpu areas is such that it will cause collision misses, so we'll be
always cache cold and not even benefit from multiple counters fitting
into a single cache line.

> I doubt it will be noticeable, especially not in a slow path
> like reading something from proc/sys.
> 
>> the general approach with vmstat is to query just online cpu's / nodes,
>> and if they go offline, transfer their accumulated stats to some other
>> "victim"?
> 
> That's very complicated, and unlikely to be worth it.

vm_events_fold_cpu() doesn't look that complicated
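
For the per-cpu array introduced in this patch, the analogous fold on CPU
offline could indeed be just a few lines, e.g. (untested sketch, hooked
into the existing CPU-dead path; numa_stats_fold_cpu() is a made-up name
modeled on vm_events_fold_cpu()):

	static void numa_stats_fold_cpu(int cpu)
	{
		u64 *src = per_cpu_ptr(vm_numa_stat, cpu);
		int i;

		/* move the dead cpu's counters to the current (online) cpu */
		for (i = 0; i < num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS; i++) {
			u64 __percpu *p = vm_numa_stat + i;

			this_cpu_add(*p, src[i]);
			src[i] = 0;
		}
	}

The readers could then iterate with for_each_online_cpu() without losing
counts from offlined CPUs.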

> 
> -Andi
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-28  6:00 [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Kemi Wang
  2017-11-28  6:00 ` [PATCH 2/2] mm: Rename zone_statistics() to numa_statistics() Kemi Wang
  2017-11-28  8:09 ` [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Vlastimil Babka
@ 2017-11-29 12:17 ` Michal Hocko
  2017-11-30  5:56   ` kemi
  2 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-11-29 12:17 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Tue 28-11-17 14:00:23, Kemi Wang wrote:
> The existed implementation of NUMA counters is per logical CPU along with
> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
> effect system's decision and are only read from /proc and /sys, it is a
> slow path operation and likely tolerate higher overhead. Additionally,
> usually nodes only have a single zone, except for node 0. And there isn't
> really any use where you need these hits counts separated by zone.
> 
> Therefore, we can migrate the implementation of numa stats from per-zone to
> per-node, and get rid of these global numa counters. It's good enough to
> keep everything in a per cpu ptr of type u64, and sum them up when need, as
> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
> (e.g. save more than 130+ lines code).

I agree. Having these stats per zone is a bit of overcomplication. The
only consumer is /proc/zoneinfo and I would argue this doesn't justify
the additional complexity. Who does really need to know per zone broken
out numbers?

Anyway, I haven't checked your implementation too deeply but why don't
you simply define static percpu array for each numa node?
[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> +	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> +	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> +	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-29 12:17 ` Michal Hocko
@ 2017-11-30  5:56   ` kemi
  2017-11-30  8:53     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: kemi @ 2017-11-30  5:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-11-29 20:17, Michal Hocko wrote:
> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>> The existed implementation of NUMA counters is per logical CPU along with
>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>> effect system's decision and are only read from /proc and /sys, it is a
>> slow path operation and likely tolerate higher overhead. Additionally,
>> usually nodes only have a single zone, except for node 0. And there isn't
>> really any use where you need these hits counts separated by zone.
>>
>> Therefore, we can migrate the implementation of numa stats from per-zone to
>> per-node, and get rid of these global numa counters. It's good enough to
>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>> (e.g. save more than 130+ lines code).
> 
> I agree. Having these stats per zone is a bit of overcomplication. The
> only consumer is /proc/zoneinfo and I would argue this doesn't justify
> the additional complexity. Who does really need to know per zone broken
> out numbers?
> 
> Anyway, I haven't checked your implementation too deeply but why don't
> you simply define static percpu array for each numa node?

To be honest, there are two other ways I can think of, listed below, but
I don't think they are simpler than my current implementation. Maybe you
have a better idea.

static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
But that is not correct (the array size is not a compile-time constant).

Or we could add a u64 percpu array of size NR_VM_NUMA_STAT_ITEMS to struct pglist_data.

My current implementation is quite straightforward: by combining all the
local counters together, a single percpu array of size
num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS is enough.
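
For completeness, the second alternative (a per-node field rather than
one combined array) would look roughly like this, with a hypothetical
u64 __percpu *vm_numa_stat_diff field added to struct pglist_data --
just a sketch, not what this patch does:

	/* fast path */
	static inline void __inc_numa_state(int node, enum numa_stat_item item)
	{
		__this_cpu_inc(NODE_DATA(node)->vm_numa_stat_diff[item]);
	}

	/* slow path */
	static unsigned long node_numa_state_snapshot(int node,
						      enum numa_stat_item item)
	{
		unsigned long x = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			x += per_cpu_ptr(NODE_DATA(node)->vm_numa_stat_diff,
					 cpu)[item];

		return x;
	}

The end result is the same set of counters; the difference is mostly
where the per-cpu memory hangs and how it is allocated (per node vs. one
block).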
		
> [...]
>> +extern u64 __percpu *vm_numa_stat;
> [...]
>> +#ifdef CONFIG_NUMA
>> +	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
>> +	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
>> +	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
>> +#endif

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-30  5:56   ` kemi
@ 2017-11-30  8:53     ` Michal Hocko
  2017-11-30  9:32       ` kemi
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-11-30  8:53 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Thu 30-11-17 13:56:13, kemi wrote:
> 
> 
> On 2017-11-29 20:17, Michal Hocko wrote:
> > On Tue 28-11-17 14:00:23, Kemi Wang wrote:
> >> The existed implementation of NUMA counters is per logical CPU along with
> >> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
> >> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
> >> effect system's decision and are only read from /proc and /sys, it is a
> >> slow path operation and likely tolerate higher overhead. Additionally,
> >> usually nodes only have a single zone, except for node 0. And there isn't
> >> really any use where you need these hits counts separated by zone.
> >>
> >> Therefore, we can migrate the implementation of numa stats from per-zone to
> >> per-node, and get rid of these global numa counters. It's good enough to
> >> keep everything in a per cpu ptr of type u64, and sum them up when need, as
> >> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
> >> (e.g. save more than 130+ lines code).
> > 
> > I agree. Having these stats per zone is a bit of overcomplication. The
> > only consumer is /proc/zoneinfo and I would argue this doesn't justify
> > the additional complexity. Who does really need to know per zone broken
> > out numbers?
> > 
> > Anyway, I haven't checked your implementation too deeply but why don't
> > you simply define static percpu array for each numa node?
> 
> To be honest, there are another two ways I can think of listed below. but I don't
> think they are simpler than my current implementation. Maybe you have better idea.
> 
> static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
> But it's not correct.
> 
> Or we can add an u64 percpu array with size of NR_VM_NUMA_STAT_ITEMS in struct pglist_data.
> 
> My current implementation is quite straightforward by combining all of local counters
> together, only one percpu array with size of num_possible_nodes()*NR_VM_NUMA_STAT_ITEMS 
> is enough for that.

Well, this is certainly a matter of taste. But let's have a look at what
we have currently. We have per-zone, per-node and NUMA stats. That looks
like one too many to me. Why don't we simply move the whole NUMA stat
thingy into the per-node stats? The code would simplify even more. We are
going to lose the /proc/zoneinfo per-zone data, but we are losing those
without your patch anyway. So I've just scratched the following on top of
your patch and the cumulative diff looks even better:

 drivers/base/node.c    |  22 ++---
 include/linux/mmzone.h |  22 ++---
 include/linux/vmstat.h |  38 +--------
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        |  20 ++---
 mm/vmstat.c            | 221 +------------------------------------------------
 6 files changed, 30 insertions(+), 295 deletions(-)

I haven't tested it at all yet. This is just to show the idea.
---
commit 92f8f58d1b6cb5c54a5a197a42e02126a5f7ea1a
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Nov 30 09:49:45 2017 +0100

    - move NUMA stats to node stats

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0be5fbdadaac..315156310c99 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -190,17 +190,9 @@ static ssize_t node_read_vmstat(struct device *dev,
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
 			     sum_zone_node_page_state(nid, i));
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		n += sprintf(buf+n, "%s %lu\n",
-			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-			     node_numa_state_snapshot(nid, i));
-#endif
-
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n",
-			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-			     NR_VM_NUMA_STAT_ITEMS],
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 			     node_page_state(pgdat, i));
 
 	return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b2d264f8c0c6..2c9c8b13c44b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,20 +115,6 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
-#ifdef CONFIG_NUMA
-enum numa_stat_item {
-	NUMA_HIT,		/* allocated in intended node */
-	NUMA_MISS,		/* allocated in non intended node */
-	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
-	NUMA_LOCAL,		/* allocation from local node */
-	NUMA_OTHER,		/* allocation from other node */
-	NR_VM_NUMA_STAT_ITEMS
-};
-#else
-#define NR_VM_NUMA_STAT_ITEMS 0
-#endif
-
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
@@ -180,6 +166,12 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NUMA_HIT,		/* allocated in intended node */
+	NUMA_MISS,		/* allocated in non intended node */
+	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
+	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
+	NUMA_LOCAL,		/* allocation from local node */
+	NUMA_OTHER,		/* allocation from other node */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c07850f413de..cc1edd95e949 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -187,19 +187,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 #endif
 	return x;
 }
-
 #ifdef CONFIG_NUMA
-extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
+extern unsigned long node_page_state(struct pglist_data *pgdat,
+                                               enum node_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
-extern unsigned long node_page_state(struct pglist_data *pgdat,
-						enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
 #endif /* CONFIG_NUMA */
-
 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
 #define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..84e72f2b5748 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1939,7 +1939,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 		return page;
 	if (page && page_to_nid(page) == nid) {
 		preempt_disable();
-		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+		inc_node_page_state(page, NUMA_INTERLEAVE_HIT);
 		preempt_enable();
 	}
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 044daba8c11a..c8e34157f7b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2785,25 +2785,25 @@ int __isolate_free_page(struct page *page, unsigned int order)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void zone_statistics(int preferred_nid, int page_nid)
 {
 #ifdef CONFIG_NUMA
-	enum numa_stat_item local_stat = NUMA_LOCAL;
+	enum node_stat_item local_stat = NUMA_LOCAL;
 
 	/* skip numa counters update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return;
 
-	if (z->node != numa_node_id())
+	if (page_nid != numa_node_id())
 		local_stat = NUMA_OTHER;
 
-	if (z->node == preferred_zone->node)
-		__inc_numa_state(z, NUMA_HIT);
+	if (page_nid == preferred_nid)
+		inc_node_state(NODE_DATA(page_nid), NUMA_HIT);
 	else {
-		__inc_numa_state(z, NUMA_MISS);
-		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
+		inc_node_state(NODE_DATA(page_nid), NUMA_MISS);
+		inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
 	}
-	__inc_numa_state(z, local_stat);
+	inc_node_state(NODE_DATA(page_nid), local_stat);
 #endif
 }
 
@@ -2847,7 +2847,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	page = __rmqueue_pcplist(zone,  migratetype, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-		zone_statistics(preferred_zone, zone);
+		zone_statistics(preferred_zone->node, zone->node);
 	}
 	local_irq_restore(flags);
 	return page;
@@ -2895,7 +2895,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				  get_pcppage_migratetype(page));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone);
+	zone_statistics(preferred_zone->node, zone->node);
 	local_irq_restore(flags);
 
 out:
@@ -5580,7 +5580,6 @@ void __init setup_per_cpu_pageset(void)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
-	size_t size, align;
 
 	for_each_populated_zone(zone)
 		setup_zone_pageset(zone);
@@ -5588,12 +5587,6 @@ void __init setup_per_cpu_pageset(void)
 	for_each_online_pgdat(pgdat)
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
-
-#ifdef CONFIG_NUMA
-	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
-	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
-	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
-#endif
 }
 
 static __meminit void zone_pcp_init(struct zone *zone)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bbabd96d1a4b..c9739104589f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -33,17 +33,6 @@
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
-static void invalid_numa_statistics(void)
-{
-	int i, cpu;
-
-	for_each_possible_cpu(cpu) {
-		for (i = 0; i < num_possible_nodes() *
-				NR_VM_NUMA_STAT_ITEMS; i++)
-			per_cpu_ptr(vm_numa_stat, cpu)[i] = 0;
-	}
-}
-
 static DEFINE_MUTEX(vm_numa_stat_lock);
 
 int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
@@ -65,7 +54,6 @@ int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
 		pr_info("enable numa statistics\n");
 	} else {
 		static_branch_disable(&vm_numa_stat_key);
-		invalid_numa_statistics();
 		pr_info("disable numa statistics, and clear numa counters\n");
 	}
 
@@ -829,49 +817,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 }
 #endif
 
-#ifdef CONFIG_NUMA
-void __inc_numa_state(struct zone *zone,
-				 enum numa_stat_item item)
-{
-	int offset = zone->node * NR_VM_NUMA_STAT_ITEMS + item;
-	u64 __percpu *p = vm_numa_stat + offset;
-
-	__this_cpu_inc(*p);
-}
-
-/*
- * Determine the per node value of a stat item. This function
- * is called frequently in a NUMA machine, so try to be as
- * frugal as possible.
- */
-unsigned long sum_zone_node_page_state(int node,
-				 enum zone_stat_item item)
-{
-	struct zone *zones = NODE_DATA(node)->node_zones;
-	int i;
-	unsigned long count = 0;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		count += zone_page_state(zones + i, item);
-
-	return count;
-}
-
-/*
- * Determine the per node value of a stat item.
- */
-unsigned long node_page_state(struct pglist_data *pgdat,
-				enum node_stat_item item)
-{
-	long x = atomic_long_read(&pgdat->vm_stat[item]);
-#ifdef CONFIG_SMP
-	if (x < 0)
-		x = 0;
-#endif
-	return x;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 
 struct contig_page_info {
@@ -1441,8 +1386,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			seq_printf(m, "\n      %-12s %lu",
-				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-				NR_VM_NUMA_STAT_ITEMS],
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 				node_page_state(pgdat, i));
 		}
 	}
@@ -1479,13 +1423,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		seq_printf(m, "\n      %-12s %lu",
-				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-				zone_numa_state_snapshot(zone, i));
-#endif
-
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1560,7 +1497,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
-			  NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
@@ -1576,12 +1512,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_zone_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		v[i] = global_numa_state(i);
-	v += NR_VM_NUMA_STAT_ITEMS;
-#endif
-
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		v[i] = global_node_page_state(i);
 	v += NR_VM_NODE_STAT_ITEMS;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-30  8:53     ` Michal Hocko
@ 2017-11-30  9:32       ` kemi
  2017-11-30  9:45         ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: kemi @ 2017-11-30  9:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-11-30 16:53, Michal Hocko wrote:
> On Thu 30-11-17 13:56:13, kemi wrote:
>>
>>
>> On 2017-11-29 20:17, Michal Hocko wrote:
>>> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>>>> The existed implementation of NUMA counters is per logical CPU along with
>>>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>>>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>>>> effect system's decision and are only read from /proc and /sys, it is a
>>>> slow path operation and likely tolerate higher overhead. Additionally,
>>>> usually nodes only have a single zone, except for node 0. And there isn't
>>>> really any use where you need these hits counts separated by zone.
>>>>
>>>> Therefore, we can migrate the implementation of numa stats from per-zone to
>>>> per-node, and get rid of these global numa counters. It's good enough to
>>>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>>>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>>>> (e.g. save more than 130+ lines code).
>>>
>>> I agree. Having these stats per zone is a bit of overcomplication. The
>>> only consumer is /proc/zoneinfo and I would argue this doesn't justify
>>> the additional complexity. Who does really need to know per zone broken
>>> out numbers?
>>>
>>> Anyway, I haven't checked your implementation too deeply but why don't
>>> you simply define static percpu array for each numa node?
>>
>> To be honest, there are another two ways I can think of listed below. but I don't
>> think they are simpler than my current implementation. Maybe you have better idea.
>>
>> static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
>> But it's not correct.
>>
>> Or we can add an u64 percpu array with size of NR_VM_NUMA_STAT_ITEMS in struct pglist_data.
>>
>> My current implementation is quite straightforward by combining all of local counters
>> together, only one percpu array with size of num_possible_nodes()*NR_VM_NUMA_STAT_ITEMS 
>> is enough for that.
> 
> Well, this is certainly a matter of taste. But let's have a look what we
> have currently. We have per zone, per node and numa stats. That looks one
> way to many to me. Why don't we simply move the whole numa stat thingy
> into per node stats? The code would simplify even more. We are going to
> lose /proc/zoneinfo per-zone data but we are losing those without your
> patch anyway. So I've just scratched the following on your patch and the
> cumulative diff looks even better
> 
>  drivers/base/node.c    |  22 ++---
>  include/linux/mmzone.h |  22 ++---
>  include/linux/vmstat.h |  38 +--------
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        |  20 ++---
>  mm/vmstat.c            | 221 +------------------------------------------------
>  6 files changed, 30 insertions(+), 295 deletions(-)
> 
> I haven't tested it at all yet. This is just to show the idea.
> ---
> commit 92f8f58d1b6cb5c54a5a197a42e02126a5f7ea1a
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Thu Nov 30 09:49:45 2017 +0100
> 
>     - move NUMA stats to node stats
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0be5fbdadaac..315156310c99 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -190,17 +190,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>  			     sum_zone_node_page_state(nid, i));
>  
> -#ifdef CONFIG_NUMA
> -	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
> -		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -			     node_numa_state_snapshot(nid, i));
> -#endif
> -
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>  		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> -			     NR_VM_NUMA_STAT_ITEMS],
> +			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     node_page_state(pgdat, i));
>  
>  	return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b2d264f8c0c6..2c9c8b13c44b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -115,20 +115,6 @@ struct zone_padding {
>  #define ZONE_PADDING(name)
>  #endif
>  
> -#ifdef CONFIG_NUMA
> -enum numa_stat_item {
> -	NUMA_HIT,		/* allocated in intended node */
> -	NUMA_MISS,		/* allocated in non intended node */
> -	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> -	NUMA_LOCAL,		/* allocation from local node */
> -	NUMA_OTHER,		/* allocation from other node */
> -	NR_VM_NUMA_STAT_ITEMS
> -};
> -#else
> -#define NR_VM_NUMA_STAT_ITEMS 0
> -#endif
> -
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> @@ -180,6 +166,12 @@ enum node_stat_item {
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
> +	NUMA_HIT,		/* allocated in intended node */
> +	NUMA_MISS,		/* allocated in non intended node */
> +	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> +	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> +	NUMA_LOCAL,		/* allocation from local node */
> +	NUMA_OTHER,		/* allocation from other node */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>  
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index c07850f413de..cc1edd95e949 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -187,19 +187,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  #endif
>  	return x;
>  }
> -
>  #ifdef CONFIG_NUMA
> -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
> +extern unsigned long node_page_state(struct pglist_data *pgdat,
> +                                               enum node_stat_item item);
>  extern unsigned long sum_zone_node_page_state(int node,
>  					      enum zone_stat_item item);
> -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
> -extern unsigned long node_page_state(struct pglist_data *pgdat,
> -						enum node_stat_item item);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
>  #endif /* CONFIG_NUMA */
> -
>  #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
>  #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
>  #define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..84e72f2b5748 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1939,7 +1939,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>  		return page;
>  	if (page && page_to_nid(page) == nid) {
>  		preempt_disable();
> -		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> +		inc_node_page_state(page, NUMA_INTERLEAVE_HIT);
>  		preempt_enable();
>  	}
>  	return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 044daba8c11a..c8e34157f7b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2785,25 +2785,25 @@ int __isolate_free_page(struct page *page, unsigned int order)
>   *
>   * Must be called with interrupts disabled.
>   */
> -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
> +static inline void zone_statistics(int preferred_nid, int page_nid)
>  {
>  #ifdef CONFIG_NUMA
> -	enum numa_stat_item local_stat = NUMA_LOCAL;
> +	enum node_stat_item local_stat = NUMA_LOCAL;
>  
>  	/* skip numa counters update if numa stats is disabled */
>  	if (!static_branch_likely(&vm_numa_stat_key))
>  		return;
>  
> -	if (z->node != numa_node_id())
> +	if (page_nid != numa_node_id())
>  		local_stat = NUMA_OTHER;
>  
> -	if (z->node == preferred_zone->node)
> -		__inc_numa_state(z, NUMA_HIT);
> +	if (page_nid == preferred_nid)
> +		inc_node_state(NODE_DATA(page_nid), NUMA_HIT);
>  	else {
> -		__inc_numa_state(z, NUMA_MISS);
> -		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
> +		inc_node_state(NODE_DATA(page_nid), NUMA_MISS);
> +		inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
>  	}

Your patch saves more code than mine because it reuses the node stats framework
for the numa stats. But it introduces a performance regression because of the
threshold size limitation (125 at most, see calculate_normal_threshold() in
vmstat.c) in inc_node_state().

See the patch "1d90ca8 mm: update NUMA counter threshold size" for details;
this issue was originally reported by Jesper Dangaard Brouer.
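
For reference, the per-node fast path that would be reused looks roughly like
this (paraphrased from mm/vmstat.c and simplified, so treat it as a sketch
rather than the exact code):

void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
{
        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
        s8 __percpu *p = pcp->vm_node_stat_diff + item;
        s8 v, t;

        v = __this_cpu_inc_return(*p);
        t = __this_cpu_read(pcp->stat_threshold);       /* capped at 125 */
        if (unlikely(v > t)) {
                s8 overstep = t >> 1;

                /* fold into the shared node-wide counter */
                node_page_state_add(v + overstep, pgdat, item);
                __this_cpu_write(*p, -overstep);
        }
}

With an s8 diff and the 125 cap, a monotonically increasing counter such as
NUMA_HIT spills into that shared counter roughly every 125 events per CPU,
which is where the cache-line bouncing comes from.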

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-30  9:32       ` kemi
@ 2017-11-30  9:45         ` Michal Hocko
  2017-11-30 11:06           ` Wang, Kemi
  2017-12-08  8:38           ` kemi
  0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-11-30  9:45 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Thu 30-11-17 17:32:08, kemi wrote:
[...]
> Your patch saves more code than mine because the node stats framework is reused
> for numa stats. But it has a performance regression because of the limitation of
> threshold size (125 at most, see calculate_normal_threshold() in vmstat.c) 
> in inc_node_state().

But this "regression" would be visible only on those workloads which
really need to squeeze every single cycle out of the allocation hot path
and those are supposed to disable the accounting altogether. Or is this
visible on a wider variety of workloads?

Do not get me wrong. If we want to make per-node stats more optimal,
then by all means let's do that. But having 3 sets of counters is just
way too much.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-30  9:45         ` Michal Hocko
@ 2017-11-30 11:06           ` Wang, Kemi
  2017-12-08  8:38           ` kemi
  1 sibling, 0 replies; 21+ messages in thread
From: Wang, Kemi @ 2017-11-30 11:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Kleen, Andi, Chen, Tim C,
	Jesper Dangaard Brouer, Huang, Ying, Lu, Aaron, Li, Aubrey,
	Linux MM, Linux Kernel

Of course, we should do that AFAP. Thanks for your comments :)

-----Original Message-----
From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Michal Hocko
Sent: Thursday, November 30, 2017 5:45 PM
To: Wang, Kemi <kemi.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>; Andrew Morton <akpm@linux-foundation.org>; Vlastimil Babka <vbabka@suse.cz>; Mel Gorman <mgorman@techsingularity.net>; Johannes Weiner <hannes@cmpxchg.org>; Christopher Lameter <cl@linux.com>; YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>; Andrey Ryabinin <aryabinin@virtuozzo.com>; Nikolay Borisov <nborisov@suse.com>; Pavel Tatashin <pasha.tatashin@oracle.com>; David Rientjes <rientjes@google.com>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; Dave <dave.hansen@linux.intel.com>; Kleen, Andi <andi.kleen@intel.com>; Chen, Tim C <tim.c.chen@intel.com>; Jesper Dangaard Brouer <brouer@redhat.com>; Huang, Ying <ying.huang@intel.com>; Lu, Aaron <aaron.lu@intel.com>; Li, Aubrey <aubrey.li@intel.com>; Linux MM <linux-mm@kvack.org>; Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Thu 30-11-17 17:32:08, kemi wrote:
[...]
> Your patch saves more code than mine because the node stats framework 
> is reused for numa stats. But it has a performance regression because 
> of the limitation of threshold size (125 at most, see 
> calculate_normal_threshold() in vmstat.c) in inc_node_state().

But this "regression" would be visible only on those workloads which really need to squeeze every single cycle out of the allocation hot path and those are supposed to disable the accounting altogether. Or is this visible on a wider variety of workloads.

Do not get me wrong. If we want to make per-node stats more optimal, then by all means let's do that. But having 3 sets of counters is just way to much.

--
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-11-30  9:45         ` Michal Hocko
  2017-11-30 11:06           ` Wang, Kemi
@ 2017-12-08  8:38           ` kemi
  2017-12-08  8:47             ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: kemi @ 2017-12-08  8:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-11-30 17:45, Michal Hocko wrote:
> On Thu 30-11-17 17:32:08, kemi wrote:

> Do not get me wrong. If we want to make per-node stats more optimal,
> then by all means let's do that. But having 3 sets of counters is just
> way to much.
> 

Hi, Michal
  Apologies for responding late in this email thread.

After thinking about how to optimize our per-node stats more gracefully,
we could add a u64 vm_numa_stat_diff[] array to struct per_cpu_nodestat.
That way everything stays in per-cpu counters, and they are only summed up
when /proc or /sys is read for the numa stats.
What do you think? Thanks.
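
A rough sketch of what I mean (the helper names and the exact layout below
are only illustrative, not a final patch):

struct per_cpu_nodestat {
        s8 stat_threshold;
        s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
        /* proposed: NUMA events, no threshold, folded only on read */
        u64 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
};

/* writer side: a plain per-cpu increment in the allocator fast path */
static inline void __inc_numa_stat(struct pglist_data *pgdat,
                                   enum numa_stat_item item)
{
        this_cpu_inc(pgdat->per_cpu_nodestats->vm_numa_stat_diff[item]);
}

/* reader side: sum the per-cpu diffs only when /proc or /sys is read */
static u64 numa_stat_snapshot(struct pglist_data *pgdat,
                              enum numa_stat_item item)
{
        u64 x = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                x += per_cpu_ptr(pgdat->per_cpu_nodestats,
                                 cpu)->vm_numa_stat_diff[item];

        return x;
}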

The motivation for this modification is listed below:

1) Thanks to the 0-day system, a bug was reported against the V1 patch:

[    0.000000] BUG: unable to handle kernel paging request at 0392b000
[    0.000000] IP: __inc_numa_state+0x2a/0x34
[    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff53 
[    0.000000] Oops: 0002 [#1] PREEMPT SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-12996-g81611e2 #1
[    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    0.000000] task: cbf56000 task.stack: cbf4e000
[    0.000000] EIP: __inc_numa_state+0x2a/0x34
[    0.000000] EFLAGS: 00210006 CPU: 0
[    0.000000] EAX: 0392b000 EBX: 00000000 ECX: 00000000 EDX: cbef90ef
[    0.000000] ESI: cffdb320 EDI: 00000004 EBP: cbf4fd80 ESP: cbf4fd7c
[    0.000000]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    0.000000] CR0: 80050033 CR2: 0392b000 CR3: 0c0a8000 CR4: 000406b0
[    0.000000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    0.000000] DR6: fffe0ff0 DR7: 00000400
[    0.000000] Call Trace:
[    0.000000]  zone_statistics+0x4d/0x5b
[    0.000000]  get_page_from_freelist+0x257/0x993
[    0.000000]  __alloc_pages_nodemask+0x108/0x8c8
[    0.000000]  ? __bitmap_weight+0x38/0x41
[    0.000000]  ? pcpu_next_md_free_region+0xe/0xab
[    0.000000]  ? pcpu_chunk_refresh_hint+0x8b/0xbc
[    0.000000]  ? pcpu_chunk_slot+0x1e/0x24
[    0.000000]  ? pcpu_chunk_relocate+0x15/0x6d
[    0.000000]  ? find_next_bit+0xa/0xd
[    0.000000]  ? cpumask_next+0x15/0x18
[    0.000000]  ? pcpu_alloc+0x399/0x538
[    0.000000]  cache_grow_begin+0x85/0x31c
[    0.000000]  ____cache_alloc+0x147/0x1e0
[    0.000000]  ? debug_smp_processor_id+0x12/0x14
[    0.000000]  kmem_cache_alloc+0x80/0x145
[    0.000000]  create_kmalloc_cache+0x22/0x64
[    0.000000]  kmem_cache_init+0xf9/0x16c
[    0.000000]  start_kernel+0x1d4/0x3d6
[    0.000000]  i386_start_kernel+0x9a/0x9e
[    0.000000]  startup_32_smp+0x15f/0x170

That is because the u64 percpu pointer vm_numa_stat is dereferenced before it is initialized.

[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> +	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> +	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> +	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif

The pointer is dereferenced in mm_init->kmem_cache_init->create_kmalloc_cache->...->
__alloc_pages() when CONFIG_SLAB/CONFIG_ZONE_DMA is set in kconfig, while
vm_numa_stat is only initialized in setup_per_cpu_pageset(), which runs after
mm_init(). The proposal mentioned above fixes this by making the numa stats
counters ready before mm_init() is called (start_kernel->build_all_zonelists()
can help to do that).

2) Compared to the V1 patch, this modification makes the semantics of the per-node
numa stats clearer for review and maintenance.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-08  8:38           ` kemi
@ 2017-12-08  8:47             ` Michal Hocko
  2017-12-12  2:05               ` kemi
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-12-08  8:47 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Fri 08-12-17 16:38:46, kemi wrote:
> 
> 
> On 2017年11月30日 17:45, Michal Hocko wrote:
> > On Thu 30-11-17 17:32:08, kemi wrote:
> 
> > Do not get me wrong. If we want to make per-node stats more optimal,
> > then by all means let's do that. But having 3 sets of counters is just
> > way to much.
> > 
> 
> Hi, Michal
>   Apologize to respond later in this email thread.
> 
> After thinking about how to optimize our per-node stats more gracefully, 
> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> we can keep everything in per cpu counter and sum them up when read /proc
> or /sys for numa stats. 
> What's your idea for that? thanks

I would like to see a strong argument why we cannot make it a _standard_
node counter.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-08  8:47             ` Michal Hocko
@ 2017-12-12  2:05               ` kemi
  2017-12-12  8:11                 ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: kemi @ 2017-12-12  2:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-12-08 16:47, Michal Hocko wrote:
> On Fri 08-12-17 16:38:46, kemi wrote:
>>
>>
>> On 2017年11月30日 17:45, Michal Hocko wrote:
>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>
>> After thinking about how to optimize our per-node stats more gracefully, 
>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>> we can keep everything in per cpu counter and sum them up when read /proc
>> or /sys for numa stats. 
>> What's your idea for that? thanks
> 
> I would like to see a strong argument why we cannot make it a _standard_
> node counter.
> 

All right.
This issue was first reported and discussed at the 2017 MM summit, in the
topic "Provoking and fixing memory bottlenecks - Focused on the page
allocator" presented by Jesper:

http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
(slide 15/16)

As you know, the page allocator is too slow and has become a bottleneck
in high-speed networking.
Jesper also showed some data in that presentation: with a micro benchmark
stressing the order-0 fast path (per-CPU pages), *32%* extra CPU cycle cost
(143->97) comes from CONFIG_NUMA.

When I took a look at this issue, I reproduced it and got a result similar
to Jesper's. Furthermore, with help from Jesper, the overhead was root
caused: it comes from an extra level of function calls such as
zone_statistics() (*10%*, nearly 1/3 of it, including __inc_numa_state()),
policy_zonelist(), get_task_policy(), policy_nodemask() and so on (perf
profiling of cpu cycles). zone_statistics() is the biggest contributor
introduced by CONFIG_NUMA in the fast path where we can do something to
optimize the page allocator. Plus, the overhead of zone_statistics()
increases significantly with more and more cpu cores and nodes due to
cache bouncing.

Therefore, we previously submitted a patch to mitigate the overhead of
zone_statistics() by reducing the global NUMA counter update frequency
(enlarging the threshold size, as suggested by Dave Hansen). I would also
like to have an implementation of a "_standard_" node counter for the NUMA
stats, but I wonder how we can keep the performance gain at the same time.
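
The core of that earlier change (commit 1d90ca8) was, roughly from memory and
slightly simplified, to give the NUMA diffs a wider unsigned type and a much
larger fold threshold than the generic 125 cap, so the shared counters are
touched orders of magnitude less often:

#define NUMA_STATS_THRESHOLD    (U16_MAX - 2)

static inline void __inc_numa_state(struct zone *zone,
                                    enum numa_stat_item item)
{
        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        u16 __percpu *p = pcp->vm_numa_stat_diff + item;
        u16 v;

        v = __this_cpu_inc_return(*p);
        if (unlikely(v > NUMA_STATS_THRESHOLD)) {
                /* fold tens of thousands of events at once */
                zone_numa_state_add(v, zone, item);
                __this_cpu_write(*p, 0);
        }
}

The question is how to keep that kind of fold frequency once the counters
move into the standard node stats, whose diffs are s8.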

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-12  2:05               ` kemi
@ 2017-12-12  8:11                 ` Michal Hocko
  2017-12-14  1:40                   ` kemi
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-12-12  8:11 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Tue 12-12-17 10:05:26, kemi wrote:
> 
> 
> On 2017年12月08日 16:47, Michal Hocko wrote:
> > On Fri 08-12-17 16:38:46, kemi wrote:
> >>
> >>
> >> On 2017年11月30日 17:45, Michal Hocko wrote:
> >>> On Thu 30-11-17 17:32:08, kemi wrote:
> >>
> >> After thinking about how to optimize our per-node stats more gracefully, 
> >> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> >> we can keep everything in per cpu counter and sum them up when read /proc
> >> or /sys for numa stats. 
> >> What's your idea for that? thanks
> > 
> > I would like to see a strong argument why we cannot make it a _standard_
> > node counter.
> > 
> 
> all right. 
> This issue is first reported and discussed in 2017 MM summit, referred to
> the topic "Provoking and fixing memory bottlenecks -Focused on the page 
> allocator presentation" presented by Jesper.
> 
> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
> 2017-JesperBrouer.pdf (slide 15/16)
> 
> As you know, page allocator is too slow and has becomes a bottleneck
> in high-speed network.
> Jesper also showed some data in that presentation: with micro benchmark 
> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost 
> (143->97) comes from CONFIG_NUMA. 
> 
> When I took a look at this issue, I reproduced this issue and got a
> similar result to Jesper's. Furthermore, with the help from Jesper, 
> the overhead is root caused and the real cause of this overhead comes
> from an extra level of function calls such as zone_statistics() (*10%*,
> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
> policy_nodemask and etc (perf profiling cpu cycles).  zone_statistics() 
> is the biggest one introduced by CONFIG_NUMA in fast path that we can 
> do something for optimizing page allocator. Plus, the overhead of 
> zone_statistics() significantly increase with more and more cpu 
> cores and nodes due to cache bouncing.
> 
> Therefore, we submitted a patch before to mitigate the overhead of 
> zone_statistics() by reducing global NUMA counter update frequency 
> (enlarge threshold size, as suggested by Dave Hansen). I also would
> like to have an implementation of a "_standard_node counter" for NUMA
> stats, but I wonder how we can keep the performance gain at the
> same time.

I understand all that. But we do have a way to put all that overhead
away by disabling the stats altogether. I presume that CPU cycle
sensitive workloads would simply use that option because the stats are
quite limited in their usefulness anyway IMHO. So we are back to: Do
normal workloads care all that much to have a 3rd way to account for
events? I haven't heard a sound argument for that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-12  8:11                 ` Michal Hocko
@ 2017-12-14  1:40                   ` kemi
  2017-12-14  7:29                     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: kemi @ 2017-12-14  1:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-12-12 16:11, Michal Hocko wrote:
> On Tue 12-12-17 10:05:26, kemi wrote:
>>
>>
>> On 2017年12月08日 16:47, Michal Hocko wrote:
>>> On Fri 08-12-17 16:38:46, kemi wrote:
>>>>
>>>>
>>>> On 2017年11月30日 17:45, Michal Hocko wrote:
>>>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>>>
>>>> After thinking about how to optimize our per-node stats more gracefully, 
>>>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>>>> we can keep everything in per cpu counter and sum them up when read /proc
>>>> or /sys for numa stats. 
>>>> What's your idea for that? thanks
>>>
>>> I would like to see a strong argument why we cannot make it a _standard_
>>> node counter.
>>>
>>
>> all right. 
>> This issue is first reported and discussed in 2017 MM summit, referred to
>> the topic "Provoking and fixing memory bottlenecks -Focused on the page 
>> allocator presentation" presented by Jesper.
>>
>> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
>> 2017-JesperBrouer.pdf (slide 15/16)
>>
>> As you know, page allocator is too slow and has becomes a bottleneck
>> in high-speed network.
>> Jesper also showed some data in that presentation: with micro benchmark 
>> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost 
>> (143->97) comes from CONFIG_NUMA. 
>>
>> When I took a look at this issue, I reproduced this issue and got a
>> similar result to Jesper's. Furthermore, with the help from Jesper, 
>> the overhead is root caused and the real cause of this overhead comes
>> from an extra level of function calls such as zone_statistics() (*10%*,
>> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
>> policy_nodemask and etc (perf profiling cpu cycles).  zone_statistics() 
>> is the biggest one introduced by CONFIG_NUMA in fast path that we can 
>> do something for optimizing page allocator. Plus, the overhead of 
>> zone_statistics() significantly increase with more and more cpu 
>> cores and nodes due to cache bouncing.
>>
>> Therefore, we submitted a patch before to mitigate the overhead of 
>> zone_statistics() by reducing global NUMA counter update frequency 
>> (enlarge threshold size, as suggested by Dave Hansen). I also would
>> like to have an implementation of a "_standard_node counter" for NUMA
>> stats, but I wonder how we can keep the performance gain at the
>> same time.
> 
> I understand all that. But we do have a way to put all that overhead
> away by disabling the stats altogether. I presume that CPU cycle
> sensitive workloads would simply use that option because the stats are
> quite limited in their usefulness anyway IMHO. So we are back to: Do
> normal workloads care all that much to have 3rd way to account for
> events? I haven't heard a sound argument for that.
> 

I'm not a fan of adding code that nobody (or only 0.001% of users) cares about.
We can't depend on that tunable interface too much, because our customers,
or even kernel hackers, may not know about the newly added interface, and
sometimes NUMA stats can't be disabled in their environments. That's the
reason why we spent time on this optimization rather than simply adding a
runtime configuration interface.

Furthermore, the code we optimized is a core area of the kernel that
benefits most kernel activity, more or less I think.

All right, let's look at it another way: does a per-node u64 percpu array
for NUMA stats really make the code too complicated and hard to maintain?
I'm afraid not, IMHO.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-14  1:40                   ` kemi
@ 2017-12-14  7:29                     ` Michal Hocko
  2017-12-14  8:55                       ` kemi
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-12-14  7:29 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Thu 14-12-17 09:40:32, kemi wrote:
> 
> 
> On 2017年12月12日 16:11, Michal Hocko wrote:
> > On Tue 12-12-17 10:05:26, kemi wrote:
> >>
> >>
> >> On 2017年12月08日 16:47, Michal Hocko wrote:
> >>> On Fri 08-12-17 16:38:46, kemi wrote:
> >>>>
> >>>>
> >>>> On 2017年11月30日 17:45, Michal Hocko wrote:
> >>>>> On Thu 30-11-17 17:32:08, kemi wrote:
> >>>>
> >>>> After thinking about how to optimize our per-node stats more gracefully, 
> >>>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> >>>> we can keep everything in per cpu counter and sum them up when read /proc
> >>>> or /sys for numa stats. 
> >>>> What's your idea for that? thanks
> >>>
> >>> I would like to see a strong argument why we cannot make it a _standard_
> >>> node counter.
> >>>
> >>
> >> all right. 
> >> This issue is first reported and discussed in 2017 MM summit, referred to
> >> the topic "Provoking and fixing memory bottlenecks -Focused on the page 
> >> allocator presentation" presented by Jesper.
> >>
> >> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
> >> 2017-JesperBrouer.pdf (slide 15/16)
> >>
> >> As you know, page allocator is too slow and has becomes a bottleneck
> >> in high-speed network.
> >> Jesper also showed some data in that presentation: with micro benchmark 
> >> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost 
> >> (143->97) comes from CONFIG_NUMA. 
> >>
> >> When I took a look at this issue, I reproduced this issue and got a
> >> similar result to Jesper's. Furthermore, with the help from Jesper, 
> >> the overhead is root caused and the real cause of this overhead comes
> >> from an extra level of function calls such as zone_statistics() (*10%*,
> >> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
> >> policy_nodemask and etc (perf profiling cpu cycles).  zone_statistics() 
> >> is the biggest one introduced by CONFIG_NUMA in fast path that we can 
> >> do something for optimizing page allocator. Plus, the overhead of 
> >> zone_statistics() significantly increase with more and more cpu 
> >> cores and nodes due to cache bouncing.
> >>
> >> Therefore, we submitted a patch before to mitigate the overhead of 
> >> zone_statistics() by reducing global NUMA counter update frequency 
> >> (enlarge threshold size, as suggested by Dave Hansen). I also would
> >> like to have an implementation of a "_standard_node counter" for NUMA
> >> stats, but I wonder how we can keep the performance gain at the
> >> same time.
> > 
> > I understand all that. But we do have a way to put all that overhead
> > away by disabling the stats altogether. I presume that CPU cycle
> > sensitive workloads would simply use that option because the stats are
> > quite limited in their usefulness anyway IMHO. So we are back to: Do
> > normal workloads care all that much to have 3rd way to account for
> > events? I haven't heard a sound argument for that.
> > 
> 
> I'm not a fan of adding code that nobody(or 0.001%) cares.
> We can't depend on that tunable interface too much, because our customers 
> or even kernel hacker may not know that new added interface,

Come on. If somebody wants to tune the system to squeeze every single
cycle then some tuning is required, and those people can figure it out.

> or sometimes 
> NUMA stats can't be disabled in their environments.

why?

> That's the reason
> why we spent time to do that optimization other than simply adding a runtime
> configuration interface.
> 
> Furthermore, the code we optimized for is the core area of kernel that can
> benefit most of kernel actions, more or less I think.
> 
> All right, let's think about it in another way, does a u64 percpu array per-node
> for NUMA stats really make code too much complicated and hard to maintain?
> I'm afraid not IMHO.

I disagree. The whole numa stat thing has turned out to be nasty to
maintain, for a very limited gain. Now you are just shifting that
elsewhere. Look, there are other counters taken in the allocator, and we do
not want to treat them specially. We have a nice per-cpu infrastructure
here, so I really fail to see why we should code around it. If that can
be improved then by all means let's do it.

So unless you have a strong use case I would vote for the simpler code.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-14  7:29                     ` Michal Hocko
@ 2017-12-14  8:55                       ` kemi
  2017-12-14  9:23                         ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: kemi @ 2017-12-14  8:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel



On 2017-12-14 15:29, Michal Hocko wrote:
> On Thu 14-12-17 09:40:32, kemi wrote:
>>
>>
>> or sometimes 
>> NUMA stats can't be disabled in their environments.
> 
> why?
> 
>> That's the reason
>> why we spent time to do that optimization other than simply adding a runtime
>> configuration interface.
>>
>> Furthermore, the code we optimized for is the core area of kernel that can
>> benefit most of kernel actions, more or less I think.
>>
>> All right, let's think about it in another way, does a u64 percpu array per-node
>> for NUMA stats really make code too much complicated and hard to maintain?
>> I'm afraid not IMHO.
> 
> I disagree. The whole numa stat things has turned out to be nasty to
> maintain. For a very limited gain. Now you are just shifting that
> elsewhere. Look, there are other counters taken in the allocator, we do
> not want to treat them specially. We have a nice per-cpu infrastructure
> here so I really fail to see why we should code-around it. If that can
> be improved then by all means let's do it.
> 

Yes, I agree with you that we can improve the current per-cpu infrastructure.
Could we widen vm_node_stat_diff from s8 to s16 in this "per-cpu
infrastructure" (the per-cpu counter infrastructure already uses s32)?
The range of an s8 no longer seems sufficient with more and more cpu cores,
especially for monotonically increasing counters like the NUMA counters.

                                  before    after (numa moved to per_cpu_nodestat
                                             and the diffs changed from s8 to s16)
sizeof(struct per_cpu_nodestat)     28                    68

If that is ok, we can keep the performance improvement in a clean way as well.
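
Concretely, the widening would only be this (assuming the 28-byte figure above
is one s8 threshold plus the current s8 diff array, and that the six NUMA items
are folded into enum node_stat_item as in your draft):

struct per_cpu_nodestat {
        s8  stat_threshold;
        s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];   /* was s8 */
};

With the NUMA items included and the diffs widened to s16, that matches the
28 -> 68 byte growth in the table above, and the threshold computed by
calculate_normal_threshold() could then be raised beyond the current 125 cap
without risking s8 overflow.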

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
  2017-12-14  8:55                       ` kemi
@ 2017-12-14  9:23                         ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-12-14  9:23 UTC (permalink / raw)
  To: kemi
  Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel

On Thu 14-12-17 16:55:54, kemi wrote:
> 
> 
> On 2017年12月14日 15:29, Michal Hocko wrote:
> > On Thu 14-12-17 09:40:32, kemi wrote:
> >>
> >>
> >> or sometimes 
> >> NUMA stats can't be disabled in their environments.
> > 
> > why?
> > 
> >> That's the reason
> >> why we spent time to do that optimization other than simply adding a runtime
> >> configuration interface.
> >>
> >> Furthermore, the code we optimized for is the core area of kernel that can
> >> benefit most of kernel actions, more or less I think.
> >>
> >> All right, let's think about it in another way, does a u64 percpu array per-node
> >> for NUMA stats really make code too much complicated and hard to maintain?
> >> I'm afraid not IMHO.
> > 
> > I disagree. The whole numa stat things has turned out to be nasty to
> > maintain. For a very limited gain. Now you are just shifting that
> > elsewhere. Look, there are other counters taken in the allocator, we do
> > not want to treat them specially. We have a nice per-cpu infrastructure
> > here so I really fail to see why we should code-around it. If that can
> > be improved then by all means let's do it.
> > 
> 
> Yes, I agree with you that we may improve current per-cpu infrastructure.
> May we have a chance to increase the size of vm_node_stat_diff from s8 to s16 for
> this "per-cpu infrastructure" (s32 in per-cpu counter infrastructure)? The 
> limitation of type s8 seems not enough with more and more cpu cores, especially
> for those monotone increasing type of counters like NUMA counters.
> 
>                                before     after(moving numa to per_cpu_nodestat
>                                               and change s8 to s16)   
> sizeof(struct per_cpu_nodestat)  28                 68
> 
> If ok, we can also keep that improvement in a nice way.

I wouldn't be opposed. Maybe we should make it nr_cpus sized.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-12-14  9:23 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-28  6:00 [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Kemi Wang
2017-11-28  6:00 ` [PATCH 2/2] mm: Rename zone_statistics() to numa_statistics() Kemi Wang
2017-11-28  8:09 ` [PATCH 1/2] mm: NUMA stats code cleanup and enhancement Vlastimil Babka
2017-11-28  8:33   ` kemi
2017-11-28 18:40   ` Andi Kleen
2017-11-28 21:56     ` Andrew Morton
2017-11-28 22:52     ` Vlastimil Babka
2017-11-29 12:17 ` Michal Hocko
2017-11-30  5:56   ` kemi
2017-11-30  8:53     ` Michal Hocko
2017-11-30  9:32       ` kemi
2017-11-30  9:45         ` Michal Hocko
2017-11-30 11:06           ` Wang, Kemi
2017-12-08  8:38           ` kemi
2017-12-08  8:47             ` Michal Hocko
2017-12-12  2:05               ` kemi
2017-12-12  8:11                 ` Michal Hocko
2017-12-14  1:40                   ` kemi
2017-12-14  7:29                     ` Michal Hocko
2017-12-14  8:55                       ` kemi
2017-12-14  9:23                         ` Michal Hocko
