* [PATCH 0/2] Separate NUMA statistics from zone statistics
@ 2017-08-15  8:45 ` Kemi Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Kemi Wang @ 2017-08-15  8:45 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner
  Cc: Dave, Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel, Kemi Wang

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed at the 2017 MM Summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. Much of that overhead comes from cache-line bouncing when the
zone counters (the NUMA-associated counters) are updated in parallel during
multi-threaded page allocation (pointed out by Dave Hansen).

To mitigate this overhead, this patchset separates the NUMA statistics
from the zone statistics framework and raises the NUMA counter threshold
to a fixed size of 32765, since a small threshold greatly increases how
often the local per-cpu counters are folded into the global counter
(suggested by Ying Huang). The rationale is that these statistics counters
don't need to be read often, unlike other VM counters, so it is not a
problem to use a large threshold and make readers more expensive.
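
For illustration only (not part of the patches themselves), here is a rough
userspace model of the write path implemented by __inc_zone_numa_state() in
patch 1/2, using the fixed threshold from patch 2/2. Each CPU keeps a small
signed differential and only folds it into the shared atomic counter once
the threshold is crossed, so the shared cache line bounces rarely:

    #include <stdatomic.h>

    #define NUMA_STAT_THRESHOLD 32765

    static atomic_long global_numa_stat;        /* models zone->vm_numa_stat[item] */
    static _Thread_local short numa_stat_diff;  /* models this cpu's vm_numa_stat_diff[item] */

    static void inc_numa_stat(void)
    {
            short v = ++numa_stat_diff;

            if (v > NUMA_STAT_THRESHOLD) {
                    short overstep = NUMA_STAT_THRESHOLD >> 1;

                    /* fold the local diff (plus overstep) into the global counter */
                    atomic_fetch_add(&global_numa_stat, v + overstep);
                    numa_stat_diff = -overstep;
            }
    }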

With this patchset, we see a 26.6% drop in CPU cycles (537 --> 394, see
below) per single page allocation and reclaim on Jesper's page_bench03
benchmark. Meanwhile, the patchset keeps the same style of virtual memory
statistics with few end-user-visible effects (see the first patch for
details), except that the per-cpu NUMA counts (vm_numa_stat_diff[]) are
added to zone->vm_numa_stat[] when a user *reads* a NUMA counter, to
eliminate deviation.
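
Continuing the sketch above, the reader side (zone_numa_state_snapshot() in
patch 2/2) compensates for the large threshold by folding every CPU's
outstanding differential back in at read time; the per-cpu diffs are
modelled here as a plain array:

    /* models zone->vm_numa_stat[item] plus each cpu's vm_numa_stat_diff[item] */
    static long numa_stat_snapshot(atomic_long *global,
                                   const short *percpu_diff, int nr_cpus)
    {
            long x = atomic_load(global);
            int cpu;

            for (cpu = 0; cpu < nr_cpus; cpu++)
                    x += percpu_diff[cpu];

            return x;
    }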

I ran an experiment of concurrent single page allocation and reclaim using
Jesper's page_bench03 benchmark on a 2-socket Broadwell-based server (88
processors, 126G of memory) with different pcp counter threshold sizes.

Benchmark provided by Jesper Dangaard Brouer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

   Threshold   CPU cycles    Throughput(88 threads)
      32        799         241760478
      64        640         301628829
      125       537         358906028 <==> system by default
      256       468         412397590
      512       428         450550704
      4096      399         482520943
      20000     394         489009617
      30000     395         488017817
      32765     394(-26.6%) 488932078(+36.2%) <==> with this patchset
      N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

Kemi Wang (2):
  mm: Change the call sites of numa statistics items
  mm: Update NUMA counter threshold size

 drivers/base/node.c    |  22 ++++---
 include/linux/mmzone.h |  25 +++++---
 include/linux/vmstat.h |  33 ++++++++++
 mm/page_alloc.c        |  10 +--
 mm/vmstat.c            | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 227 insertions(+), 25 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 41+ messages in thread


* [PATCH 1/2] mm: Change the call sites of numa statistics items
  2017-08-15  8:45 ` Kemi Wang
@ 2017-08-15  8:45   ` Kemi Wang
  -1 siblings, 0 replies; 41+ messages in thread
From: Kemi Wang @ 2017-08-15  8:45 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner
  Cc: Dave, Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel, Kemi Wang

In this patch, NUMA statistics are separated from the zone statistics
framework, and all call sites of NUMA stats are changed to use
numa-stats-specific functions. There is no functional change except that
the NUMA stats values are shown after the zone page stats, and the NUMA
stats threshold is shown after the pcp threshold when users *read* the
zone info.

E.g. cat /proc/zoneinfo
    ***Base***                           ***With this patch***
nr_free_pages 3976                         nr_free_pages 3976
nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
nr_zone_active_anon 0                      nr_zone_active_anon 0
nr_zone_inactive_file 0                    nr_zone_inactive_file 0
nr_zone_active_file 0                      nr_zone_active_file 0
nr_zone_unevictable 0                      nr_zone_unevictable 0
nr_zone_write_pending 0                    nr_zone_write_pending 0
nr_mlock     0                             nr_mlock     0
nr_page_table_pages 0                      nr_page_table_pages 0
nr_kernel_stack 0                          nr_kernel_stack 0
nr_bounce    0                             nr_bounce    0
nr_zspages   0                             nr_zspages   0
numa_hit 0                                *nr_free_cma  0*
numa_miss 0                                numa_hit     0
numa_foreign 0                             numa_miss    0
numa_interleave 0                          numa_foreign 0
numa_local   0                             numa_interleave 0
numa_other   0                             numa_local   0
*nr_free_cma 0*                            numa_other 0
    ...                                        ...
vm stats threshold: 10                     vm stats threshold: 10
    ...                                   *vm numa stats threshold: 10*
                                               ...

The next patch updates the numa stats counter size and threshold.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 drivers/base/node.c    |  22 ++++---
 include/linux/mmzone.h |  25 +++++---
 include/linux/vmstat.h |  29 +++++++++
 mm/page_alloc.c        |  10 +--
 mm/vmstat.c            | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 227 insertions(+), 26 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index d8dc830..12080c6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       sum_zone_node_page_state(dev->id, NUMA_HIT),
-		       sum_zone_node_page_state(dev->id, NUMA_MISS),
-		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
-		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
-		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
+		       sum_zone_node_numa_state(dev->id, NUMA_HIT),
+		       sum_zone_node_numa_state(dev->id, NUMA_MISS),
+		       sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
+		       sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
+		       sum_zone_node_numa_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
 			     sum_zone_node_page_state(nid, i));
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+#ifdef CONFIG_NUMA
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n",
 			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+			     sum_zone_node_numa_state(nid, i));
+#endif
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		n += sprintf(buf+n, "%s %lu\n",
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
+			     NR_VM_ZONE_NUMA_STAT_ITEMS],
 			     node_page_state(pgdat, i));
 
 	return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b..0b11ba7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,6 +114,20 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
+#ifdef CONFIG_NUMA
+enum zone_numa_stat_item {
+	NUMA_HIT,		/* allocated in intended node */
+	NUMA_MISS,		/* allocated in non intended node */
+	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
+	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
+	NUMA_LOCAL,		/* allocation from local node */
+	NUMA_OTHER,		/* allocation from other node */
+	NR_VM_ZONE_NUMA_STAT_ITEMS
+};
+#else
+#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
+#endif
+
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
@@ -132,14 +146,6 @@ enum zone_stat_item {
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
-#ifdef CONFIG_NUMA
-	NUMA_HIT,		/* allocated in intended node */
-	NUMA_MISS,		/* allocated in non intended node */
-	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
-	NUMA_LOCAL,		/* allocation from local node */
-	NUMA_OTHER,		/* allocation from other node */
-#endif
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
@@ -276,6 +282,8 @@ struct per_cpu_pageset {
 	struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
 	s8 expire;
+	s8 numa_stat_threshold;
+	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
 #endif
 #ifdef CONFIG_SMP
 	s8 stat_threshold;
@@ -496,6 +504,7 @@ struct zone {
 	ZONE_PADDING(_pad3_)
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
+	atomic_long_t		vm_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b3d85f3..1e19379 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -107,8 +107,33 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
+#ifdef CONFIG_NUMA
+static inline void zone_numa_state_add(long x, struct zone *zone,
+				 enum zone_numa_stat_item item)
+{
+	atomic_long_add(x, &zone->vm_numa_stat[item]);
+	atomic_long_add(x, &vm_zone_numa_stat[item]);
+}
+
+static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
+{
+	long x = atomic_long_read(&vm_zone_numa_stat[item]);
+
+	return x;
+}
+
+static inline unsigned long zone_numa_state(struct zone *zone,
+					enum zone_numa_stat_item item)
+{
+	long x = atomic_long_read(&zone->vm_numa_stat[item]);
+
+	return x;
+}
+#endif /* CONFIG_NUMA */
+
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
 {
@@ -194,8 +219,12 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
 
 
 #ifdef CONFIG_NUMA
+extern void __inc_zone_numa_state(struct zone *,
+						enum zone_numa_stat_item);
 extern unsigned long sum_zone_node_page_state(int node,
 						enum zone_stat_item item);
+extern unsigned long sum_zone_node_numa_state(int node,
+						enum zone_numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d30e91..7222c47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2720,18 +2720,18 @@ int __isolate_free_page(struct page *page, unsigned int order)
 static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
-	enum zone_stat_item local_stat = NUMA_LOCAL;
+	enum zone_numa_stat_item local_stat = NUMA_LOCAL;
 
 	if (z->node != numa_node_id())
 		local_stat = NUMA_OTHER;
 
 	if (z->node == preferred_zone->node)
-		__inc_zone_state(z, NUMA_HIT);
+		__inc_zone_numa_state(z, NUMA_HIT);
 	else {
-		__inc_zone_state(z, NUMA_MISS);
-		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
+		__inc_zone_numa_state(z, NUMA_MISS);
+		__inc_zone_numa_state(preferred_zone, NUMA_FOREIGN);
 	}
-	__inc_zone_state(z, local_stat);
+	__inc_zone_numa_state(z, local_stat);
 #endif
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441b..5a7fa30 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -87,8 +87,10 @@ void vm_events_fold_cpu(int cpu)
  * vm_stat contains the global counters
  */
 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_zone_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_zone_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
@@ -192,7 +194,10 @@ void refresh_zone_stat_thresholds(void)
 
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
-
+#ifdef CONFIG_NUMA
+			per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
+							= threshold;
+#endif
 			/* Base nodestat threshold on the largest populated zone. */
 			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
 			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
@@ -226,9 +231,14 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_online_cpu(cpu)
+		for_each_online_cpu(cpu) {
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+#ifdef CONFIG_NUMA
+			per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
+							= threshold;
+#endif
+		}
 	}
 }
 
@@ -604,6 +614,32 @@ EXPORT_SYMBOL(dec_node_page_state);
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
+#ifdef CONFIG_NUMA
+static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+{
+	int i;
+	int changes = 0;
+
+	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+		if (zone_diff[i]) {
+			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
+			changes++;
+	}
+
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+		if (numa_diff[i]) {
+			atomic_long_add(numa_diff[i], &vm_zone_numa_stat[i]);
+			changes++;
+	}
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		if (node_diff[i]) {
+			atomic_long_add(node_diff[i], &vm_node_stat[i]);
+			changes++;
+	}
+	return changes;
+}
+#else
 static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
@@ -622,6 +658,7 @@ static int fold_diff(int *zone_diff, int *node_diff)
 	}
 	return changes;
 }
+#endif /* CONFIG_NUMA */
 
 /*
  * Update the zone counters for the current cpu.
@@ -645,6 +682,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+#ifdef CONFIG_NUMA
+	int global_numa_diff[NR_VM_ZONE_NUMA_STAT_ITEMS] = {0, };
+#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
@@ -666,6 +706,18 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			}
 		}
 #ifdef CONFIG_NUMA
+		for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++) {
+			int v;
+
+			v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
+			if (v) {
+
+				atomic_long_add(v, &zone->vm_numa_stat[i]);
+				global_numa_diff[i] += v;
+				__this_cpu_write(p->expire, 3);
+			}
+		}
+
 		if (do_pagesets) {
 			cond_resched();
 			/*
@@ -712,7 +764,11 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 	}
 
+#ifdef CONFIG_NUMA
+	changes += fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+#else
 	changes += fold_diff(global_zone_diff, global_node_diff);
+#endif
 	return changes;
 }
 
@@ -727,6 +783,9 @@ void cpu_vm_stats_fold(int cpu)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+#ifdef CONFIG_NUMA
+	int global_numa_diff[NR_VM_ZONE_NUMA_STAT_ITEMS] = {0, };
+#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
@@ -743,6 +802,18 @@ void cpu_vm_stats_fold(int cpu)
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
+
+#ifdef CONFIG_NUMA
+		for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+			if (p->vm_numa_stat_diff[i]) {
+				int v;
+
+				v = p->vm_numa_stat_diff[i];
+				p->vm_numa_stat_diff[i] = 0;
+				atomic_long_add(v, &zone->vm_numa_stat[i]);
+				global_numa_diff[i] += v;
+			}
+#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -761,7 +832,11 @@ void cpu_vm_stats_fold(int cpu)
 			}
 	}
 
+#ifdef CONFIG_NUMA
+	fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+#else
 	fold_diff(global_zone_diff, global_node_diff);
+#endif
 }
 
 /*
@@ -779,10 +854,37 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			atomic_long_add(v, &zone->vm_stat[i]);
 			atomic_long_add(v, &vm_zone_stat[i]);
 		}
+
+#ifdef CONFIG_NUMA
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+		if (pset->vm_numa_stat_diff[i]) {
+			int v = pset->vm_numa_stat_diff[i];
+			pset->vm_numa_stat_diff[i] = 0;
+			atomic_long_add(v, &zone->vm_numa_stat[i]);
+			atomic_long_add(v, &vm_zone_numa_stat[i]);
+		}
+#endif
 }
 #endif
 
 #ifdef CONFIG_NUMA
+void __inc_zone_numa_state(struct zone *zone,
+				 enum zone_numa_stat_item item)
+{
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_numa_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_inc_return(*p);
+	t = __this_cpu_read(pcp->numa_stat_threshold);
+	if (unlikely(v > t)) {
+		s8 overstep = t >> 1;
+
+		zone_numa_state_add(v + overstep, zone, item);
+		__this_cpu_write(*p, -overstep);
+	}
+}
+
 /*
  * Determine the per node value of a stat item. This function
  * is called frequently in a NUMA machine, so try to be as
@@ -801,6 +903,22 @@ unsigned long sum_zone_node_page_state(int node,
 	return count;
 }
 
+/* Determine the per node value of a numa stat item. To avoid deviation,
+ * the per-cpu stat number in vm_numa_stat_diff[] is also included.
+ */
+unsigned long sum_zone_node_numa_state(int node,
+				 enum zone_numa_stat_item item)
+{
+	struct zone *zones = NODE_DATA(node)->node_zones;
+	int i;
+	unsigned long count = 0;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		count += zone_numa_state(zones + i, item);
+
+	return count;
+}
+
 /*
  * Determine the per node value of a stat item.
  */
@@ -934,6 +1052,9 @@ const char * const vmstat_text[] = {
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	"nr_zspages",
 #endif
+	"nr_free_cma",
+
+	/* enum zone_numa_stat_item counters */
 #ifdef CONFIG_NUMA
 	"numa_hit",
 	"numa_miss",
@@ -942,7 +1063,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"nr_free_cma",
 
 	/* Node-based counters */
 	"nr_inactive_anon",
@@ -1097,7 +1217,6 @@ const char * const vmstat_text[] = {
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
 
-
 #if (defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)) || \
      defined(CONFIG_PROC_FS)
 static void *frag_start(struct seq_file *m, loff_t *pos)
@@ -1375,7 +1494,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			seq_printf(m, "\n      %-12s %lu",
-				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
+				NR_VM_ZONE_NUMA_STAT_ITEMS],
 				node_page_state(pgdat, i));
 		}
 	}
@@ -1412,6 +1532,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
+#ifdef CONFIG_NUMA
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+		seq_printf(m, "\n      %-12s %lu",
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+				zone_numa_state(zone, i));
+#endif
+
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1430,6 +1557,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  vm stats threshold: %d",
 				pageset->stat_threshold);
 #endif
+
+#ifdef CONFIG_NUMA
+		seq_printf(m, "\n  vm numa stats threshold: %d",
+				pageset->numa_stat_threshold);
+#endif
 	}
 	seq_printf(m,
 		   "\n  node_unreclaimable:  %u"
@@ -1488,6 +1620,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+			  NR_VM_ZONE_NUMA_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
@@ -1503,6 +1636,12 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
+#ifdef CONFIG_NUMA
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+		v[i] = global_numa_state(i);
+	v += NR_VM_ZONE_NUMA_STAT_ITEMS;
+#endif
+
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		v[i] = global_node_page_state(i);
 	v += NR_VM_NODE_STAT_ITEMS;
@@ -1604,6 +1743,16 @@ int vmstat_refresh(struct ctl_table *table, int write,
 			err = -EINVAL;
 		}
 	}
+#ifdef CONFIG_NUMA
+	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++) {
+		val = atomic_long_read(&vm_zone_numa_stat[i]);
+		if (val < 0) {
+			pr_warn("%s: %s %ld\n",
+				__func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
+			err = -EINVAL;
+		}
+	}
+#endif
 	if (err)
 		return err;
 	if (write)
@@ -1645,13 +1794,19 @@ static bool need_update(int cpu)
 		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
 
 		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
+#ifdef CONFIG_NUMA
+		BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 1);
+#endif
 		/*
 		 * The fast way of checking if there are any vmstat diffs.
 		 * This works because the diffs are byte sized items.
 		 */
 		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
 			return true;
-
+#ifdef CONFIG_NUMA
+		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_ZONE_NUMA_STAT_ITEMS))
+			return true;
+#endif
 	}
 	return false;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread


* [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15  8:45 ` Kemi Wang
@ 2017-08-15  8:45   ` Kemi Wang
  -1 siblings, 0 replies; 41+ messages in thread
From: Kemi Wang @ 2017-08-15  8:45 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner
  Cc: Dave, Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel, Kemi Wang

There is significant cache-bouncing overhead when the zone counters
(NUMA-associated counters) are updated in parallel during multi-threaded
page allocation (pointed out by Dave Hansen).

This patch raises the NUMA counter threshold to a fixed size of 32765,
since a small threshold greatly increases how often the local per-cpu
counters are folded into the global counter. The per-cpu NUMA counts
(vm_numa_stat_diff[]) are added to zone->vm_numa_stat[] when a user
*reads* a NUMA counter, to eliminate deviation (suggested by Ying Huang).

The rationale is that these statistics counters don't need to be read
often, unlike other VM counters, so it is not a problem to use a large
threshold and make readers more expensive.
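
As an illustration (not part of this patch), a compile-time sanity check of
the presumed reason for picking 32765: the diff array is widened to s16
below, and the per-cpu value peaks at threshold + 1 just before it is
folded back, which must still fit in the type:

    #include <assert.h>
    #include <stdint.h>

    #define NUMA_STAT_THRESHOLD 32765

    /* peak value reached in __inc_zone_numa_state() before folding */
    static_assert(NUMA_STAT_THRESHOLD + 1 <= INT16_MAX,
                  "per-cpu NUMA diff must fit in an s16");
    /* value written back after folding (-overstep) */
    static_assert(-(NUMA_STAT_THRESHOLD >> 1) >= INT16_MIN,
                  "negative overstep must fit in an s16");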

With this patchset, we see a 26.6% drop in CPU cycles (537 --> 394) per
single page allocation and reclaim on Jesper's page_bench03 benchmark.

Benchmark provided by Jesper Dangaard Brouer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

 Threshold   CPU cycles    Throughput(88 threads)
     32          799         241760478
     64          640         301628829
     125         537         358906028 <==> system by default (base)
     256         468         412397590
     512         428         450550704
     4096        399         482520943
     20000       394         489009617
     30000       395         488017817
     32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
     N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/vmstat.h |  6 +++++-
 mm/vmstat.c            | 23 ++++++++++-------------
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b11ba7..7eaf0e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -282,8 +282,8 @@ struct per_cpu_pageset {
 	struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
 	s8 expire;
-	s8 numa_stat_threshold;
-	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
+	s16 numa_stat_threshold;
+	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
 #endif
 #ifdef CONFIG_SMP
 	s8 stat_threshold;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1e19379..d97cc34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
 	return x;
 }
 
-static inline unsigned long zone_numa_state(struct zone *zone,
+static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
 					enum zone_numa_stat_item item)
 {
 	long x = atomic_long_read(&zone->vm_numa_stat[item]);
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
 
 	return x;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5a7fa30..c7f50ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@
 
 #include "internal.h"
 
+#define NUMA_STAT_THRESHOLD  32765
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
@@ -196,7 +198,7 @@ void refresh_zone_stat_thresholds(void)
 							= threshold;
 #ifdef CONFIG_NUMA
 			per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
-							= threshold;
+							= NUMA_STAT_THRESHOLD;
 #endif
 			/* Base nodestat threshold on the largest populated zone. */
 			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
@@ -231,14 +233,9 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_online_cpu(cpu) {
+		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
-#ifdef CONFIG_NUMA
-			per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
-							= threshold;
-#endif
-		}
 	}
 }
 
@@ -872,13 +869,13 @@ void __inc_zone_numa_state(struct zone *zone,
 				 enum zone_numa_stat_item item)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	s8 __percpu *p = pcp->vm_numa_stat_diff + item;
-	s8 v, t;
+	s16 __percpu *p = pcp->vm_numa_stat_diff + item;
+	s16 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->numa_stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s16 overstep = t >> 1;
 
 		zone_numa_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
@@ -914,7 +911,7 @@ unsigned long sum_zone_node_numa_state(int node,
 	unsigned long count = 0;
 
 	for (i = 0; i < MAX_NR_ZONES; i++)
-		count += zone_numa_state(zones + i, item);
+		count += zone_numa_state_snapshot(zones + i, item);
 
 	return count;
 }
@@ -1536,7 +1533,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
 		seq_printf(m, "\n      %-12s %lu",
 				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-				zone_numa_state(zone, i));
+				zone_numa_state_snapshot(zone, i));
 #endif
 
 	seq_printf(m, "\n  pagesets");
@@ -1795,7 +1792,7 @@ static bool need_update(int cpu)
 
 		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
 #ifdef CONFIG_NUMA
-		BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 1);
+		BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
 #endif
 		/*
 		 * The fast way of checking if there are any vmstat diffs.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread


* Re: [PATCH 1/2] mm: Change the call sites of numa statistics items
  2017-08-15  8:45   ` Kemi Wang
@ 2017-08-15  9:49     ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-08-15  9:49 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel

On Tue, Aug 15, 2017 at 04:45:35PM +0800, Kemi Wang wrote:
> In this patch, NUMA statistics are separated from the zone statistics
> framework, and all call sites of NUMA stats are changed to use
> numa-stats-specific functions. There is no functional change except that
> the NUMA stats values are shown after the zone page stats, and the NUMA
> stats threshold is shown after the pcp threshold when users *read* the
> zone info.
> 
> E.g. cat /proc/zoneinfo
>     ***Base***                           ***With this patch***
> nr_free_pages 3976                         nr_free_pages 3976
> nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
> nr_zone_active_anon 0                      nr_zone_active_anon 0
> nr_zone_inactive_file 0                    nr_zone_inactive_file 0
> nr_zone_active_file 0                      nr_zone_active_file 0
> nr_zone_unevictable 0                      nr_zone_unevictable 0
> nr_zone_write_pending 0                    nr_zone_write_pending 0
> nr_mlock     0                             nr_mlock     0
> nr_page_table_pages 0                      nr_page_table_pages 0
> nr_kernel_stack 0                          nr_kernel_stack 0
> nr_bounce    0                             nr_bounce    0
> nr_zspages   0                             nr_zspages   0
> numa_hit 0                                *nr_free_cma  0*
> numa_miss 0                                numa_hit     0
> numa_foreign 0                             numa_miss    0
> numa_interleave 0                          numa_foreign 0
> numa_local   0                             numa_interleave 0
> numa_other   0                             numa_local   0
> *nr_free_cma 0*                            numa_other 0
>     ...                                        ...
> vm stats threshold: 10                     vm stats threshold: 10
>     ...                                   *vm numa stats threshold: 10*
>                                                ...
> 
> The next patch updates the numa stats counter size and threshold.
> 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  drivers/base/node.c    |  22 ++++---
>  include/linux/mmzone.h |  25 +++++---
>  include/linux/vmstat.h |  29 +++++++++
>  mm/page_alloc.c        |  10 +--
>  mm/vmstat.c            | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 227 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index d8dc830..12080c6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
>  		       "interleave_hit %lu\n"
>  		       "local_node %lu\n"
>  		       "other_node %lu\n",
> -		       sum_zone_node_page_state(dev->id, NUMA_HIT),
> -		       sum_zone_node_page_state(dev->id, NUMA_MISS),
> -		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
> -		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
> -		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
> -		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
> +		       sum_zone_node_numa_state(dev->id, NUMA_HIT),
> +		       sum_zone_node_numa_state(dev->id, NUMA_MISS),
> +		       sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
> +		       sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
> +		       sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
> +		       sum_zone_node_numa_state(dev->id, NUMA_OTHER));
>  }

The names are very similar and it would be preferred if the names were
visually different like sum_zone_numa_stat() which is hard to confuse with
the zone stat fields.

>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>  
> @@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>  			     sum_zone_node_page_state(nid, i));
>  
> -	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +#ifdef CONFIG_NUMA
> +	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> +			     sum_zone_node_numa_state(nid, i));
> +#endif

Similarly for NR_VM_ZONE_NUMA_STAT_ITEMS: it's too similar to
NR_VM_NODE_STAT_ITEMS.

> +
> +	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +		n += sprintf(buf+n, "%s %lu\n",
> +			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> +			     NR_VM_ZONE_NUMA_STAT_ITEMS],
>  			     node_page_state(pgdat, i));
>  
>  	return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc14b8b..0b11ba7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -114,6 +114,20 @@ struct zone_padding {
>  #define ZONE_PADDING(name)
>  #endif
>  
> +#ifdef CONFIG_NUMA
> +enum zone_numa_stat_item {
> +	NUMA_HIT,		/* allocated in intended node */
> +	NUMA_MISS,		/* allocated in non intended node */
> +	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> +	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> +	NUMA_LOCAL,		/* allocation from local node */
> +	NUMA_OTHER,		/* allocation from other node */
> +	NR_VM_ZONE_NUMA_STAT_ITEMS
> +};
> +#else
> +#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
> +#endif
> +
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> @@ -132,14 +146,6 @@ enum zone_stat_item {
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>  	NR_ZSPAGES,		/* allocated in zsmalloc */
>  #endif
> -#ifdef CONFIG_NUMA
> -	NUMA_HIT,		/* allocated in intended node */
> -	NUMA_MISS,		/* allocated in non intended node */
> -	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> -	NUMA_LOCAL,		/* allocation from local node */
> -	NUMA_OTHER,		/* allocation from other node */
> -#endif
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
>  
> @@ -276,6 +282,8 @@ struct per_cpu_pageset {
>  	struct per_cpu_pages pcp;
>  #ifdef CONFIG_NUMA
>  	s8 expire;
> +	s8 numa_stat_threshold;
> +	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>  #endif
>  #ifdef CONFIG_SMP
>  	s8 stat_threshold;

Ok, this slightly increases the size of per_cpu_pageset due to
numa_stat_threshold. The structure occupies 2 cache lines before and still
occupies 2 cache lines afterwards, so that is ok, but consider hard-coding
the value. The locality stats are never used as part of a decision made by
the kernel, and they get summed unconditionally when reading proc. There is
little benefit to tuning that threshold at all, and there should be a small
performance gain if the field is removed because the threshold becomes a
compile-time constant.
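
For illustration, a rough sketch of what the hard-coded variant could look
like (this reuses the NUMA_STAT_THRESHOLD name that patch 2/2 introduces,
but with a placeholder value that still fits the existing s8 diff; it is an
assumption, not the posted code):

/* hypothetical compile-time constant, mirroring the current runtime default */
#define NUMA_STAT_THRESHOLD	125

void __inc_zone_numa_state(struct zone *zone, enum zone_numa_stat_item item)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	s8 __percpu *p = pcp->vm_numa_stat_diff + item;
	s8 v;

	v = __this_cpu_inc_return(*p);
	if (unlikely(v > NUMA_STAT_THRESHOLD)) {
		s8 overstep = NUMA_STAT_THRESHOLD >> 1;

		/* fold the local diff into the global zone counter */
		zone_numa_state_add(v + overstep, zone, item);
		__this_cpu_write(*p, -overstep);
	}
}

The per-cpu numa_stat_threshold field and its update in
refresh_zone_stat_thresholds() could then be dropped entirely.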

The rest of the patch is mostly mechanical.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15  8:45   ` Kemi Wang
@ 2017-08-15  9:58     ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-08-15  9:58 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel

On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>  Threshold   CPU cycles    Throughput(88 threads)
>      32          799         241760478
>      64          640         301628829
>      125         537         358906028 <==> system by default (base)
>      256         468         412397590
>      512         428         450550704
>      4096        399         482520943
>      20000       394         489009617
>      30000       395         488017817
>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> ---
>  include/linux/mmzone.h |  4 ++--
>  include/linux/vmstat.h |  6 +++++-
>  mm/vmstat.c            | 23 ++++++++++-------------
>  3 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0b11ba7..7eaf0e8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>  	struct per_cpu_pages pcp;
>  #ifdef CONFIG_NUMA
>  	s8 expire;
> -	s8 numa_stat_threshold;
> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> +	s16 numa_stat_threshold;
> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];

I'm fairly sure this pushes the size of that structure into the next
cache line which is not welcome.

vm_numa_stat_diff is an always incrementing field. How much do you gain
if this becomes a u8 and any code that deals with negative values is
removed? That would double the threshold without consuming another cache line.

Furthermore, the stats in question are only ever incremented by one.
That means that any calculation related to overstep can be removed and the
code special-cased on the fact that it never overshoots the threshold by
more than 1. That potentially removes code that is required for other stats
but not locality stats. This may give enough savings to avoid moving to s16.

Very broadly speaking, I like what you're doing but I would like to see
more work on reducing any unnecessary code in that path (such as dealing
with oversteps for single increments) and treating an increase in cache
footprint only as a very last resort.
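
A rough sketch of the layout this points at (an assumption for
illustration, not the posted patch), combining the u8 idea with dropping
the per-cpu threshold field:

#ifdef CONFIG_NUMA
	s8 expire;
	/*
	 * Unsigned because the locality counters only count up; they are
	 * folded into zone->vm_numa_stat against a compile-time limit
	 * (e.g. U8_MAX) before they can wrap, so no negative-value
	 * handling is needed and per_cpu_pageset keeps its current size.
	 */
	u8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
#endif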

>  #endif
>  #ifdef CONFIG_SMP
>  	s8 stat_threshold;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 1e19379..d97cc34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>  	return x;
>  }
>  
> -static inline unsigned long zone_numa_state(struct zone *zone,
> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>  					enum zone_numa_stat_item item)
>  {
>  	long x = atomic_long_read(&zone->vm_numa_stat[item]);
> +	int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>  
>  	return x;
>  }

This does not appear to be related to the current patch. It either
should be merged with the previous patch or stand on its own.

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 5a7fa30..c7f50ed 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -30,6 +30,8 @@
>  
>  #include "internal.h"
>  
> +#define NUMA_STAT_THRESHOLD  32765
> +

This should be expressed in terms of the type and not a hard-coded value.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-15  8:45 ` Kemi Wang
@ 2017-08-15 10:36   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 41+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-15 10:36 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner, Dave,
	Andi Kleen, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel, brouer

On Tue, 15 Aug 2017 16:45:34 +0800
Kemi Wang <kemi.wang@intel.com> wrote:

> Each page allocation updates a set of per-zone statistics with a call to
> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
                                             ^^^^^^ should be "summit"
> source of overhead in the page allocator and are very rarely consumed. This
> significant overhead in cache bouncing caused by zone counters (NUMA
> associated counters) update in parallel in multi-threaded page allocation
> (pointed out by Dave Hansen).

Hi Kemi

Thanks a lot for following up on this work. A link to the MM summit slides:
 http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

> To mitigate this overhead, this patchset separates NUMA statistics from
> zone statistics framework, and update NUMA counter threshold to a fixed
> size of 32765, as a small threshold greatly increases the update frequency
> of the global counter from local per cpu counter (suggested by Ying Huang).
> The rationality is that these statistics counters don't need to be read
> often, unlike other VM counters, so it's not a problem to use a large
> threshold and make readers more expensive.
> 
> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
> for per single page allocation and reclaim on Jesper's page_bench03
> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
> statistics with little end-user-visible effects (see the first patch for
> details), except that the number of NUMA items in each cpu
> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
> the value of NUMA counter to eliminate deviation.

I'm very happy to see that you found my kernel module for benchmarking useful :-)

> I did an experiment of single page allocation and reclaim concurrently
> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
> (88 processors with 126G memory) with different size of threshold of pcp
> counter.
> 
> Benchmark provided by Jesper D Broucer(increase loop times to 10000000):
                                 ^^^^^^^
You mis-spelled my last name, it is "Brouer".

> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> 
>    Threshold   CPU cycles    Throughput(88 threads)
>       32        799         241760478
>       64        640         301628829
>       125       537         358906028 <==> system by default
>       256       468         412397590
>       512       428         450550704
>       4096      399         482520943
>       20000     394         489009617
>       30000     395         488017817
>       32765     394(-26.6%) 488932078(+36.2%) <==> with this patchset
>       N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> 
> Kemi Wang (2):
>   mm: Change the call sites of numa statistics items
>   mm: Update NUMA counter threshold size
> 
>  drivers/base/node.c    |  22 ++++---
>  include/linux/mmzone.h |  25 +++++---
>  include/linux/vmstat.h |  33 ++++++++++
>  mm/page_alloc.c        |  10 +--
>  mm/vmstat.c            | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 227 insertions(+), 25 deletions(-)
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15  9:58     ` Mel Gorman
@ 2017-08-15 16:55       ` Tim Chen
  -1 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2017-08-15 16:55 UTC (permalink / raw)
  To: Mel Gorman, Kemi Wang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel

On 08/15/2017 02:58 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>>  Threshold   CPU cycles    Throughput(88 threads)
>>      32          799         241760478
>>      64          640         301628829
>>      125         537         358906028 <==> system by default (base)
>>      256         468         412397590
>>      512         428         450550704
>>      4096        399         482520943
>>      20000       394         489009617
>>      30000       395         488017817
>>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  include/linux/vmstat.h |  6 +++++-
>>  mm/vmstat.c            | 23 ++++++++++-------------
>>  3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>>  	struct per_cpu_pages pcp;
>>  #ifdef CONFIG_NUMA
>>  	s8 expire;
>> -	s8 numa_stat_threshold;
>> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> +	s16 numa_stat_threshold;
>> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> 
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
> 
> vm_numa_stat_diff is an always incrementing field. How much do you gain
> if this becomes a u8 code and remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.

Doubling the threshold and counter size will help, but not as much
as raising them above the u8 limit, as seen in Kemi's data:

      125         537         358906028 <==> system by default (base)
      256         468         412397590
      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset

For small systems making them u8 makes sense.  For larger ones the
frequent local counter overflow into the global counter still
causes a lot of cache bouncing.  Kemi can perhaps collect some data
to see what the gain is from making the counters u8.
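
As a rough back-of-the-envelope comparison (an estimate, not measured
data): with a u8 counter each CPU folds its local count into the shared
atomic roughly once every ~255 single-page allocations, whereas an s16
threshold of 32765 folds roughly once every ~32K allocations, i.e. about
128 times fewer updates to the contended global cache line per CPU.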

> 
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calcluation related to overlap can be removed and
> special cased that it'll never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
> 
> Very broadly speaking, I like what you're doing but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat incrasing the cache footprint
> only as a very last resort.
> 
>>  #endif
>>  #ifdef CONFIG_SMP
>>  	s8 stat_threshold;
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1e19379..d97cc34 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>>  	return x;
>>  }
>>  
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>>  					enum zone_numa_stat_item item)
>>  {
>>  	long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>  
>>  	return x;
>>  }
> 
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
> 
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define NUMA_STAT_THRESHOLD  32765
>> +
> 
> This should be expressed in terms of the type and not a hard-coded value.
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15 16:55       ` Tim Chen
@ 2017-08-15 17:30         ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-08-15 17:30 UTC (permalink / raw)
  To: Tim Chen
  Cc: Kemi Wang, Andrew Morton, Michal Hocko, Johannes Weiner, Dave,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel

On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >>  Threshold   CPU cycles    Throughput(88 threads)
> >>      32          799         241760478
> >>      64          640         301628829
> >>      125         537         358906028 <==> system by default (base)
> >>      256         468         412397590
> >>      512         428         450550704
> >>      4096        399         482520943
> >>      20000       394         489009617
> >>      30000       395         488017817
> >>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> >> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> >> Suggested-by: Ying Huang <ying.huang@intel.com>
> >> ---
> >>  include/linux/mmzone.h |  4 ++--
> >>  include/linux/vmstat.h |  6 +++++-
> >>  mm/vmstat.c            | 23 ++++++++++-------------
> >>  3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >>  	struct per_cpu_pages pcp;
> >>  #ifdef CONFIG_NUMA
> >>  	s8 expire;
> >> -	s8 numa_stat_threshold;
> >> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> +	s16 numa_stat_threshold;
> >> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> > 
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> > 
> > vm_numa_stat_diff is an always incrementing field. How much do you gain
> > if this becomes a u8 code and remove any code that deals with negative
> > values? That would double the threshold without consuming another cache line.
> 
> Doubling the threshold and counter size will help, but not as much
> as making them above u8 limit as seen in Kemi's data:
> 
>       125         537         358906028 <==> system by default (base)
>       256         468         412397590
>       32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
> 
> For small system making them u8 makes sense.  For larger ones the
> frequent local counter overflow into the global counter still
> causes a lot of cache bounce.  Kemi can perhaps collect some data
> to see what is the gain from making the counters u8. 
> 

The same comments hold. The increase of a cache line is undesirable but
there are other places where the overall cost can be reduced by special
casing based on how this counter is used (always incrementing by one).
It would be preferred if those were addressed first, to see how close that
gets to the performance of doubling the necessary storage for the counter.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15 17:30         ` Mel Gorman
@ 2017-08-15 17:51           ` Tim Chen
  -1 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2017-08-15 17:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kemi Wang, Andrew Morton, Michal Hocko, Johannes Weiner, Dave,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel

On 08/15/2017 10:30 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:

>>
>> Doubling the threshold and counter size will help, but not as much
>> as making them above u8 limit as seen in Kemi's data:
>>
>>       125         537         358906028 <==> system by default (base)
>>       256         468         412397590
>>       32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>
>> For small system making them u8 makes sense.  For larger ones the
>> frequent local counter overflow into the global counter still
>> causes a lot of cache bounce.  Kemi can perhaps collect some data
>> to see what is the gain from making the counters u8. 
>>
> 
> The same comments hold. The increase of a cache line is undesirable but
> there are other places where the overall cost can be reduced by special
> casing based on how this counter is used (always incrementing by one).

Can you be more explicit about what optimization you suggest here and what
changes to inc/dec_zone_page_state?  It seems to me that we will still
overflow the local counter with the same frequency unless the threshold and
counter size are changed.

Thanks.

Tim

> It would be preferred if those were addressed to see how close that gets
> to the same performance of doubling the necessary storage for a counter.
> doubling the storage 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15 17:51           ` Tim Chen
@ 2017-08-15 19:05             ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-08-15 19:05 UTC (permalink / raw)
  To: Tim Chen
  Cc: Kemi Wang, Andrew Morton, Michal Hocko, Johannes Weiner, Dave,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel

On Tue, Aug 15, 2017 at 10:51:21AM -0700, Tim Chen wrote:
> On 08/15/2017 10:30 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
> 
> >>
> >> Doubling the threshold and counter size will help, but not as much
> >> as making them above u8 limit as seen in Kemi's data:
> >>
> >>       125         537         358906028 <==> system by default (base)
> >>       256         468         412397590
> >>       32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>
> >> For small system making them u8 makes sense.  For larger ones the
> >> frequent local counter overflow into the global counter still
> >> causes a lot of cache bounce.  Kemi can perhaps collect some data
> >> to see what is the gain from making the counters u8. 
> >>
> > 
> > The same comments hold. The increase of a cache line is undesirable but
> > there are other places where the overall cost can be reduced by special
> > casing based on how this counter is used (always incrementing by one).
> 
> Can you be more explicit of what optimization you suggest here and changes
> to inc/dec_zone_page_state?  Seems to me like we will still overflow
> the local counter with the same frequency unless the threshold and
> counter size is changed.

One of the helpers added is __inc_zone_numa_state which doesn't have a
symmetrical __dec_zone_numa_state because the counter is always
incrementing. Because of this, there is little or no motivation to
update the global value by threshold >> 1 because with both inc/dec, you
want to avoid a corner case whereby a loop of inc/dec would do an
overflow every time. Instead, you can always apply the full threshold
and clear it which is fewer operations and halves the frequency at which
the global value needs to be updated.
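
A minimal sketch of that fold, assuming the counter is only ever
incremented by one (illustrative only, not the code that was posted):

	v = __this_cpu_inc_return(*p);
	if (unlikely(v > NUMA_STAT_THRESHOLD)) {
		/*
		 * There is no dec path, so there is no inc/dec ping-pong
		 * around the threshold to guard against: push the whole
		 * local count into the global counter and restart at
		 * zero instead of writing back -overstep.
		 */
		zone_numa_state_add(v, zone, item);
		__this_cpu_write(*p, 0);
	}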

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] mm: Change the call sites of numa statistics items
  2017-08-15  9:49     ` Mel Gorman
@ 2017-08-16  2:12       ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-16  2:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel



On 2017-08-15 17:49, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:35PM +0800, Kemi Wang wrote:
>> In this patch,  NUMA statistics is separated from zone statistics
>> framework, all the call sites of NUMA stats are changed to use
>> numa-stats-specific functions, it does not have any functionality change
>> except that the value of NUMA stats is shown behind zone page stats, and
>> the threshold size of NUMA stats is shown behind pcp threshold when users
>> *read* the zone info.
>>
>> E.g. cat /proc/zoneinfo
>>     ***Base***                           ***With this patch***
>> nr_free_pages 3976                         nr_free_pages 3976
>> nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
>> nr_zone_active_anon 0                      nr_zone_active_anon 0
>> nr_zone_inactive_file 0                    nr_zone_inactive_file 0
>> nr_zone_active_file 0                      nr_zone_active_file 0
>> nr_zone_unevictable 0                      nr_zone_unevictable 0
>> nr_zone_write_pending 0                    nr_zone_write_pending 0
>> nr_mlock     0                             nr_mlock     0
>> nr_page_table_pages 0                      nr_page_table_pages 0
>> nr_kernel_stack 0                          nr_kernel_stack 0
>> nr_bounce    0                             nr_bounce    0
>> nr_zspages   0                             nr_zspages   0
>> numa_hit 0                                *nr_free_cma  0*
>> numa_miss 0                                numa_hit     0
>> numa_foreign 0                             numa_miss    0
>> numa_interleave 0                          numa_foreign 0
>> numa_local   0                             numa_interleave 0
>> numa_other   0                             numa_local   0
>> *nr_free_cma 0*                            numa_other 0
>>     ...                                        ...
>> vm stats threshold: 10                     vm stats threshold: 10
>>     ...                                   *vm numa stats threshold: 10*
>>                                                ...
>>
>> The next patch updates the numa stats counter size and threshold.
>>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  drivers/base/node.c    |  22 ++++---
>>  include/linux/mmzone.h |  25 +++++---
>>  include/linux/vmstat.h |  29 +++++++++
>>  mm/page_alloc.c        |  10 +--
>>  mm/vmstat.c            | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  5 files changed, 227 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index d8dc830..12080c6 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
>>  		       "interleave_hit %lu\n"
>>  		       "local_node %lu\n"
>>  		       "other_node %lu\n",
>> -		       sum_zone_node_page_state(dev->id, NUMA_HIT),
>> -		       sum_zone_node_page_state(dev->id, NUMA_MISS),
>> -		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
>> -		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
>> -		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
>> -		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
>> +		       sum_zone_node_numa_state(dev->id, NUMA_HIT),
>> +		       sum_zone_node_numa_state(dev->id, NUMA_MISS),
>> +		       sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
>> +		       sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
>> +		       sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
>> +		       sum_zone_node_numa_state(dev->id, NUMA_OTHER));
>>  }
> 
> The names are very similar and it would be preferred if the names were
> visually different like sum_zone_numa_stat() which is hard to confuse with
> the zone stat fields.
> 
Agreed. Thanks for your suggestion.

>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>  
>> @@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>>  			     sum_zone_node_page_state(nid, i));
>>  
>> -	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> +#ifdef CONFIG_NUMA
>> +	for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
>>  		n += sprintf(buf+n, "%s %lu\n",
>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> +			     sum_zone_node_numa_state(nid, i));
>> +#endif
> 
> Similarly with NR_VM_ZONE_NUMA_STAT_ITEMS, it's too similar to
> NR_VM_NODE_STAT_ITEMS.
How about NR_VM_NUMA_STAT_ITEMS? Does anyone have a better idea?
 
>> +
>> +	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> +		n += sprintf(buf+n, "%s %lu\n",
>> +			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
>> +			     NR_VM_ZONE_NUMA_STAT_ITEMS],
>>  			     node_page_state(pgdat, i));
>>  
>>  	return n;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index fc14b8b..0b11ba7 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -114,6 +114,20 @@ struct zone_padding {
>>  #define ZONE_PADDING(name)
>>  #endif
>>  
>> +#ifdef CONFIG_NUMA
>> +enum zone_numa_stat_item {
>> +	NUMA_HIT,		/* allocated in intended node */
>> +	NUMA_MISS,		/* allocated in non intended node */
>> +	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
>> +	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
>> +	NUMA_LOCAL,		/* allocation from local node */
>> +	NUMA_OTHER,		/* allocation from other node */
>> +	NR_VM_ZONE_NUMA_STAT_ITEMS
>> +};
>> +#else
>> +#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
>> +#endif
>> +
>>  enum zone_stat_item {
>>  	/* First 128 byte cacheline (assuming 64 bit words) */
>>  	NR_FREE_PAGES,
>> @@ -132,14 +146,6 @@ enum zone_stat_item {
>>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>>  	NR_ZSPAGES,		/* allocated in zsmalloc */
>>  #endif
>> -#ifdef CONFIG_NUMA
>> -	NUMA_HIT,		/* allocated in intended node */
>> -	NUMA_MISS,		/* allocated in non intended node */
>> -	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
>> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
>> -	NUMA_LOCAL,		/* allocation from local node */
>> -	NUMA_OTHER,		/* allocation from other node */
>> -#endif
>>  	NR_FREE_CMA_PAGES,
>>  	NR_VM_ZONE_STAT_ITEMS };
>>  
>> @@ -276,6 +282,8 @@ struct per_cpu_pageset {
>>  	struct per_cpu_pages pcp;
>>  #ifdef CONFIG_NUMA
>>  	s8 expire;
>> +	s8 numa_stat_threshold;
>> +	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>>  #endif
>>  #ifdef CONFIG_SMP
>>  	s8 stat_threshold;
> 
> OK, this slightly increases the size of the per_cpu_pageset due to
> numa_stat_threshold. The structure occupies 2 cache lines and still occupies
> 2 cache lines afterwards so that is ok but consider hard-coding the value
> of it. The locality stats are never used as part of a decision made by the
> kernel and they get summed when reading proc unconditionally. There is little
> benefit to tuning that threshold at all and there should be a very small
> performance gain if it's removed because it'll be a compile-time constant.
> 
Agreed, thanks. I will remove numa_stat_threshold in the next version.

> The rest of the patch is mostly mechanical.
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15  9:58     ` Mel Gorman
@ 2017-08-16  2:31       ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-16  2:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel


>>  
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>>  					enum zone_numa_stat_item item)
>>  {
>>  	long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>  
>>  	return x;
>>  }
> 
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
> 
OK. I can move it to a separate patch if that does not make anyone unhappy,
since it is not graceful to introduce any functional change in the first patch.

>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define NUMA_STAT_THRESHOLD  32765
>> +
> 
> This should be expressed in terms of the type and not a hard-coded value.
> 
OK, thanks. I will do that.
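
For example, something along these lines (a sketch only, assuming the
per-cpu diff ends up as a u16; the exact expression in the next revision
may differ):

    /*
     * Derive the threshold from the counter type instead of hard-coding
     * 32765, leaving a little headroom below the type maximum.
     */
    #define NUMA_STAT_THRESHOLD (U16_MAX - 2)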

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15 16:55       ` Tim Chen
@ 2017-08-16  3:02         ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-16  3:02 UTC (permalink / raw)
  To: Tim Chen, Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel



On 2017年08月16日 00:55, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
>> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:

>> I'm fairly sure this pushes the size of that structure into the next
>> cache line which is not welcome.
>>>> vm_numa_stat_diff is an always incrementing field. How much do you gain
>> if this becomes a u8 code and remove any code that deals with negative
>> values? That would double the threshold without consuming another cache line.
> 
> Doubling the threshold and counter size will help, but not as much
> as making them above u8 limit as seen in Kemi's data:
> 
>       125         537         358906028 <==> system by default (base)
>       256         468         412397590
>       32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
> 
> For small system making them u8 makes sense.  For larger ones the
> frequent local counter overflow into the global counter still
> causes a lot of cache bounce.  Kemi can perhaps collect some data
> to see what is the gain from making the counters u8. 
> 
Tim, thanks for your answer. That is what I wanted to clarify.

Also, please note that in the current code path the per-cpu local counter
(e.g. vm_numa_stat_diff[]) is reset to -threshold/2 each time the per-zone
counter is updated. This weakens the benefit of changing s8 to u8 in this case.
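
For reference, this is roughly what the existing generic zone-counter path
does today (simplified from __inc_zone_state() in mm/vmstat.c):

    v = __this_cpu_inc_return(*p);
    t = __this_cpu_read(pcp->stat_threshold);
    if (unlikely(v > t)) {
            s8 overstep = t >> 1;

            zone_page_state_add(v + overstep, zone, item);
            __this_cpu_write(*p, -overstep);    /* local counter restarts at -t/2 */
    }
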
>>
>> Furthermore, the stats in question are only ever incremented by one.
>> That means that any calculation related to overlap can be removed and
>> special cased that it'll never overlap by more than 1. That potentially
>> removes code that is required for other stats but not locality stats.
>> This may give enough savings to avoid moving to s16.
>>
>> Very broadly speaking, I like what you're doing but I would like to see
>> more work on reducing any unnecessary code in that path (such as dealing
>> with overlaps for single increments) and treat increasing the cache footprint
>> only as a very last resort.
>>
Agreed. I will think about it more.

>>>  #endif
>>>  #ifdef CONFIG_SMP
>>>  	s8 stat_threshold;
>>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>>> index 1e19379..d97cc34 100644
>>> --- a/include/linux/vmstat.h
>>> +++ b/include/linux/vmstat.h
>>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>>>  	return x;
>>>  }
>>>  

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-15 10:36   ` Jesper Dangaard Brouer
@ 2017-08-16  3:23     ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-16  3:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner, Dave,
	Andi Kleen, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel



On 2017年08月15日 18:36, Jesper Dangaard Brouer wrote:
> On Tue, 15 Aug 2017 16:45:34 +0800
> Kemi Wang <kemi.wang@intel.com> wrote:
> 
>> Each page allocation updates a set of per-zone statistics with a call to
>> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
>                                              ^^^^^^ should be "summit"

Hi, Jesper
   Thanks for reporting this issue and providing the benchmark to test raw
performance of page allocation. It is really quite helpful to figure out the
root cause.
>> source of overhead in the page allocator and are very rarely consumed. This
>> significant overhead in cache bouncing caused by zone counters (NUMA
>> associated counters) update in parallel in multi-threaded page allocation
>> (pointed out by Dave Hansen).
> 
> Hi Kemi
> 
> Thanks a lot for following up on this work. A link to the MM summit slides:
>  http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
> 
Thanks for adding the link here. I should have done that in this cover letter.

>> To mitigate this overhead, this patchset separates NUMA statistics from
>> zone statistics framework, and update NUMA counter threshold to a fixed
>> size of 32765, as a small threshold greatly increases the update frequency
>> of the global counter from local per cpu counter (suggested by Ying Huang).
>> The rationality is that these statistics counters don't need to be read
>> often, unlike other VM counters, so it's not a problem to use a large
>> threshold and make readers more expensive.
>>
>> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
>> for per single page allocation and reclaim on Jesper's page_bench03
>> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
>> statistics with little end-user-visible effects (see the first patch for
>> details), except that the number of NUMA items in each cpu
>> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
>> the value of NUMA counter to eliminate deviation.
> 
> I'm very happy to see that you found my kernel module for benchmarking useful :-)
> 
>> I did an experiment of single page allocation and reclaim concurrently
>> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
>> (88 processors with 126G memory) with different size of threshold of pcp
>> counter.
>>
>> Benchmark provided by Jesper D Broucer(increase loop times to 10000000):
>                                  ^^^^^^^
> You mis-spelled my last name, it is "Brouer".
> 
Dear Jesper, I am so sorry about it, please forgive me :)

>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>>
>>    Threshold   CPU cycles    Throughput(88 threads)
>>       32        799         241760478
>>       64        640         301628829
>>       125       537         358906028 <==> system by default
>>       256       468         412397590
>>       512       428         450550704
>>       4096      399         482520943
>>       20000     394         489009617
>>       30000     395         488017817
>>       32765     394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>       N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Kemi Wang (2):
>>   mm: Change the call sites of numa statistics items
>>   mm: Update NUMA counter threshold size
>>
>>  drivers/base/node.c    |  22 ++++---
>>  include/linux/mmzone.h |  25 +++++---
>>  include/linux/vmstat.h |  33 ++++++++++
>>  mm/page_alloc.c        |  10 +--
>>  mm/vmstat.c            | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  5 files changed, 227 insertions(+), 25 deletions(-)
>>
> 
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-15  9:58     ` Mel Gorman
@ 2017-08-22  3:21       ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-22  3:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel



On 2017年08月15日 17:58, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>>  Threshold   CPU cycles    Throughput(88 threads)
>>      32          799         241760478
>>      64          640         301628829
>>      125         537         358906028 <==> system by default (base)
>>      256         468         412397590
>>      512         428         450550704
>>      4096        399         482520943
>>      20000       394         489009617
>>      30000       395         488017817
>>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  include/linux/vmstat.h |  6 +++++-
>>  mm/vmstat.c            | 23 ++++++++++-------------
>>  3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>>  	struct per_cpu_pages pcp;
>>  #ifdef CONFIG_NUMA
>>  	s8 expire;
>> -	s8 numa_stat_threshold;
>> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> +	s16 numa_stat_threshold;
>> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> 
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
> 
Hi Mel
  I am refreshing this patch. Would you please be more explicit about what
"that structure" refers to?
  If you mean "struct per_cpu_pageset": for a 64-bit machine the structure
still occupies two cache lines after extending s8 to s16/u16, so that should
not be a problem. For a 32-bit machine, we probably do not need to extend
the size of vm_numa_stat_diff[] since 32-bit kernels are rarely used on large
NUMA systems and s8/u8 is large enough there; in that case, we can keep
"struct per_cpu_pageset" the same size.

 If you mean "s16 vm_numa_stat_diff[]" and want to keep it within a single
cache line, we can probably add some padding after "s8 expire" to achieve it.

Again, thanks for your comments; they help make this patch more graceful.
> vm_numa_stat_diff is an always incrementing field. How much do you gain
> if this becomes a u8 code and remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.
> 
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calculation related to overlap can be removed and
> special cased that it'll never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
> 
> Very broadly speaking, I like what you're doing but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat increasing the cache footprint
> only as a very last resort.
> 
>> 

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-22  3:21       ` kemi
@ 2017-08-22  8:39         ` Mel Gorman
  -1 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-08-22  8:39 UTC (permalink / raw)
  To: kemi
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel

On Tue, Aug 22, 2017 at 11:21:31AM +0800, kemi wrote:
> 
> 
> On 2017年08月15日 17:58, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >>  Threshold   CPU cycles    Throughput(88 threads)
> >>      32          799         241760478
> >>      64          640         301628829
> >>      125         537         358906028 <==> system by default (base)
> >>      256         468         412397590
> >>      512         428         450550704
> >>      4096        399         482520943
> >>      20000       394         489009617
> >>      30000       395         488017817
> >>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> >> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> >> Suggested-by: Ying Huang <ying.huang@intel.com>
> >> ---
> >>  include/linux/mmzone.h |  4 ++--
> >>  include/linux/vmstat.h |  6 +++++-
> >>  mm/vmstat.c            | 23 ++++++++++-------------
> >>  3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >>  	struct per_cpu_pages pcp;
> >>  #ifdef CONFIG_NUMA
> >>  	s8 expire;
> >> -	s8 numa_stat_threshold;
> >> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> +	s16 numa_stat_threshold;
> >> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> > 
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> > 
> Hi Mel
>   I am refreshing this patch. Would you pls be more explicit of what "that
> structure" indicates. 
>   If you mean "struct per_cpu_pageset", for 64 bits machine, this structure
> still occupies two caches line after extending s8 to s16/u16, that should
> not be a problem.

You're right, I was in error. I miscalculated badly initially. It still
fits in as expected.

> For 32 bits machine, we probably does not need to extend
> the size of vm_numa_stat_diff[] since 32 bits OS nearly not be used in large
> numa system, and s8/u8 is large enough for it, in this case, we can keep the 
> same size of "struct per_cpu_pageset".
> 

I don't believe it's worth the complexity of making this
bitness-specific. 32-bit takes penalties in other places and besides,
32-bit does not necessarily mean a change in cache line size.

Fortunately, I think you should still be able to gain a bit more with
some special casing the fact it's always incrementing and always do full
spill of the counters instead of half. If so, then using u16 instead of
s16 should also reduce the update frequency. However, if you find it's
too complex and the gain is too marginal then I'll ack without it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] mm: Update NUMA counter threshold size
  2017-08-22  8:39         ` Mel Gorman
  (?)
@ 2017-08-22  8:53         ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-22  8:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Dave, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM,
	Linux Kernel



On 2017年08月22日 16:39, Mel Gorman wrote:
> On Tue, Aug 22, 2017 at 11:21:31AM +0800, kemi wrote:
>>
>>
>> On 2017年08月15日 17:58, Mel Gorman wrote:
>>> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>>>>  Threshold   CPU cycles    Throughput(88 threads)
>>>>      32          799         241760478
>>>>      64          640         301628829
>>>>      125         537         358906028 <==> system by default (base)
>>>>      256         468         412397590
>>>>      512         428         450550704
>>>>      4096        399         482520943
>>>>      20000       394         489009617
>>>>      30000       395         488017817
>>>>      32765       394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>>>      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>>>
>>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
>>>> ---
>>>>  include/linux/mmzone.h |  4 ++--
>>>>  include/linux/vmstat.h |  6 +++++-
>>>>  mm/vmstat.c            | 23 ++++++++++-------------
>>>>  3 files changed, 17 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 0b11ba7..7eaf0e8 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>>>>  	struct per_cpu_pages pcp;
>>>>  #ifdef CONFIG_NUMA
>>>>  	s8 expire;
>>>> -	s8 numa_stat_threshold;
>>>> -	s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>>>> +	s16 numa_stat_threshold;
>>>> +	s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>>>
>>> I'm fairly sure this pushes the size of that structure into the next
>>> cache line which is not welcome.
>>>
>> Hi Mel
>>   I am refreshing this patch. Would you pls be more explicit of what "that
>> structure" indicates. 
>>   If you mean "struct per_cpu_pageset", for 64 bits machine, this structure
>> still occupies two caches line after extending s8 to s16/u16, that should
>> not be a problem.
> 
> You're right, I was in error. I miscalculated badly initially. It still
> fits in as expected.
> 
>> For 32 bits machine, we probably does not need to extend
>> the size of vm_numa_stat_diff[] since 32 bits OS nearly not be used in large
>> numa system, and s8/u8 is large enough for it, in this case, we can keep the 
>> same size of "struct per_cpu_pageset".
>>
> 
> I don't believe it's worth the complexity of making this
> bitness-specific. 32-bit takes penalties in other places and besides,
> 32-bit does not necessarily mean a change in cache line size.
> 
> Fortunately, I think you should still be able to gain a bit more with
> some special casing the fact it's always incrementing and always do full
> spill of the counters instead of half. If so, then using u16 instead of
> s16 should also reduce the update frequency. However, if you find it's
> too complex and the gain is too marginal then I'll ack without it.
> 

That's fine; it would not be too complex to change s16 to u16.
I will adopt it.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-15  8:45 ` Kemi Wang
@ 2017-08-22 21:22   ` Christopher Lameter
  -1 siblings, 0 replies; 41+ messages in thread
From: Christopher Lameter @ 2017-08-22 21:22 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner, Dave,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel

Can we simply get rid of the stats or make them configurable (off by
default)? I agree they are rarely used and have been rarely used in the past.

Maybe some instrumentation for perf etc. will allow
similar statistics these days? Thus it's possible to drop them?

The space in the pcp pageset is precious and we should strive to use no
more than a cacheline for the diffs.

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-22 21:22   ` Christopher Lameter
@ 2017-08-22 23:19     ` Andi Kleen
  -1 siblings, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2017-08-22 23:19 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Kemi Wang, Andrew Morton, Michal Hocko, Mel Gorman,
	Johannes Weiner, Dave, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Tim Chen, Linux MM, Linux Kernel

Christopher Lameter <cl@linux.com> writes:

> Can we simple get rid of the stats or make then configurable (off by
> defaut)? I agree they are rarely used and have been rarely used in the past.
>
> Maybe some instrumentation for perf etc will allow
> similar statistics these days? Thus its possible to drop them?
>
> The space in the pcp pageset is precious and we should strive to use no
> more than a cacheline for the diffs.

The statistics are useful and we need them sometimes. And more and more
runtime switches are a pain -- if you need them they would likely be
turned off. The key is just to make them cheap enough that they're not a
problem.

The only problem was that the standard vmstats, which are also optimized
for readers, are too expensive for this.

The motivation for the patch was that the frequent atomics
were proven to slow the allocator down, and Kemi's patch
fixed it and he has shown it with lots of data.

I don't really see the point of so much discussion about a single cache
line.

There are lots of cache lines used all over the VM. Why is this one
special? Adding one more shouldn't be that bad.

But there's no data at all that touching another cache line
here is a problem.

It's next to an already touched cache line, so it's highly
likely that a prefetcher would catch it anyways.

I can see the point of worrying about overall cache line footprint
("death of a thousand cuts"), but the right way to address problems like
this is to use a profiler on a realistic workload and systematically
look at the code that actually has cache misses. And I bet we would
find quite a few that could be easily avoided and have a real
payoff. It would really surprise me if it was this cache line.

But blocking real demonstrated improvements over a theoretical
cache line doesn't really help.

-Andi

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-22 21:22   ` Christopher Lameter
@ 2017-08-23  1:14     ` kemi
  -1 siblings, 0 replies; 41+ messages in thread
From: kemi @ 2017-08-23  1:14 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner, Dave,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel



On 2017年08月23日 05:22, Christopher Lameter wrote:
> Can we simple get rid of the stats or make then configurable (off by
> defaut)? I agree they are rarely used and have been rarely used in the past.
> 

I agree that we can make NUMA stats, as well as other rarely used stats
items, configurable. Perhaps we can introduce a general mechanism to hide
such unimportant stats (suggested by *Dave Hansen* initially); it works like this:

when performance is not important and when you want all tooling to work, you set:

	sysctl vm.strict_stats=1

but if you can tolerate some possible tool breakage and some decreased
counter precision, you can do:

	sysctl vm.strict_stats=0

What do you think of that? I can help implement it later.
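
For illustration only, a rough sketch of how such a knob could be wired up
in kernel/sysctl.c (the vm.strict_stats name and this wiring are entirely
hypothetical, not part of this patchset):

    int sysctl_vm_strict_stats __read_mostly = 1;    /* hypothetical global */

    /* entry added to the existing vm_table[]; the zero/one bounds already exist there */
    {
            .procname       = "strict_stats",
            .data           = &sysctl_vm_strict_stats,
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = proc_dointvec_minmax,
            .extra1         = &zero,
            .extra2         = &one,
    },

The statistics update path would then check sysctl_vm_strict_stats to decide
whether to maintain the counters precisely or take the cheap path.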

But it may not be a good idea to simply get rid of these stats altogether.

> Maybe some instrumentation for perf etc will allow
> similar statistics these days? Thus its possible to drop them?
> 
> The space in the pcp pageset is precious and we should strive to use no
> more than a cacheline for the diffs.
> 
> 

Andi has helped to explain it very clearly. Thanks very much.

For a 64-bit OS:
                                     base      with this patch (even including the NUMA stat threshold)
sizeof(struct per_cpu_pageset)        88              96
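
For reference, roughly how the per-cpu pageset ends up laid out with the NUMA
diffs split out; the field order and comments below are illustrative, not
copied verbatim from the patch:

struct per_cpu_pageset {
	struct per_cpu_pages pcp;	/* count/high/batch + pcp free lists, ~64 bytes */
#ifdef CONFIG_NUMA
	s8 expire;
	u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];	/* NUMA_HIT ... NUMA_OTHER, now 16 bits wide */
#endif
#ifdef CONFIG_SMP
	s8 stat_threshold;
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];		/* remaining zone items stay s8 */
#endif
};
/* 88 -> 96 bytes on 64-bit, i.e. still within two cache lines. */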

Copying the earlier discussion from another email thread here in case you missed it:

> Hi Mel,
>   I am refreshing this patch. Would you please be more explicit about what
> "that structure" indicates?
>   If you mean "struct per_cpu_pageset", on a 64-bit machine this structure
> still occupies two cache lines after extending s8 to s16/u16, so that should
> not be a problem.

You're right, I was in error. I miscalculated badly initially. It still
fits in as expected.

> For a 32-bit machine, we probably do not need to extend
> the size of vm_numa_stat_diff[], since a 32-bit OS is rarely used on a large
> NUMA system and s8/u8 is large enough for it; in that case, we can keep the
> same size of "struct per_cpu_pageset".
>

I don't believe it's worth the complexity of making this
bitness-specific. 32-bit takes penalties in other places and besides,
32-bit does not necessarily mean a change in cache line size.

Fortunately, I think you should still be able to gain a bit more by
special-casing the fact that the counters only ever increment and always
doing a full spill of the counters instead of half. If so, then using u16
instead of s16 should also reduce the update frequency. However, if you find
it's too complex and the gain is too marginal then I'll ack without it.
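
A sketch of what that special-casing could look like: a u16 diff that only
ever increments and is folded back in full once it crosses the large fixed
threshold. The helper and macro names below are assumptions, not necessarily
what the final patch uses:

static inline void zone_numa_state_add(long x, struct zone *zone,
					enum numa_stat_item item)
{
	atomic_long_add(x, &zone->vm_numa_stat[item]);
	atomic_long_add(x, &vm_numa_stat[item]);
}

void __inc_numa_state(struct zone *zone, enum numa_stat_item item)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	u16 __percpu *p = pcp->vm_numa_stat_diff + item;
	u16 v;

	v = __this_cpu_inc_return(*p);
	if (unlikely(v > NUMA_STATS_THRESHOLD)) {
		/* Fold the whole per-cpu delta back, not just half of it. */
		zone_numa_state_add(v, zone, item);
		__this_cpu_write(*p, 0);
	}
}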

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/2] Separate NUMA statistics from zone statistics
  2017-08-23  1:14     ` kemi
@ 2017-08-23  4:55       ` Dave Hansen
  -1 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2017-08-23  4:55 UTC (permalink / raw)
  To: kemi, Christopher Lameter
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Tim Chen, Linux MM, Linux Kernel

On 08/22/2017 06:14 PM, kemi wrote:
> when performance is not important and when you want all tooling to work, you set:
> 
> 	sysctl vm.strict_stats=1
> 
> but if you can tolerate some possible tool breakage and some decreased
> counter precision, you can do:
> 
> 	sysctl vm.strict_stats=0

My other thought was to try to set vm.strict_stats=0 and move to
vm.strict_stats=1 (and issue a printk) when somebody reads
/proc/zoneinfo (or the other files where the expensive stats are displayed).

We'd need three modes for the expensive stats:

	1. Off by default
	2. On.  (#1 transforms to this by default when stats are read)
	3. Off permanently.  An *actual* tunable that someone could set
	   on systems that want to be able to read the stat files, don't
	   care about precision, and want the best performance.

That way, virtually everybody (who falls into mode #1 or #2) gets what
they want.  The only folks who have to mess with a tunable are the
really weird, picky ones who use option #3.
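
A sketch of how those three modes could fit together; all names here are made
up for illustration:

enum vm_numa_stats_mode {
	NUMA_STATS_AUTO_OFF,	/* 1: off until someone reads the stat files */
	NUMA_STATS_ON,		/* 2: full accounting                        */
	NUMA_STATS_FORCE_OFF,	/* 3: explicitly disabled via the tunable    */
};

static enum vm_numa_stats_mode numa_stats_mode = NUMA_STATS_AUTO_OFF;

/* Called from the /proc/zoneinfo (and similar) read paths. */
static void numa_stats_note_read(void)
{
	if (numa_stats_mode == NUMA_STATS_AUTO_OFF) {
		numa_stats_mode = NUMA_STATS_ON;
		pr_info("vmstat: enabling expensive NUMA stats on first read\n");
	}
}

/* Allocator fast path only pays the accounting cost in mode 2. */
static inline bool numa_stats_enabled(void)
{
	return numa_stats_mode == NUMA_STATS_ON;
}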

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2017-08-23  4:55 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-15  8:45 [PATCH 0/2] Separate NUMA statistics from zone statistics Kemi Wang
2017-08-15  8:45 ` Kemi Wang
2017-08-15  8:45 ` [PATCH 1/2] mm: Change the call sites of numa statistics items Kemi Wang
2017-08-15  8:45   ` Kemi Wang
2017-08-15  9:49   ` Mel Gorman
2017-08-15  9:49     ` Mel Gorman
2017-08-16  2:12     ` kemi
2017-08-16  2:12       ` kemi
2017-08-15  8:45 ` [PATCH 2/2] mm: Update NUMA counter threshold size Kemi Wang
2017-08-15  8:45   ` Kemi Wang
2017-08-15  9:58   ` Mel Gorman
2017-08-15  9:58     ` Mel Gorman
2017-08-15 16:55     ` Tim Chen
2017-08-15 16:55       ` Tim Chen
2017-08-15 17:30       ` Mel Gorman
2017-08-15 17:30         ` Mel Gorman
2017-08-15 17:51         ` Tim Chen
2017-08-15 17:51           ` Tim Chen
2017-08-15 19:05           ` Mel Gorman
2017-08-15 19:05             ` Mel Gorman
2017-08-16  3:02       ` kemi
2017-08-16  3:02         ` kemi
2017-08-16  2:31     ` kemi
2017-08-16  2:31       ` kemi
2017-08-22  3:21     ` kemi
2017-08-22  3:21       ` kemi
2017-08-22  8:39       ` Mel Gorman
2017-08-22  8:39         ` Mel Gorman
2017-08-22  8:53         ` kemi
2017-08-15 10:36 ` [PATCH 0/2] Separate NUMA statistics from zone statistics Jesper Dangaard Brouer
2017-08-15 10:36   ` Jesper Dangaard Brouer
2017-08-16  3:23   ` kemi
2017-08-16  3:23     ` kemi
2017-08-22 21:22 ` Christopher Lameter
2017-08-22 21:22   ` Christopher Lameter
2017-08-22 23:19   ` Andi Kleen
2017-08-22 23:19     ` Andi Kleen
2017-08-23  1:14   ` kemi
2017-08-23  1:14     ` kemi
2017-08-23  4:55     ` Dave Hansen
2017-08-23  4:55       ` Dave Hansen
